(Newbie help) Want to index only specific websites

I am sorry to bother you. (I do know German a bit btw.)

I want a “search engine” for the bookmarks/links I feed it. The way I imagine it:

Me: I randomly came across a helpful website/article. Here, my search engine, eat it: https://example.com (Ideally through a browser extension)

Search engine: Ok, now forget about it.

Me some time later: I need an example domain, but Google is giving me random garbage. Yo, my search engine, do you remember me bookmarking anything like that before?

Search engine: Of course I do, how about https://example.com?

Me: Exactly what I wanted, thanks

Is it possible to build this setup with YaCy? (I would prefer to shut down the whole peer-to-peer side of it and prevent websites other than the ones I specified from finding their way in.)

Currently I am storing all the helpful stuff I find in a Notion database (bully me for that), but I was wondering if there is a better solution. I found archivebox[dot]io, but I question its indexing abilities. So what about YaCy? Or do you have a better recommendation for this case?

I am a total noob when it comes to self-hosting anything.

Thanks, and I wish you a good day

In my setup, I use the bookmarks exported from my browser as a feed for the YaCy indexer. That way, everything I bookmarked gets indexed.

In this setup, I use a Crawl Depth of 1 (only the links I bookmarked are indexed), but you can use a depth of 2, and then all the links from the bookmarked pages are indexed as well.
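
If you want to look at (or trim) the list of URLs before handing it to the crawler, here is a minimal Python sketch that pulls the links out of a bookmarks export. It assumes the export is the usual HTML bookmarks file browsers write (links as `<A HREF="...">` entries). The names `BookmarkLinkParser` and `extract_urls` are made up for this example and are not part of YaCy; YaCy may well accept the exported file directly as a crawl start, so treat this as an optional helper only.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects the href of every <a> tag in a bookmarks export (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only web links (skips javascript:, place:, file: entries) and drop duplicates.
        if href and href.startswith(("http://", "https://")) and href not in self.urls:
            self.urls.append(href)


def extract_urls(bookmarks_html):
    """Return the unique http(s) URLs from a bookmarks export, in file order."""
    parser = BookmarkLinkParser()
    parser.feed(bookmarks_html)
    parser.close()
    return parser.urls
```

For example, reading the export with `open("bookmarks.html", encoding="utf-8").read()` (the file name is an assumption) and passing the text to `extract_urls` gives a plain list of URLs, one per bookmark, that you can save to a text file.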

When I know I’ll be interested in a site in general, I just crawl the whole domain. This can take a long time in some cases (a small blog averages several thousand pages, a small magazine is around 20,000 to 100,000, and the New York Times is around 15,000,000 with the whole archive).

There is also a “Heuristic” function for search results (Search Portal Integration > Ranking and Heuristic > Heuristic > shallow crawl on all displayed search results = yes). With this setting, after a search is run and the results are shown, all links from the resulting pages are crawled.
I personally don’t use this function because of the garbage it pulls in and the performance cost, but it could be useful in your case.

Yeah, that’s definitely possible: First Steps > Use Case & Account > Search portal for your own pages.