Hi all,
I was wondering why it is so hard to explicitly crawl a list of several hundred thousand domains.
The reason: if a domain is unavailable during a crawl, or responds too slowly, it is simply deleted from the index. The "never delete anything" flag seems to be ignored.
When my unbound DNS failed today during crawling, YaCy decided to reduce my index to almost nothing. Weeks of work went out the door. This is not funny.
I administer my domains (start URLs) in an external database and periodically import the URL exports to check which domains have already been crawled and which have not (see the sketch below). A large part of the domains on the list gets lost every other day, which makes crawling a huge list of domains impossible, because you never know what will be deleted next.
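For anyone trying to reproduce or monitor this, here is a minimal sketch of the kind of check I mean. It is not YaCy-specific: it assumes my external domain list is a plain text file ("domains.txt", one domain per line) and that the URL export is a plain-text list with one URL per line ("yacy_export.txt"); both file names are placeholders.

```python
# Minimal sketch: compare an external list of start domains against a
# plain-text URL export to find domains that no longer have any URLs indexed.
# Assumptions: domains.txt = one domain per line; yacy_export.txt = one URL per line.
from urllib.parse import urlparse


def load_domains(path: str) -> set[str]:
    """Read one domain per line, ignoring blanks and comment lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}


def exported_hosts(path: str) -> set[str]:
    """Extract the hostnames that appear in an exported URL list."""
    hosts = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).hostname
            if host:
                hosts.add(host.lower().removeprefix("www."))
    return hosts


if __name__ == "__main__":
    wanted = load_domains("domains.txt")
    present = exported_hosts("yacy_export.txt")
    missing = sorted(d for d in wanted if d.removeprefix("www.") not in present)
    print(f"{len(missing)} of {len(wanted)} domains have no URLs in the export:")
    for domain in missing:
        print(domain)
```

Running this after each export is how I notice that domains keep disappearing from one day to the next.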
Has anyone observed similar behavior? How can I stop this?