Hi all,
I was wondering why it is so hard to explicitly crawl a list of several hundred thousand domains.
The reason: if a domain is unavailable during a crawl, or responds too slowly, it is simply deleted from the index. The "never delete anything" flag seems to be ignored.
When my unbound DNS failed today during crawling, YaCy decided to reduce my index to almost nothing. Weeks of work went out the door. This is not funny.
I administer my domains (start URLs) in an external database and periodically import the URL exports to check which domains have already been crawled and which have not (see the sketch below). A large part of the domains on the list gets lost every other day, which makes crawling a huge list of domains impossible, because you never know what will be deleted next.
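For anyone trying to reproduce or monitor this, here is a minimal sketch of the kind of check I mean. It is not YaCy-specific: it assumes my external domain list is a plain text file ("domains.txt", one domain per line) and that the URL export is a plain-text list with one URL per line ("yacy_export.txt"); both file names are placeholders.

```python
# Minimal sketch: compare an external list of start domains against a
# plain-text URL export to find domains that no longer have any URLs indexed.
# Assumptions: domains.txt = one domain per line; yacy_export.txt = one URL per line.
from urllib.parse import urlparse


def load_domains(path: str) -> set[str]:
    """Read one domain per line, ignoring blanks and comment lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}


def exported_hosts(path: str) -> set[str]:
    """Extract the hostnames that appear in an exported URL list."""
    hosts = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            host = urlparse(url).hostname
            if host:
                hosts.add(host.lower().removeprefix("www."))
    return hosts


if __name__ == "__main__":
    wanted = load_domains("domains.txt")
    present = exported_hosts("yacy_export.txt")
    missing = sorted(d for d in wanted if d.removeprefix("www.") not in present)
    print(f"{len(missing)} of {len(wanted)} domains have no URLs in the export:")
    for domain in missing:
        print(domain)
```

Running this after each export is how I notice that domains keep disappearing from one day to the next.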
Has anyone observed similar behavior? How can I stop this?