I’m relatively new to YaCy, and would like to know whether it already crawls “politely” or can be configured to, as detailed in these sources:
I want to make sure that I don’t cause the owners of other websites any trouble with my instance.
I followed tweaks from this link:
YaCy’s tendency to be very impolite when it crawls websites is another problem. It does respect a crawl-delay value if a website has set one in its robots.txt. Websites that have no robots.txt, or whose robots.txt does not ask for a limit, get none; YaCy will effectively behave like a denial-of-service attack. A look at source/net/yacy/cora/protocol/ClientIdentification.java reveals some very small limits. These can be increased by changing a few lines:
This file is also where a custom user-agent should be set. YaCy does have a configuration option for it, but that setting isn’t actually used.
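To make the idea of a per-host delay concrete, here is a minimal sketch of the behaviour described above: honour a robots.txt crawl-delay when one is given, and fall back to a default gap when it is not. This is not YaCy's code and not the patch from the link; the class name PoliteDelay, the method names, and the 1000 ms default are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of YaCy: keeps a minimum gap between requests to one host. */
public class PoliteDelay {
    // Assumed fallback gap for hosts that set no crawl-delay; the value is illustrative only.
    static final long DEFAULT_DELAY_MS = 1000;

    // Time (in ms) of the most recent planned request per host.
    private final Map<String, Long> lastAccess = new HashMap<>();

    /** Returns the larger of the robots.txt crawl-delay (null if absent) and the default gap. */
    static long effectiveDelayMs(Long robotsCrawlDelayMs) {
        if (robotsCrawlDelayMs == null) return DEFAULT_DELAY_MS;
        return Math.max(robotsCrawlDelayMs, DEFAULT_DELAY_MS);
    }

    /** Returns how long to wait before the next request to host and records the planned time. */
    synchronized long reserve(String host, Long robotsCrawlDelayMs, long nowMs) {
        long delay = effectiveDelayMs(robotsCrawlDelayMs);
        Long last = lastAccess.get(host);
        // First request to a host goes out immediately; later ones wait out the gap.
        long earliest = (last == null) ? nowMs : Math.max(nowMs, last + delay);
        lastAccess.put(host, earliest);
        return earliest - nowMs;
    }
}
```

A crawler loop would call reserve(host, crawlDelayOrNull, System.currentTimeMillis()) and sleep for the returned number of milliseconds before fetching. The point of the default gap is that a site without a robots.txt still gets throttled instead of being hit as fast as the crawler can go.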
Sorry that I took so long to get back to you. Thank you for informing me of this. I will be sure to try out these changes as soon as possible.