Prioritization of local versus remote search and speed of crawling


I’m a new YaCy user and I have two questions about using it:

  1. I’ve crawled and indexed a few sites I’m most interested in, but when I search over them, the peer-to-peer search sometimes completely ignores the local index (local peer 0) and returns seemingly random results that are usually irrelevant to the original query string. Is there any way to make the search always consult the local peer as well?

  2. The internet sites I’m most interested in are quite huge, so the question is how to make the crawler tolerably fast while keeping the crawl feasible. I guess that at a speed of tens of pages per minute one can’t reasonably index the whole of Wikipedia (as an example). When I increase the crawler speed I may end up at around 1-2k pages per minute, but then the question is whether some web server will kick me out as a robot. Is the crawler able to detect being kicked out and then limit its speed for that particular server? I’m also curious whether there is a way to use remote peers for faster crawling. I’ve enabled remote crawling on my side, but so far I have not figured out how to initiate a remote crawl. Also, does “remote crawling” mean distributed crawling? Last question: does the crawler rotate user agent strings to better “confuse” the web server?
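
To make the "detect kick out and limit speed to that server" part concrete, here is a minimal sketch of what I have in mind. This is not YaCy code and I don't know whether YaCy does anything like it; the `HostThrottle` class, its method names and the default delays are all made up for illustration. It assumes the server signals overload with HTTP 429 or 503, which not every server does (some just return 403 or drop the connection).

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical per-host delay that grows when a server signals overload."""

    def __init__(self, base_delay=0.5, max_delay=60.0, clock=time.monotonic):
        self.base_delay = base_delay  # assumed minimum pause between requests
        self.max_delay = max_delay    # assumed upper bound for the pause
        self.clock = clock            # injectable so the logic can be tested
        self.delay = {}               # host -> current delay in seconds
        self.next_ok = {}             # host -> earliest time of next request

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's host."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_ok.get(host, 0.0) - self.clock())

    def record(self, url, status):
        """Update the delay for the host based on the HTTP status code."""
        host = urlsplit(url).netloc
        current = self.delay.get(host, self.base_delay)
        if status in (429, 503):
            # Server asked us to slow down: double the delay, up to the cap.
            current = min(current * 2, self.max_delay)
        else:
            # Request went through: relax slowly back towards the base delay.
            current = max(current * 0.9, self.base_delay)
        self.delay[host] = current
        self.next_ok[host] = self.clock() + current
        return current
```

A crawler loop would sleep for `wait_time(url)` before each fetch and call `record(url, status)` afterwards, so that only the host that complained gets slowed down while other hosts keep their normal pace.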


Ad 1) This issue is actually well resolved in the pre-releases. I’m testing 1.921/9828. The fix is not switched on by default, so one needs to go to Portal Configuration -> Remote results resorting and switch from “On demand, server-side” to “Automated, with JavaScript in the browser for authenticated users only”. This way, even if my local peer is slower than the remote ones, its more relevant results still get re-sorted to the top, as expected.