I’m very fond of contributing to a public, shared index of crawled websites, and YaCy seems like a very good project. I still have to check all the requirements to install it.
What I would like to know is:
Using Google’s index as a benchmark, how large do you estimate YaCy’s share to be?
On average, do you have an estimate of how many sites are crawled per 1 GB of space occupied by the index?
How is the vitality of the project, i.e., its user acceptance over time?
Aren’t the web crawlers blocked by web servers?
I have indexed about half a dozen domains so far, and only one seems to have rejected my crawl. I think it might depend on your settings: go to
System Administration > Performance Settings of Busy Queues
and increase the delay value for Local Crawl to at least 3 seconds; otherwise some sites might block your crawl.
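To see why that delay setting matters, here is a minimal sketch (not YaCy’s actual implementation) of a crawl loop that enforces a minimum pause between requests to the same host; `fetch` is a hypothetical callback standing in for an HTTP GET:

```python
import time

def polite_fetch(urls, delay=3.0, fetch=None):
    """Visit URLs sequentially, waiting at least `delay` seconds
    between requests -- the same effect as raising the Local Crawl
    delay. `fetch` is a hypothetical callback (e.g. an HTTP GET);
    it defaults to a no-op so the sketch is runnable as-is."""
    fetch = fetch or (lambda u: None)
    last = None
    results = []
    for url in urls:
        if last is not None:
            # Sleep only for the remainder of the delay window.
            wait = delay - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        results.append(fetch(url))
    return results
```

Servers that throttle or ban aggressive clients generally tolerate a crawler that spaces its requests out like this.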
I have indexed 1 website with about 93,000 pages (including about 75,000 external pages, e.g. on Twitter). Traffic generated: 16 GB; indexing duration: ~8 hours.
Index size is 4.8 GB; HTCache is 4.4 GB.
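As a rough back-of-the-envelope answer to the "pages per GB" question, using only the figures reported above (one site, so hardly representative):

```python
# Figures reported above for a single crawled website.
pages = 93_000      # pages indexed (including external pages)
index_gb = 4.8      # on-disk index size in GB (HTCache excluded)

pages_per_gb = pages / index_gb
print(round(pages_per_gb))  # ~19,375 pages per GB of index
```

Actual density will vary a lot with page size and how much text each page contains, so treat this as one data point, not a general rule.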
I’m a beginner with YaCy though.