I’m very fond of the idea of contributing to a public, shared index of crawled websites, and YaCy seems like a very good project. I still have to check all the requirements to install it.
What I would like to know is:
- Using Google’s index as a benchmark, how large do you think YaCy’s share is in comparison?
- On average, do you have an estimate of how many sites are crawled per 1 GB of space occupied by the index?
- How is the vitality of the project, i.e., user acceptance over time?
- Aren’t the web crawlers blocked by the web servers?
I have indexed about half a dozen domains so far, and only one seems to have rejected my crawl. I think it might depend on your settings: go to
System Administration > Performance Settings of Busy Queues
and increase the delay value for Local Crawl to at least 3 seconds; otherwise some sites might block your crawl.
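The blocking usually comes down to politeness rules the server publishes in its robots.txt. As a rough illustration (not YaCy’s actual code), here is how a crawler can check, with Python’s standard library, whether a URL may be fetched and what delay the server requests; the user-agent string and rule lines are made up for the example:

```python
from urllib import robotparser

# Sketch of a polite crawler's robots.txt check (hypothetical rules,
# parsed from literal lines instead of fetched over the network).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 3",
    "Disallow: /private/",
])

allowed = rp.can_fetch("yacybot", "https://example.org/page.html")
blocked = rp.can_fetch("yacybot", "https://example.org/private/secret.html")
delay = rp.crawl_delay("yacybot")  # seconds to wait between requests, or None

print(allowed, blocked, delay)  # → True False 3
```

A crawler that ignores the declared delay and hammers the server is exactly the kind that gets its requests rejected, which is why raising the Local Crawl delay helps.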
I have indexed one website with about 93,000 pages (including about 75,000 external pages, e.g. on Twitter). It generated 16 GB of traffic, and indexing took ~8 hours.
Index size is 4.8 GB; the HTCache is 4.4 GB.
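For the pages-per-GB question above, a rough back-of-envelope from these figures (a single data point, HTCache excluded, so treat it as an order of magnitude at best):

```python
# Back-of-envelope estimate from the numbers reported above.
pages = 93_000      # pages indexed (including external pages)
index_gb = 4.8      # reported index size in GB

pages_per_gb = pages / index_gb
kb_per_page = index_gb * 1024 * 1024 / pages  # GB -> KB, divided per page

print(f"{pages_per_gb:.0f} pages per GB of index")  # → 19375 pages per GB of index
print(f"{kb_per_page:.1f} KB of index per page")    # → 54.1 KB of index per page
```

So on this one crawl, roughly 19,000 pages fit in 1 GB of index, but the ratio will vary a lot with page size and how much text each page carries.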
I’m a beginner with YaCy though.