Some initial questions

lib20 · 15 January 2022 00:27

Hi all!

I’m very keen on contributing to a public, shared index of crawled websites, and YaCy seems like a very good project. I still have to check all the requirements to install it.

What I would like to know is:

Using Google’s index as a benchmark, how large do you think YaCy’s share is?
On average, do you have an estimate of how many sites are crawled per 1 GB of space occupied by the index?
How active is the project, i.e., how has user adoption developed over time?
Don’t web servers block the crawlers?

Thanks.

Lumberjack · 9 February 2022 03:54

I have indexed about half a dozen domains so far and only one seems to have rejected my crawl. I think it might depend on your settings. Go to
System Administration > Performance Settings of Busy Queues
and increase the delay value for Local Crawl to at least 3 seconds; otherwise some sites might block your crawl.
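
To illustrate why the delay matters, here is a minimal sketch of a per-request pause in Python. It is not YaCy's code and does not use any YaCy API; the function name `crawl_politely` and its parameters are made up for this example, and the 3-second default simply mirrors the value suggested above.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    Hypothetical helper for illustration only: `fetch` is any callable
    that takes a URL and returns something, and `sleep` can be swapped
    out (for example in tests) so no real waiting happens.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one, so the
            # target server never sees two requests back to back.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With a longer pause the crawl takes more time overall, but each server sees fewer requests per minute, which is what tends to keep a crawler from being blocked.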

Dude · 10 February 2022 01:09

I have indexed 1 website with about 93,000 pages (including about 75,000 external pages, e.g. Twitter). The crawl generated 16 GB of traffic and indexing took about 8 hours.
The index size is 4.8 GB and the HTCache is 4.4 GB.
I’m a beginner with YaCy though.
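
As a rough answer to the pages-per-GB question, the figures above can be turned into ratios. This is a back-of-the-envelope sketch based on a single crawl; it assumes all of the roughly 93,000 pages ended up in the 4.8 GB index, and it counts pages rather than sites. The function names are made up for this example.

```python
def pages_per_gb(pages, index_gb):
    """Rough number of indexed pages per GB of index space."""
    return round(pages / index_gb)


def pages_per_hour(pages, hours):
    """Rough crawl throughput in pages per hour."""
    return round(pages / hours)


# Figures from the crawl described above (assumed to describe the same pages).
PAGES = 93_000
INDEX_GB = 4.8
HOURS = 8

print(pages_per_gb(PAGES, INDEX_GB))  # about 19,375 pages per GB of index
print(pages_per_hour(PAGES, HOURS))   # about 11,625 pages per hour
```

Other crawls will likely give different ratios, since index size depends on page length and on the crawl settings.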