Happy New Year, all,
First off, I just want to say YaCy seems to be a great tool: it is easy to set up and gives quick, useful results with the default configuration.
However, I'm seeing one issue: the crawler eventually works its way to mirror sites for Linux ISOs. Based on traffic analysis, it seems to download the full ISO from each mirror site (each ISO file is almost 1 GB). This quickly saturates my internet connection, and the crawler's PPM drops significantly within about 10 minutes of starting the crawl.
This is a tremendous waste of bandwidth, but I can't find a way to get the crawler to abandon the large downloads and move on to other content. Is there a way to filter out ISO files (and large archives such as zip, tar.gz, etc.)?
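To make the question concrete, here is a rough sketch of the kind of URL filter I have in mind. It assumes the crawler's URL exclusion filter accepts a regular expression that is matched against the whole URL, which I have not confirmed. The extension list, the `LARGE_FILE_PATTERN` name and the `is_large_download` helper are my own placeholders for testing the pattern outside YaCy, not anything taken from YaCy itself.

```python
import re

# Hypothetical exclusion pattern (an assumption, not a YaCy default):
# matches any URL ending in a disk-image or archive extension,
# optionally followed by a query string.
LARGE_FILE_PATTERN = re.compile(
    r".*\.(iso|img|zip|7z|rar|tar|tgz|gz|bz2|xz)(\?.*)?",
    re.IGNORECASE,
)


def is_large_download(url: str) -> bool:
    """Return True if the whole URL matches the exclusion pattern."""
    return LARGE_FILE_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    samples = [
        "https://mirror.example.org/iso/archlinux-x86_64.iso",
        "http://example.org/pkg.tar.gz",
        "https://wiki.archlinux.org/title/Rsync",
    ]
    for url in samples:
        print(url, is_large_download(url))
```

If the filter works the way I'm assuming, the string inside `re.compile(...)` is what I would paste into the exclusion field. Matching on the extension only catches files that advertise their type in the URL, so a limit on download size would still be useful if one exists.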
Here’s my current setup:
I'm running a git clone from a few days ago (dated 12/30/22). The advanced crawler is started with all defaults (crawl depth = 3, etc.), and the starting URL is the "rsync - ArchWiki" page.