I have YaCy and Pi-hole running on the same device now. This may cure the problem of crawling slowing down after a while.

Today I put together a list of forums, some 478 URLs, to test my YaCy with Pi-hole running on the same PC.

I started the crawl of the sites at a depth of 1. This was OK and fast for a short time, then it would slow down to about 50 PPM.
I started checking the logs on YaCy and the query page on the Pi-hole to blacklist problem sites.
There were a number of sites asking for a 10 second crawl delay, so they got blacklisted in one way or another.
I kept blacklisting sites and restarting the crawler many times.
I was also clearing the web cache and the robots cache.
It took me about 2 hours to get the crawl to finish properly.
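
For anyone who would rather not read logs by hand, here is a minimal Python sketch of how sites asking for a long crawl delay could be picked out from their robots.txt. It uses the standard `urllib.robotparser` module. The helper name `slow_hosts`, the `yacybot` user agent string, the 10 second limit and the sample robots.txt contents are all assumptions for illustration; this is not something YaCy or Pi-hole provides.

```python
from urllib.robotparser import RobotFileParser


def slow_hosts(robots_by_host, user_agent="yacybot", limit=10):
    """Return hosts whose robots.txt asks for a crawl delay >= limit seconds.

    robots_by_host maps a host name to the text of its robots.txt
    (hypothetical input; fetching the files is left out of this sketch).
    """
    flagged = []
    for host, text in robots_by_host.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
        if delay is not None and delay >= limit:
            flagged.append(host)
    return sorted(flagged)


# Made-up robots.txt contents, only to show the behaviour.
sample = {
    "slow.example": "User-agent: *\nCrawl-delay: 10\n",
    "fast.example": "User-agent: *\nCrawl-delay: 1\n",
    "none.example": "User-agent: *\nDisallow: /private/\n",
}
print(slow_hosts(sample))
```

The hosts it returns would be candidates for the blacklist; whether to block them is still a judgment call.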

It's been crawling for about 4 hours so far with my internet connection maxed out (see pic).
The CPU in the notebook I'm using is an i5 released 12 years ago, with an HDD.

These are the blocklists I have created and shared, some just today.
See https://pi-hole.net/ for info on installing one.
There is a good forum as well. https://discourse.pi-hole.net/

The Blocklists.
To be entered on the YaCy peer blocklist page.

To be added to the Pi-hole adlists.

These are regex entries; they must be entered one by one in Pi-hole's domain blacklist.

To be added to the Pi-hole adlists.

The hosts file used on the computer.

My YaCy and Pi-hole pic.

You can try the latest version; the changes seem to improve crawl speed.

My repo is updated now at https://github.com/smokingwheels/YaCy

The main source is here: https://github.com/yacy/yacy_search_server

I did an experiment in 2017 with YaCy and a Raspberry Pi 3 B; there is a hosts file listing there.
I found this information on a YaCy search engine I have.
The hosts file listing may help the current project.

See https://forums.raspberrypi.com/viewtopic.php?f=63&t=194208&p=1216233

Meanwhile I added DNS lookup throttling, which reduces the number of requests once there are more than 50 simultaneous requests. I hope this will also protect home routers against request flooding.
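
To illustrate the general idea of this kind of throttling, here is a rough Python sketch. It is not the actual YaCy code (YaCy is written in Java), and the class name `LookupThrottle`, the way the delay grows and the `step` value are assumptions made up for the example. Only the threshold of 50 simultaneous requests comes from the description above.

```python
import threading
import time


class LookupThrottle:
    """Slow down new lookups once too many are already in flight."""

    def __init__(self, threshold=50, step=0.01):
        self.threshold = threshold  # in-flight lookups allowed before throttling
        self.step = step            # extra seconds of delay per lookup over the threshold
        self._in_flight = 0
        self._lock = threading.Lock()

    def delay_for(self, in_flight):
        """Seconds to wait before starting a lookup, given the current load."""
        return max(0, in_flight - self.threshold) * self.step

    def __enter__(self):
        # Count this lookup, then wait outside the lock if the load is high.
        with self._lock:
            self._in_flight += 1
            wait = self.delay_for(self._in_flight)
        if wait:
            time.sleep(wait)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._in_flight -= 1
        return False


throttle = LookupThrottle()
print(throttle.delay_for(40))  # below the threshold, no delay
print(throttle.delay_for(60))  # 10 lookups over the threshold, so some delay
```

Each worker thread would wrap its lookup in `with throttle:` so that the more lookups are pending, the longer a new one waits before it is sent.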


Works well.
The crawler filter works well too.

Here is a better pic.

Found a possible error; it needs dos2unix to fix.

GitHub for Windows changes the line endings, so you need dos2unix.
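
If dos2unix is not at hand, the same conversion can be sketched in a few lines of Python. This is only an approximation of what dos2unix does for plain CRLF files, and the function names and the sample bytes are made up for the example.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite one file in place with LF line endings (path is a placeholder)."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = crlf_to_lf(original)
    if converted != original:  # only touch files that actually change
        with open(path, "wb") as handle:
            handle.write(converted)


print(crlf_to_lf(b"example.com\r\nexample.org\r\n"))
```

Working on bytes rather than text avoids any guessing about the file's encoding.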

Will test soon.

From your Pi-hole screenshot it looks like the throttling is a little bit too strong? Maybe I should reduce it a little bit?

Sure, go ahead, I will test.
Note I have 5 Pi-holes running at the same time.
I get the 150 concurrent queries warning a lot less now.
My old router could only handle 75 queries a second; my new one handles about 100 queries a second.
With the current settings it takes 10 minutes to cache DNS for a list of 480 sites. Is that not a normal sort of crawl?

There is only about 1 mbs left on my connection for DNS when starting a crawl.
Maybe do all the lookups first, then start the crawler?
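
A rough Python sketch of that idea: resolve every host once at a capped rate so the answers are already cached before the crawl starts. The function name `prewarm_dns` and the injectable `resolve` and `sleep` parameters are assumptions for the example; the default of 75 lookups a second is taken from the old router figure mentioned earlier. At one lookup per host, 480 hosts at that rate is about 6.4 seconds of pacing, though a real resolver may send more than one query per host.

```python
import socket
import time


def prewarm_dns(hosts, max_per_second=75, resolve=socket.gethostbyname, sleep=time.sleep):
    """Resolve every host once, paced to at most max_per_second lookups.

    Returns a dict of host -> IP address, or None where the lookup failed.
    """
    interval = 1.0 / max_per_second
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
        sleep(interval)
    return results


# Demo with a stub resolver so the sketch runs without touching the network.
fake_records = {"forum.example": "192.0.2.10"}


def stub_resolve(host):
    if host not in fake_records:
        raise OSError("unknown host")
    return fake_records[host]


print(prewarm_dns(["forum.example", "missing.example"],
                  resolve=stub_resolve, sleep=lambda seconds: None))
```

With the defaults it calls the system resolver, so the lookups pass through whatever DNS server the machine is configured to use.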

Here is my list of what I'm crawling.

If anyone wants to use my lists in their Pi-hole, you are welcome to try them.
You copy the list to the Pi-hole's /var/www/html folder and point the adlist to it, e.g.

You do this for each file on the Pi-hole.

I don’t have all of them enabled. See pics.

There is a good forum if you have any problems. https://discourse.pi-hole.net/

There is an update to Pi-hole; from my early tests it looks like it has improved.

I am load testing 1 before I upgrade the remaining 4.

2 out of 5 Pi-holes have a problem; no time to fix until after the 25th.