DNS Loading and Crawling experiment

roamn · 9 June 2023 08:35

Text From Github issue https://github.com/yacy/yacy_search_server/issues/578#
Index now near 9 million I am doing a backup.

I hired a 4 CPU cloud to gather main content of my Yacy search engine using only an Android phone.

The Cloud had no swap space defined and had to keep rebooting the cloud server every 3 - 4 days the whole time it was there.

Later on.
I Transferred 150 GB’s to a Windows PC while I reconnected and fixed as well setup my Linux systems back up.

Yacy’s performance on windows.
It ran in ok but a bit slow but its only on a 5400 rpm notebook HDD drive.

I SSH transferred to Linux on New SSD for testing and I could not get yacy to start.
I changed all the locations in the yacy.init still would not start after running in windows.
I ended up in the end taking a yacy.init from another linux yacy install and overwriting the copied windows one and it started.
I checked the index and it was ok, took some time.

I did an Export from 1.924 and an import to the latest github version and have been testing that on a separate SSD in my old I7 PC.
Note: The DHT WORDS was around 6 million and when I had imported it when back to 1.5 million.

The startYACY.sh contents. After experimenting and reading up on minecraft tweaks on the old JAVA Version.

Added java settings seems to help crawling on 10 year old $5 Mainboard with Ubuntu 22.04 Desktop. Also have found if you close all the montors yacy will tend to crawl faster.
Settings tried
JAVA_ARGS=“-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalPacing -XX:ParallelGCThreads=15 -XX:+AggressiveOpts $JAVA_ARGS”;

On new Ubuntu 22.04 install yacy would not start until it had a few removed.

JAVA_ARGS=“-XX:ParallelGCThreads=15 -XX:+AggressiveOpts $JAVA_ARGS”;

crawling rate set to 30000 NBN 25 Connection 25 mbs connection.
Pretty much saturated the connection while testing. ( peak internet traffic 218 GB in one day )

-XX:ParallelGCThreads=15 I have tried 4 8 10 while crawling now testing 10 after crawler stopped.

My startYACY.sh

#!/usr/bin/env sh
JAVA=“which java”
CONFIGFILE=“DATA/SETTINGS/yacy.conf”
LOGFILE=“yacy.log”
PIDFILE=“yacy.pid”
OS=“uname”

#get javastart args
JAVA_ARGS=“-server -Djava.awt.headless=true -Dfile.encoding=UTF-8”;
JAVA_ARGS=“-XX:ParallelGCThreads=15 -XX:+AggressiveOpts $JAVA_ARGS”;
#JAVA_ARGS=“-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails $JAVA_ARGS”;

Have had DNS issues when crawling have made a new blocklist for my pihole.

From Pihole log a DNS typical lookup.

AAAA	iframe.dacast.com		OK (answered by dns.google#53)	CNAME (8.5ms)	
AAAA	iframe.dacast.com		OK (sent to resolver1.opendns.com#53)	N/A	
A	iframe.dacast.com		OK (answered by resolver1.opendns.com#53)	CNAME (4.2ms)	
A	iframe.dacast.com		OK (sent to resolver1.opendns.com#53)	N/A	
AAAA	www.speldvic.org.au		OK (answered by resolver1.opendns.com#53)	CNAME (5.4ms)	
AAAA	www.speldvic.org.au		OK (sent to resolver1.opendns.com#53)	N/A	
A	www.speldvic.org.au		OK (answered by resolver1.opendns.com#53)	CNAME (6.2ms)	
A	www.speldvic.org.au		OK (sent to resolver1.opendns.com#53)	N/A

Added more primary and secondary known DNS servers in Pihole settings. Seems to improve crawling.

Testing Crawling a site at depths of 5 6 7 10 20.

Have invalid DNS requests when crawler slows down and it crashes every 2 to 3 hours.
Since adding the blocklist I generated from the pihole logs over 5 days crawling lastes 6 to 7 hours.

Types of domains DNS requests from Yacy to the pihole log and my some in blocklist when crawler slows down.
When the initial crawl is started the DNS lookup is done there are 30000 DNS requests in 10 min of crawling.

After new blocklist and settings changed I had 40 000 DNS requests in 10 min.

I have had Various types of invalid DNS requests when crawler slows down in the Pihole shows them.

AAAA	dpird.wa.gov.au?subject=	OK (cache)	NXDOMAIN (0.4ms)	
AAAA	dpird.wa.gov.au?subject=	OK (cache)	NXDOMAIN (1.2ms)	
A	dpird.wa.gov.au?subject=	OK (answered by dns.quad9.net#53)	NXDOMAIN (1.2ms)	
A	dpird.wa.gov.au?subject=	OK (sent to dns.quad9.net#53)	N/A

customer_name.atlassian.net"	
charts.gitlab.io<	
www.mumble.info">mumble<	
ayearofsprings.crd.co">a	
maven.google.com"	
toasters.example",	
www.apple.com"	
apptwo.company.com"	
www.yosoygames.com.ar%5c	
www.example-dsp.com',	
www.example-ssp.com',	
buyer2.com',	
www.example-dsp.com'	
www.another-buyer.com'	
www.some-other-ssp.com',	
risc.example.com",
A	gayle bowen (klgates.com)

Crawling speed set at 30000
Had system loan indicator % up at 1200… 8 is normal for my cpu. hit the reset.
Not sure where the fault is happened several times.
Also there was some debug msg you system does not have enough free memory to debug.

When crawling at maximum speed for some time my 6 GB Swap file filled up…now has 32 GB swap space.
Also there was some debug msg you system does not have enough memory to debug.

When crawling maximum speed crawler ques keeps files open like trying to download the zip and other file contents directly.

Also had socket error once or twice when pushing Yacy.

Tried this

Thread Pool Settings:
Crawler Pool 4000
Robots.txt Pool 4000
httpd Session Pool 4000

With new blocklist the crawler was near Zero but there was 4000 items stuck in Thread Pool.
I checked the Local Crawl list on the crawler monitor page and found a discource forum giving the 4 dot loading like keeping the connection open waiting when connecting in browser. I have struck that with a browser on the Nextcloud forum as well.

Also Upseting the DNS Server causes Crawling to come to a dead stop.

Tried upgrading Yacy 1.924 to latest.

Did the Export XML (Rich and full-text Solr data, one document per line in one large xml file, can be processed with shell tools, can be imported with DATA/SURROGATE/in/)
Lost DHT WORDS.

Ended in lost DHT Words.

New System
YaCy version: yacy_v1.926_202304041204_1c0f50985
Uptime: 0 days 00:34
Java version: 1.8.0_362
Processors: 8
Load: 0.98
Threads: 52/14, peak:107, total:430

Thread Dump just after starting yacy with crawl depth 7.

THREADS WITH STATES: BLOCKED

Thread= qtp261650860-104-acceptor-1@1c987884-httpd:8095@7bf9b098{HTTP/1.1, (http/1.1)}{0.0.0.0:8095} id=104 BLOCKED
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:233)
at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:388)
at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:702)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
at java.lang.Thread.run(Thread.java:750)

Possible web cache issue. Seems to go away when not storing to web cache.

Testing on
Started yacy version 1.924/9000

Crawler set to 400 ppm and left running.
Crawler stopped and yacy was not responding. Happened twice over a few days.

Looked in and deleted…
/DATA/INDEX/freeworld/QUEUES/CrawlerCoreStacks
%20www.ischs.org.au-#XxMjNQ.80

Will continue testing index just hit 8 Million from 6.5 million virtual cloud system.

I tried Dnsmasq server to cache the DNS for several days and found it reduces the peak DNS load by 50%.

I then tried installing the pihole on the local machine but am having DNS issues so a backup and fresh install is in order.

Many Thanks for reading.

okybaca · 10 June 2023 12:57

wow! i’m interested! memory & yacy is a huge pain for me! (every 1M of crawled pages increases the memory consumption for ~1GB, i have to increase the amount of RAM constantly, which cannot be done indefinitely)

what these settings do?

roamn · 12 June 2023 20:37

Not too sure there supposed make a MineCraft server Java Version run smoother.

I seem to have less memory consumed when using the settings.
I also have a DNSMasq Server on each Yacy server, the dnsmasq servers ask my pihole’s for DNS info.

I have lots of hosts forcing crawl-delay everywhere in the log.
I am crawling over 480 sites at a depth of 3 to start with.
So yacy is working well with my blocklists by the looks.

Also Monitor the folder contents. I have been deleting strange folders and seems to help crawling.
/DATA/INDEX/freeworld/QUEUES/CrawlerCoreStacks
Just now I had ****************.com turn up so its gets a hosts entry.
0.0.0.0 ****************.com
0.0.0.0 *.com
0.0.0.0 **.com

Might try reducing the 15 and see what effect it has.

Yacy peer at Idle.

I don’t store to web cache when crawling anymore and this error goes away concurrentlog indexwriter is closed.

I also don’t any idea of what size it should be?
I had set it to 8 GB it just filled up I then tried 16 GB and it also just fills up.

roamn · 18 June 2023 09:07

This is the sort of DNS Loading using Yacy Version 1.924/9000 crawling at a depth of 20 - 30. It’s going thru a DNSMasq server load balancing so its nearly double that It’s up around 450 DNS requests a second.

The latest github version is quite a bit slower.
The Bar on the graph is timed at 10 mins when I started a crawl.

I am testing a 1 CPU 1 GB system using 1.0 in most of the maximum system load settings except cleanup.

System Administration

Advanced Settings
Performance Settings of Busy Queues

image1520×426 64.5 KB

I lowered the default 600 Mb’s to 512 Mb’s
I have always run over half the memory space in the machines up until now from what I have read you only allocate half your ram minus the OS.

I hired a 4 CPU 8 GB droplet and Yacy was crashing every 2 to 5 hours.

The * Performance Settings of Busy Queues.
I changed the maximum system load’s to 4 and set the Memory to 4096 MB’s, also not sure if it has to divisible by 2.
Crawling away Uptime: 0 days 10:35

okybaca · 18 June 2023 23:22

System load time is connected with number of CPUs on your system. If you got 4 CPUs, load of 4.0 means you use all of them at 100%. If you got 1 CPU, load of 4.0 means you overload the cpu 4 times. if you got 8 CPUs, load of 4.0 means you use just 50% of full CPU power.

roamn · 14 March 2024 21:05

I have a pihole and a Yacy setup on the same computer in a few places now.
I have noticed there is a high DNS load on the Ubuntu 22.04 Desktop computer, I have been using Ubuntu desktop for many years.

I'm currently using java-17-amazon-corretto-jdk

I’m crawling a discourse forum at a depth of 5 on both servers over the period. Not from the same IP..

DNS requests on Debian 12 server.
Yacy hits the DNS while starting a crawl then stops eg it caches the DNS.

.

Results local crawling Debian server.

.

DNS requests from Yacy on Ubuntu 22.04 desktop.
It constantly hits the DNS while crawling not too sure whats going on. It has trouble indexing my target site.

.

Results local crawling Ubuntu Desktop.
The site I was looking for is not even on the list like in the Debian 12 server pic.

.
What does your Yacy server run on?
Is there a bug with the Linux Ubuntu desktop OS?