Domain list for easier search bootstrapping

tb0hdan · 12 January 2020 23:55

Hey guys,

I’ve been running Yacy on and off for quite some time. Every time I’ve installed
new node I had to search for domains to start crawl with all over again.

After some consideration I’ve decided to upload resulting list to Github: https://github.com/tb0hdan/domains

I hope this will be useful.

Orbiter · 20 January 2020 11:19

Hi tb0hdan,

thats a huge list!

tb0hdan · 22 January 2020 18:50

Great thanks,

It’s far from being high quality but I’m doing my best to have it properly
sorted and updated.

tb0hdan · 4 June 2020 14:32

Update:

TLD kinds: 1507
Country TLDs: 244
Generic TLDs: 1263
Total domains in dataset: 220,011,651

Orbiter · 9 June 2020 14:32

wow! tweeted this…

zooom · 18 August 2020 12:24

Hi Bohdan, good work!

Did you really crawl the stuff yourself? Pretty many of them time out and there is tons of subdomains, esp. in the porn sector.

I downloaded similar stuff from domainlists.io. To filter out the ones with content behind, I use “subbrute” and “massdns” together with a perl script:

while(<>) {
if (/^;; ANSWER/) {
    $_ = <>;    
    if (/([a-z0-9\-]+)\.(\w+)\. (\d+) IN A (.+)/) {
        print "$1.$2\t$4\n";
    }
}

}

called by:

./bin/massdns -r resolvers.txt $1 | perl grep_ipaddrs.pl > ipaddrs.txt

where $1 is the domain list

TheHolm · 24 August 2020 07:02

tb0hdan was scanning for domain names, not web services. So right approach will be to feed list to nmap to find out what servers are responding on port 80. And than feed only them to YaCY,

I just filtered all names staring with “www.” and craw them. Getting pretty good results.

tb0hdan · 30 August 2020 02:51

@zooom Yes, I do. Crawler code itself is opensource - https://github.com/tb0hdan/domains-crawler - just file reader and TLDs used to configure it are not. There are bugs (as always) but I’m working on getting them fixed.

I’ve used additional sources as crawler input to speed up dataset growth, all of them are listed in dataset readme.

Regarding subdomains - there are some limits in place, still I wanted to have those as well to allow for others to have doorway detection. I’m working on autovacuum process that
will filter invalid (i.e. expired) domain names.

Regarding domainlists.io - I strongly believe that domain list should be publicly available and not sold.

@TheHolm Yes, your approach with nmap seems to be the best so far.

vasyugan · 25 October 2020 22:11

What is the recommended way of using the list?

tb0hdan · 3 November 2020 18:31

Hi there, @vasyugan

Pick TLD you like, for this example it would be https://dataset.domainsproject.org/afghanistan/domain2multi-af.txt
Use this command to convert domain list to URL list: cat domain2multi-af.txt| awk '{print "http://"$1}' > /tmp/domain2multi-af.txt.urls
Go to http://127.0.0.1:8090/CrawlStartExpert.html
Pick From File (enter a path within your local file system)
Point to URL list you’ve generated - in this case - /tmp/domain2multi-af.txt.urls
Hit Start New Crawl Job button

Yacy will automatically skip hosts that are not available for crawling.

You can browse dataset here: https://dataset.domainsproject.org/

Orbiter · 8 February 2021 13:06

Here is a funny thing: I once applied to the Deutsche Nationalbibliothek to make a german domain search engine (that was a government request). They did not accept my proposal, but as a reference for a good harvesting start point I submitted a 1095-pages, 477119-domain start list in a pdf (a pdf was requested).

You can download that document here:

… maybe I did not get the job because the list started with “0-24-sex.de, 0-strom.de, 0.bild.poppen.de”??
(do not click on the links)

tb0hdan · 16 February 2021 22:56

Thanks, @Orbiter

Here are relevant commands to extract domain list from that PDF:

pdftotext Top-Level-Domain-Harvesting-DE-Seedlist.pdf
cat Top-Level-Domain-Harvesting-DE-Seedlist.txt|grep '\.de' |sed 's/\, /\n/g'|egrep '^[a-z|0-9](.+)\.de$' > dotde.txt

I’m going to verify them and import into my dataset.

zooom · 22 September 2021 02:33

Hi Obriter
Thx a lot for the list. I do something similar in Switzerland. Made a proposal to implement YaCy which is still pending.

zooom · 22 September 2021 02:36

Would you be available for a project to implement YaCy?
Cheers M

tb0hdan · 28 September 2021 21:18

Update:

TLD kinds: 1522
Country TLDs: 245
Generic TLDs: 1277
Total domains in dataset: 1,789,946,688

Orbiter · 29 September 2021 23:07

yes, maybe

Orbiter · 29 September 2021 23:07

wow, thats huge!

mirazsarker · 13 April 2022 22:43

tb0hdan
bro the last updated list is 7 month ago when u will update new?

mirazsarker · 23 April 2022 07:57

sir i got your dataset but there are losts of hostnames i dont want hostnames i want only domain and subdomains would you help me to extract

JessicaBosworth · 23 May 2022 06:55

Cool, big list.