Information/String Crawling

ChristianPyth · 17 December 2020 08:43

Hello,

I'm looking to create a searchable database for certain keywords, but I'm having a hard time configuring the crawler to look for specific keywords. Is there a tutorial or tip on how to crawl for certain strings/words/names from a given site?

zooom · 18 December 2020 07:04

Crawling means scanning the page content of any URL. You must tell the crawler where to start.
Maybe you mean “scraping” content from search results when searching for a keyword in a search engine. For that you need an already existing search engine (with an index behind it) to search for your keywords. To automate this you can use, e.g., the YaCy JSON interface:

http://localhost:8090/yacysearch.json?query=www
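
As a rough illustration of automating the query above, here is a minimal Python sketch. It assumes a YaCy peer is running on localhost:8090 and that the JSON response has a top-level "channels" list whose entries carry an "items" list with a "link" field; check the actual response of your own peer before relying on that layout. The function names (build_search_url, extract_links, search) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the post above; adjust host/port for your own peer.
BASE_URL = "http://localhost:8090/yacysearch.json"


def build_search_url(keyword, base=BASE_URL):
    """Return the search URL for one keyword, with the query URL-encoded."""
    return base + "?" + urlencode({"query": keyword})


def extract_links(payload):
    """Collect result links from a parsed response.

    Assumption: results live under payload["channels"][i]["items"][j]["link"].
    Missing keys are skipped rather than raising an error.
    """
    links = []
    for channel in payload.get("channels", []):
        for item in channel.get("items", []):
            link = item.get("link")
            if link:
                links.append(link)
    return links


def search(keyword):
    """Query the local peer and return the list of result links."""
    with urlopen(build_search_url(keyword), timeout=10) as resp:
        return extract_links(json.load(resp))
```

Calling search("www") would then return the links your peer already has indexed for that term; it does not start a new crawl.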

Tom_Booth · 25 December 2020 01:45

What is the purpose? What key words?

It theoretically could be quite useful IMO, if a web crawler could be configured in such a way that it could be turned loose to crawl the internet but only index those pages which contain some particular key word or phrase, in order to build a specialized database of one sort or another.

In other words, rather than filling the local index with irrelevant data and “junk” websites, the crawler could look at the content, then either index the page or purge the data and move on, depending on whether or not the page met certain criteria, such as whether it mentions the word YaCy, or “Quantum physics”, or photovoltaic, or Chattanooga.

Makes a lot of sense to me.

Of course one person’s junk is another’s treasure as they say.
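
To make the idea concrete, the index-or-discard decision could be sketched in Python roughly like this. This is not an existing YaCy feature or API, just a hypothetical filter that a custom crawler could call on each fetched page; the names build_matcher and should_index are invented for the example, and the keyword list is the one from the post above.

```python
import re

# Example criteria from the post above.
KEYWORDS = ["YaCy", "Quantum physics", "photovoltaic", "Chattanooga"]


def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word or phrase."""
    pattern = "|".join(r"\b" + re.escape(k) + r"\b" for k in keywords)
    return re.compile(pattern, re.IGNORECASE)


def should_index(page_text, matcher):
    """Return True if the page text mentions at least one keyword."""
    return matcher.search(page_text) is not None


matcher = build_matcher(KEYWORDS)

# A crawler loop would keep the page when this is True and drop it otherwise.
print(should_index("A lecture on quantum physics held in Chattanooga", matcher))
print(should_index("Ten easy pasta recipes", matcher))
```

Note that whole-word matching is a design choice here: it avoids hits inside longer words, but it also means a plural such as "photovoltaics" would not match unless it is added to the list.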

zooom · 26 December 2020 16:04

Remember that the “Web” is a fairy tale told by G*** and the like to promote the “famous” Larry P. algorithm.

There was a time when “banner exchanges” were used to link unrelated pages together. Today we have billions of little islands of websites which form little webs by themselves, but are mostly unrelated to each other. This is why G*** now tracks all clicks - and offers public DNS - to find out where the content is (what users are searching for, or are entering into the URL line).