Crawl an API instead of a website

asennoussi · 12 August 2023 11:56

Hello,
Is it possible to configure YaCy to crawl an API instead of a webpage?
I’m trying to collect data from an API rather than from a webpage, and store it in my own files.

Orbiter · 3 September 2023 09:02

Taking data from an API is not difficult, but we cannot do this in all cases: YaCy requires a unique URL for each document so that search results can be presented as links a user can click on.

This means there are exceptions to “we cannot crawl APIs” whenever the API produces content that can also be accessed over HTTP. The built-in OAI-PMH crawler (library data) is an example of such an exception. The Wikipedia dump reader is another.

Whenever you find an API that produces no HTTP-accessible content, the first step would be to create an HTTP interface for that data, which can then be crawled/loaded with YaCy. This is of course only a solution where you are in control of that API yourself, which is very often the case in an intranet environment that you own or manage yourself.
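
As a rough illustration of such an HTTP interface, here is a minimal sketch in Python using only the standard library. It is not part of YaCy. The `RECORDS` dictionary, the `/record/<id>` URL scheme, and the function names are all hypothetical stand-ins for whatever your own API returns. The idea is only that every record gets its own unique URL, and that an index page links to all of them so a crawler can discover them.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical records standing in for the data your own API returns.
RECORDS = {
    "1": {"title": "First record", "body": "Text of the first record."},
    "2": {"title": "Second record", "body": "Text of the second record."},
}


def render_index(records):
    """Build an HTML page linking to every record so a crawler can find them."""
    links = "\n".join(
        f'<li><a href="/record/{html.escape(rid)}">{html.escape(rec["title"])}</a></li>'
        for rid, rec in sorted(records.items())
    )
    return f"<html><head><title>Records</title></head><body><ul>\n{links}\n</ul></body></html>"


def render_record(rec):
    """Build the HTML page for a single record; one record, one URL."""
    title = html.escape(rec["title"])
    body = html.escape(rec["body"])
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1><p>{body}</p></body></html>"
    )


def route(path, records):
    """Map a request path to a (status code, HTML page) pair."""
    if path == "/":
        return 200, render_index(records)
    if path.startswith("/record/"):
        rec = records.get(path[len("/record/"):])
        if rec is not None:
            return 200, render_record(rec)
    return 404, "<html><body>Not found</body></html>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, page = route(self.path, RECORDS)
        data = page.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="127.0.0.1", port=8000):
    """Start the server; host and port are arbitrary example values."""
    HTTPServer((host, port), Handler).serve_forever()
```

Calling `serve()` starts the server. Assuming the host is reachable from your YaCy instance, the index page at `/` could then be used as the start URL of a crawl, and each record would be indexed under its own `/record/<id>` link.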