Augmenting YaCy with Apache Tika?

deimos · 24 June 2020 04:08

Hi,

I’ve been running YaCy in a few modes, public and intranet, for a year or so. It’s been great as an indexing/search server for public content.
I’d like to have YaCy index the contents of documents such as DOCX, PDF, JPG, etc., basically everything Apache Tika handles.
Is there a way to include Apache Tika for unknown file types?

I see lots of errors like:

Podcast.csv' file extension is not supported and indexing of linked non-parsable documents is disabled.

The content of the CSV is pure text, so it should be easy to parse, index, categorize, and make searchable.
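
To illustrate what I mean by "easy to parse", here is a minimal sketch in Python. It is not YaCy code and says nothing about how YaCy's parsers work; the function name `csv_to_index_text` and the sample data are made up for illustration. It only shows that a CSV file can be flattened into plain text that an indexer could tokenize.

```python
import csv
import io


def csv_to_index_text(raw: str) -> str:
    """Flatten CSV text into one space-separated string of non-empty cells.

    Hypothetical helper for illustration only; not part of YaCy or Tika.
    """
    reader = csv.reader(io.StringIO(raw))
    cells = []
    for row in reader:
        for cell in row:
            cell = cell.strip()
            if cell:  # skip empty cells so they add no noise to the text
                cells.append(cell)
    return " ".join(cells)


if __name__ == "__main__":
    # Made-up sample data standing in for a file like Podcast.csv
    sample = "title,duration\nEpisode 1,12:30\nEpisode 2,45:10\n"
    print(csv_to_index_text(sample))
```

A real parser would probably also keep the header row as field names rather than flattening everything, but even this flattened text would be searchable.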

I experimented with the Open Semantic Search project, https://www.opensemanticsearch.org/, but it’s not quite the same.

Thoughts? Pointers/advice?

Thanks!

Orbiter · 7 November 2020 23:18

CSV files should be indexable; YaCy should be able to index them. Do you have a link to that file?

About Apache Tika: afaik we cover all the parsers they have, using the same library that Tika is using. The main difference between the Tika and YaCy parsers is that YaCy generates a much richer metadata profile than Tika, so using Tika is not an option at all.