Augmenting YaCy with Apache Tika?


I’ve been running YaCy in a few modes, public and intranet, for a year or so. It’s been great as an indexing/search server for public content.
I’d like YaCy to index the contents of documents such as DOCX, PDF, JPG, etc., basically everything Apache Tika handles.
Is there a way to include Apache Tika for unknown file types?

I see lots of errors like

Podcast.csv' file extension is not supported and indexing of linked non-parsable documents is disabled. 

The content of the CSV is pure text, so it should be easy to parse, index, categorize, and make searchable.
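
To illustrate why a CSV looks like an easy case, here is a minimal Python sketch of flattening CSV text into a plain string that a full-text indexer could consume. This is not YaCy's parser and not how YaCy handles the file; the function name `csv_to_index_text` and the sample data are hypothetical, made up for illustration.

```python
import csv
import io


def csv_to_index_text(raw: str) -> str:
    """Flatten CSV text into one space-separated string of non-empty cells.

    Hypothetical helper for illustration only; not part of YaCy or Tika.
    """
    reader = csv.reader(io.StringIO(raw))
    cells = []
    for row in reader:
        for cell in row:
            cell = cell.strip()
            if cell:  # skip empty cells so they add no noise to the index
                cells.append(cell)
    return " ".join(cells)


if __name__ == "__main__":
    # Made-up sample data, standing in for a file like the one in the error.
    sample = "title,episode\nIntro to YaCy,1\n"
    print(csv_to_index_text(sample))
```

Using the standard `csv` module rather than splitting on commas means quoted fields containing commas stay in one cell.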

I experimented with the Open Semantic Search project, but it’s not quite the same.

Thoughts? Pointers/advice?


CSV files should be indexable; YaCy should be able to index that. Do you have a link to that file?

About Apache Tika: as far as I know, we cover all the parsers they have, using the same library that Tika is using. The main difference between the Tika and YaCy parsers is that YaCy generates a much richer metadata profile than Tika, so using Tika is not an option at all.