Best documentation to understand Crawling, Indexing and Ranking

Hello,

I am building an online school for digital marketers using only free and open source technologies (true story). I just learnt about YaCy, and I really think it will be a great technology for showing my students what a search engine looks like from the inside. To me, that is also the first step to learning SEO and SEA.

I think I can manage to install it, but I am not sure I could easily find out how YaCy crawls, indexes and ranks websites.

-> Can anyone send me a link to the documentation that she or he considers the best for understanding how that works?

I plan to open this course around next September/October. It will be free and accessible to everyone who has an internet connection.

Cheers,

There is a YaCy Tutorial Youtube Channel which also has a video about crawling:

Thank you very much for your time, I went through the first videos and in no time I was running my first instance of YaCy (and I am not a developer at all).
I think I understand how the crawler works, and the index as well.
Regarding the ranking part, I would just like to double-check something with you.
Is the part that you call “Solr Boosts” where we can control the ranking algorithm?
For example, by default the title tag is set to 5.0. Does this mean that a word found in this tag counts 5 times as much as one found in the author tag?

Once more, thank you very much for your time.

In principle, yes. But under the hood it is not quite that simple. The boost factor multiplies a metric called TF*IDF, where TF is the term frequency, the number of occurrences of matching terms (… ‘sometimes’ divided by the number of all words in the document), and IDF is the inverse document frequency, a factor that gets smaller as the count of documents with matching terms grows.

That means that boost factors have more effect on texts with ‘typically’ fewer words, because each match makes up a larger share of a short text.
This description is still very simplified; details can be found in the Lucene (which is part of Solr) source code documentation at https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
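
To make the idea a bit more concrete, here is a rough Python sketch of a boosted TF*IDF score. It is not YaCy or Solr code. The formulas (square root of the term frequency, logarithmic IDF, division by the square root of the field length) are assumed from the classic Lucene similarity described at the link above, and query normalization and other factors are left out. The function names, the `BOOSTS` table and the author boost of 1.0 are made up for illustration; only the title boost of 5.0 comes from this thread.

```python
import math

# Hypothetical boost table: title 5.0 as mentioned in this thread,
# author 1.0 is an assumed value for illustration only.
BOOSTS = {"title": 5.0, "author": 1.0}


def tf(freq):
    # Term frequency factor: grows with the number of matches, but sub-linearly.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Inverse document frequency: smaller when more documents contain the term.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    # Length normalization: shorter fields give each match more weight.
    return 1.0 / math.sqrt(num_terms) if num_terms else 0.0


def field_score(term, field_text, boost, num_docs, doc_freq):
    # Score of one field: boost * TF * IDF * length normalization.
    words = field_text.lower().split()
    freq = words.count(term.lower())
    if freq == 0:
        return 0.0
    return boost * tf(freq) * idf(num_docs, doc_freq) * length_norm(len(words))


def score(term, doc, num_docs, doc_freq):
    # Total score of a document: sum of the boosted scores of its fields.
    return sum(
        field_score(term, doc.get(field, ""), boost, num_docs, doc_freq)
        for field, boost in BOOSTS.items()
    )


if __name__ == "__main__":
    in_title = {"title": "yacy search", "author": "someone"}
    in_author = {"title": "something else", "author": "yacy search"}
    # Assume an index of 100 documents, 9 of which contain the term.
    print(score("yacy", in_title, 100, 9))
    print(score("yacy", in_author, 100, 9))
```

With two fields of the same length, the match in the title scores about 5 times higher than the match in the author field, which is the "in principle, yes" case. As soon as the field lengths differ, `length_norm` changes that ratio, which is why the boost alone does not tell the whole story.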

Thank you, this is exactly the answer that I needed in order to move forward.
