I am sharing this for any new users curious about disk space requirements. I've found that, with media caching enabled (the default), an index of 500,000 documents consumes about 6 GB; 1,000,000 documents take roughly 12 GB, and so on. If your findings differ from mine, please share your crawl settings.
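For anyone who wants to extrapolate, here is a quick back-of-the-envelope sketch in Python. The ~12 KB/document constant is just what my own crawls show (media caching on), so treat it as an assumption rather than an official YaCy figure.

```python
# Back-of-the-envelope disk-space estimator for a YaCy index.
# BYTES_PER_DOC is my own observation with media caching enabled;
# it is an assumption, not an official YaCy figure.

BYTES_PER_DOC = 12_000  # assumption: ~6 GB per 500,000 documents


def estimate_index_size_gb(num_documents: int) -> float:
    """Estimate index disk usage in GB, assuming linear scaling."""
    return num_documents * BYTES_PER_DOC / 1e9


for docs in (500_000, 1_000_000, 5_000_000):
    print(f"{docs:>9,} documents -> ~{estimate_index_size_gb(docs):.0f} GB")
```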
My results, per System Status: 1.4 million documents, 70 GB of disk space used.
I'm crawling news websites exclusively; I'm not sure how much that influences index size. Unfortunately, I have included media in my crawls.
I notice from other posts that you seem to have an interest in index size. Is there any reference material on how the content types indexed affect index size?
I have learned, after running YaCy for some time, that once you enable Senior peer functionality, your peer will accept links (RWI entries, I believe) from the P2P network to include in local search results. These take far less space than pages your own crawler caches locally, so that has a big impact on how much storage gets used; a rough comparison is sketched below.
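To make the difference concrete, here is a hedged comparison under stated assumptions. The ~50 KB/document figure for locally crawled and cached pages roughly matches the 1.4 million documents / 70 GB report earlier in this thread; the per-RWI-entry size is a pure placeholder, not a measured YaCy value.

```python
# Hedged comparison: storage cost of locally crawled-and-cached pages vs.
# RWI references received from the P2P network. Neither constant is a
# measured YaCy value: CACHED_PAGE_BYTES roughly matches the 1.4M-docs/70GB
# report above, and RWI_ENTRY_BYTES is a placeholder guess.

CACHED_PAGE_BYTES = 50_000  # assumption: avg cached page plus index metadata
RWI_ENTRY_BYTES = 500       # assumption: a remote word-index reference is tiny


def storage_gb(entries: int, bytes_per_entry: int) -> float:
    """Total storage in GB for a given number of entries."""
    return entries * bytes_per_entry / 1e9


n = 1_000_000
print(f"Locally crawled and cached: ~{storage_gb(n, CACHED_PAGE_BYTES):.1f} GB")
print(f"RWI references only:        ~{storage_gb(n, RWI_ENTRY_BYTES):.2f} GB")
```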