lucene-solr-user mailing list archives

From Clemens Wyss DEV <>
Subject Re: determine "big" documents in the index?
Date Fri, 08 May 2015 14:30:26 GMT
One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too many"?
Another field (the "single word suggestion") has 2'156'218 terms.

-----Original Message-----
From: Clemens Wyss DEV []
Sent: Friday, 8 May 2015 15:54
Subject: determine "big" documents in the index?

Context: Solr/Lucene 5.1

Is there a way to determine which documents occupy a lot of "space" in the index? As I don't store
any fields that have text, it must be the terms extracted from the documents that occupy the space.

So my question is: which documents occupy the most space in the inverted index?

I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs
the extracted text is not really text but "binary blobs". In order to verify this (and possibly
omit these PDFs) I hope to get some hints from Solr/Lucene ;)
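
One rough way to check this suspicion outside of Solr would be to look at the extracted text of each PDF before it is indexed and flag documents whose distinct terms are mostly not word-like. The sketch below is only an illustration under assumptions: the extracted text is assumed to be available as plain strings keyed by file name, the whitespace tokenizer is a crude stand-in for whatever analyzer the index really uses, and the names `term_stats`, `rank_suspects` and the 0.5 threshold are hypothetical, not part of Solr, Lucene or Tika.

```python
import re

# Crude stand-ins for a real analyzer: split on whitespace, and call a
# term "word-like" if it consists of 2 to 20 letters only.
TOKEN_RE = re.compile(r"\S+")
WORD_RE = re.compile(r"[^\W\d_]{2,20}")


def term_stats(text):
    """Return token count, distinct term count and word-like ratio for one text."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    distinct = set(tokens)
    wordlike = sum(1 for t in distinct if WORD_RE.fullmatch(t))
    return {
        "tokens": len(tokens),
        "distinct_terms": len(distinct),
        # An empty document is treated as harmless (ratio 1.0).
        "wordlike_ratio": wordlike / len(distinct) if distinct else 1.0,
    }


def rank_suspects(docs, min_wordlike_ratio=0.5):
    """docs maps a document name to its extracted text.

    Returns the names whose word-like ratio is below the (arbitrary)
    threshold, ordered by distinct term count, largest first.
    """
    stats = {name: term_stats(text) for name, text in docs.items()}
    suspects = [n for n, s in stats.items()
                if s["wordlike_ratio"] < min_wordlike_ratio]
    return sorted(suspects,
                  key=lambda n: stats[n]["distinct_terms"],
                  reverse=True)
```

Documents with many distinct terms and a low word-like ratio would be the first candidates to inspect by hand; the threshold would need tuning against real PDFs.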