lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: determine "big" documents in the index?
Date Fri, 08 May 2015 16:55:10 GMT
bq: has 30'860'099 terms. Is this "too much"

Depends on how you indexed it. If you used shingles, then maybe, maybe
not, since every distinct word sequence becomes its own term. If you
just do normal text analysis, it's suspicious to say the least. There
are about 300K words in the English language and you have roughly
100X that. So either
1> you have a lot of legitimately unique terms, say part numbers,
SKUs, digits analyzed as text, whatever, or
2> you have a lot of garbage in your input. OCR is notorious for this,
as are binary blobs.

The TermsComponent is your friend here: it lets you get an idea of
what the actual terms are, though it does take a bit of poking around.
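
As a rough illustration of that poking around, here is a minimal Python
sketch that builds a TermsComponent request URL and pairs up the
term/count list Solr returns. It assumes a /terms request handler is
configured for the core, and the base URL and the field name
"phrase_suggest" are hypothetical placeholders. The flat
[term, count, term, count, ...] layout is what Solr's default JSON
writer emits for this component, but verify it against your own
response.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50, lower=None, max_count=None):
    """Build a TermsComponent request URL for a single field.

    base_url is the core URL (hypothetical example:
    http://localhost:8983/solr/mycore); a /terms handler is assumed.
    """
    params = [
        ("terms", "true"),
        ("terms.fl", field),
        ("terms.limit", str(limit)),
        ("terms.sort", "index"),  # walk terms in index order, not by frequency
        ("wt", "json"),
    ]
    if lower is not None:
        # Resume after the last term seen, to page through the term list.
        params.append(("terms.lower", lower))
        params.append(("terms.lower.incl", "false"))
    if max_count is not None:
        # Only terms that occur in at most this many documents.
        params.append(("terms.maxcount", str(max_count)))
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def pair_terms(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))
```

Setting max_count=1 restricts the listing to terms that occur in a
single document, which is often where OCR noise and blob fragments
show up.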

There's no good way I know of to find out which docs are taking up
space in the index. What I'd probably do is use Tika in a SolrJ client
and look at the data as I sent it. Here's a place to start:
https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
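
Once the extracted text is in hand, one way to "look at the data" is
to score each document before sending it. The sketch below is an
assumption, not something from the linked post: it takes text that has
already been extracted (by Tika or anything else) and reports the
fraction of whitespace-separated tokens that do not look like words.
The 0.7 letter ratio, the 30-character length cap and the vowel test
are arbitrary starting values, and plain numbers count as non-words
here, so the cutoff needs tuning on real documents.

```python
import re

TOKEN = re.compile(r"\S+")
VOWELS = set("aeiouyäöüAEIOUYÄÖÜ")


def looks_like_word(token, max_len=30):
    """Crude heuristic: mostly letters, not absurdly long, has a vowel."""
    letters = [c for c in token if c.isalpha()]
    if not letters or len(token) > max_len:
        return False
    if len(letters) / len(token) < 0.7:
        return False
    return any(c in VOWELS for c in letters)


def garbage_ratio(text):
    """Return the fraction of tokens in text that fail looks_like_word."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return 0.0
    bad = sum(1 for t in tokens if not looks_like_word(t))
    return bad / len(tokens)
```

Documents with a ratio near 1.0 are candidates for manual inspection
or for skipping at index time.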

Best,
Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <clemensdev@mysign.ch> wrote:
> One of my fields (the "phrase suggestion" field) has 30'860'099 terms.
> Is this "too much"?
> Another field (the "single word suggestion") has 2'156'218 terms.
>
>
>
> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, 8 May 2015 15:54
> To: solr-user@lucene.apache.org
> Subject: determine "big" documents in the index?
>
> Context: Solr/Lucene 5.1
>
> Is there a way to determine which documents occupy a lot of "space" in
> the index? As I don't store any fields that have text, it must be the
> terms extracted from the documents that occupy the space.
>
> So my question is: which documents occupy the most space in the inverted index?
>
> Context:
> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect
> that for some PDFs the extracted text is not really text but "binary
> blobs". In order to verify this (and possibly omit these PDFs) I hope to
> get some hints from Solr/Lucene ;)
