lucene-solr-user mailing list archives

From Clemens Wyss DEV <>
Subject AW: determine "big" documents in the index?
Date Sat, 09 May 2015 07:11:26 GMT
> If you used shingles
I do:
    <fieldType class="solr.TextField" name="suggest_phrase" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

This is more or less what I do
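
To illustrate why shingles inflate the term count, here is a rough sketch of what a shingle filter with maxShingleSize="3" and outputUnigrams="true" emits for a short token stream. It is plain Python, not Solr code; the function name `shingles` and the sample sentence are made up for illustration, and details such as position increments are ignored.

```python
def shingles(tokens, max_size=3, output_unigrams=True):
    """Roughly mimic the terms a shingle filter emits for a token list."""
    out = []
    # Shingles start at size 2; unigrams are added only when requested.
    min_size = 1 if output_unigrams else 2
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    words = "please divide this sentence".split()
    for term in shingles(words):
        print(term)
```

Four input tokens already yield nine terms, and since most two- and three-word sequences are unique across a corpus, the number of distinct terms in such a field can grow far beyond the plain vocabulary size.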

>2> you have a lot of garbage in your input.
>OCR is notorious for this, as are binary blobs.
What does the AutoDetectParser return in the case of an OCR PDF? Can I "detect"/omit an OCR PDF?

-----Original Message-----
From: Erick Erickson [] 
Sent: Friday, 8 May 2015 18:55
Subject: Re: determine "big" documents in the index?

bq: has 30'860'099 terms. Is this "too much"

Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do
normal text analysis, it's suspicious to say the least. There are about 300K words in the
English language and you have 100X that. So either
1> you have a lot of legitimately unique terms, say part numbers,
SKUs, etc. digits analyzed as text, whatever.
2> you have a lot of garbage in your input. OCR is notorious for this,
as are binary blobs.

The TermsComponent is your friend, it'll allow you to get an idea of what the actual terms
are, it does take a bit of poking around though.
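
A minimal sketch of that kind of poking around, assuming a `/terms` request handler is configured in solrconfig.xml. The host, port and core name (`localhost:8983`, `mycore`) are placeholders, and the sample response body in the usage part is invented purely to show the shape of the JSON.

```python
import json
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50, sort="count", prefix=None):
    """Build a TermsComponent request URL for one field."""
    params = {
        "terms": "true",
        "terms.fl": field,       # field whose terms are listed
        "terms.limit": limit,    # how many terms to return
        "terms.sort": sort,      # "count" or "index"
        "wt": "json",
    }
    if prefix:
        params["terms.prefix"] = prefix
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def parse_terms(body, field):
    """Turn the flat [term, count, term, count, ...] list into pairs."""
    flat = json.loads(body)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Placeholder core URL; fetch it with urllib.request.urlopen or curl.
    print(terms_url("http://localhost:8983/solr/mycore", "suggest_phrase"))
    # Invented response body, only to demonstrate the parsing.
    sample = '{"terms": {"suggest_phrase": ["foo bar", 12, "xq zz", 1]}}'
    print(parse_terms(sample, "suggest_phrase"))
```

Sorting by index order or restricting by prefix and paging through the output is one way to spot runs of terms that are obviously not words.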

There's no good way I know of to know which docs are taking up space in the index. What I'd
probably do is use Tika in a SolrJ client and look at the data as I sent it, here's a place
to start:
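
Separately from that pointer, the "look at the data as I sent it" idea could be sketched roughly as below. This is not Tika or SolrJ code: it assumes the text has already been extracted and is available as a plain string, and the function names, the "word-like" pattern and the 0.6 threshold are arbitrary assumptions that would need tuning against real documents.

```python
import re

# Assumption: a "word-like" token is 2 to 25 letters and nothing else.
WORD = re.compile(r"^[^\W\d_]{2,25}$")
PUNCT = ".,;:!?()\"'"


def text_stats(text):
    """Count tokens, unique tokens and the share of word-like tokens."""
    tokens = text.split()
    if not tokens:
        return {"tokens": 0, "unique": 0, "wordlike": 0.0}
    wordlike = sum(1 for t in tokens if WORD.match(t.strip(PUNCT)))
    return {
        "tokens": len(tokens),
        "unique": len(set(tokens)),
        "wordlike": wordlike / len(tokens),
    }


def looks_like_garbage(text, min_wordlike=0.6):
    """Flag text whose share of word-like tokens is below the threshold."""
    stats = text_stats(text)
    return stats["tokens"] > 0 and stats["wordlike"] < min_wordlike


if __name__ == "__main__":
    print(looks_like_garbage("The quick brown fox jumps over the lazy dog."))
    print(looks_like_garbage("x9$ %%7q 0xFF3A @@ ~~k1 j#2"))
```

Logging these numbers per document before it is sent to Solr would at least show which files produce a suspicious number of unique, non-word tokens.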


On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <> wrote:
> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
> Another field (the "single word suggestion") has 2'156'218 terms.
> -----Original Message-----
> From: Clemens Wyss DEV []
> Sent: Friday, 8 May 2015 15:54
> To:
> Subject: determine "big" documents in the index?
> Context: Solr/Lucene 5.1
> Is there a way to determine which documents occupy a lot of "space" in the index? As I don't
> store any fields that have text, it must be the terms extracted from the documents occupying
> the space.
> So my question is: which documents occupy the most space in the inverted index?
> Context:
> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect
> that for some PDFs the extracted text is not really text but "binary
> blobs". In order to verify this (and possibly omit these PDFs) I hope
> to get some hints from Solr/Lucene ;)