lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: determine "big" documents in the index?
Date Sat, 09 May 2015 17:22:14 GMT
1> Right, with shingles (and you've set the max size to 3) there are a
bazillion possibilities, so the sky's the limit. It's usually smaller
than that since some patterns of words aren't very likely, but it's
still a big number. I'd really take a look at the terms that are
actually indexed with the TermsComponent or similar. Or perhaps run a
test where you _don't_ shingle and see what the cardinality of the
field is. If a large portion of your terms are garbage, it should be
pretty obvious.
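To see why the term count explodes, here's a rough sketch (plain Java,
not Lucene's actual ShingleFilter) of what maxShingleSize=3 with
outputUnigrams=true emits for a short sentence; every word position
spawns up to three terms:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: emit every unigram, bigram and trigram starting
// at each token position, roughly what ShingleFilterFactory with
// maxShingleSize="3" outputUnigrams="true" produces.
public class ShingleSketch {
    public static List<String> shingles(String[] tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder(tokens[start]);
            out.add(sb.toString());                      // unigram
            for (int len = 2; len <= maxSize && start + len <= tokens.length; len++) {
                sb.append(' ').append(tokens[start + len - 1]);
                out.add(sb.toString());                  // bigram, trigram
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox".split(" ");
        // 4 tokens -> 4 unigrams + 3 bigrams + 2 trigrams = 9 terms
        System.out.println(shingles(tokens, 3));
    }
}
```

So a field that would have N distinct unshingled terms can end up with
nearly 3N distinct shingled terms per document, and far more across a
corpus since the combinations rarely repeat.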

2> No way that I know of to tell Tika "don't return suspicious stuff",
and I'm not up enough on the internals of Tika to say much. Perhaps
ask the Tika folks directly?

I was thinking of using Tika in a client-side SolrJ program. Once the
parsing is done, you can get all the text Tika thinks is valid and
examine it to see whether it's "real". You might get some good results
from simply checking whether each word returned is longer than some
arbitrary length, or whether its codepoints are outside the range you
expect, etc. Anything you do will be imperfect, unfortunately.
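A minimal sketch of that kind of check; the length cap and the
codepoint range here are arbitrary assumptions you'd tune for your own
corpus:

```java
import java.util.Arrays;

// Heuristic screen for extracted text before sending it to Solr.
// Assumptions (tune these): real words are <= 25 chars, and the
// corpus is mostly basic Latin / Latin-1 (codepoints <= 0x00FF).
public class GarbageFilter {
    static final int MAX_WORD_LENGTH = 25;

    public static boolean looksSuspicious(String word) {
        if (word.length() > MAX_WORD_LENGTH) return true;
        // Flag words with codepoints outside the expected range.
        return word.codePoints().anyMatch(cp -> cp > 0x00FF);
    }

    public static double suspiciousRatio(String text) {
        String[] words = text.split("\\s+");
        long bad = Arrays.stream(words)
                         .filter(GarbageFilter::looksSuspicious)
                         .count();
        return words.length == 0 ? 0.0 : (double) bad / words.length;
    }

    public static void main(String[] args) {
        String ok = "plain extracted sentence from a pdf";
        String junk = "aVeryLongRunOfCharactersThatIsNotARealWordAtAll ok";
        System.out.println(suspiciousRatio(ok) + " " + suspiciousRatio(junk));
    }
}
```

If the ratio for a document is high, you could skip indexing it (or at
least log it for inspection) before it ever reaches Solr.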


On Sat, May 9, 2015 at 12:11 AM, Clemens Wyss DEV <> wrote:
>> If you used shingles
> I do:
>     <fieldType class="solr.TextField" name="suggest_phrase" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
>       </analyzer>
>     </fieldType>
> This is more or less what I do
>>2> you have a lot of garbage in your input.
>>OCR is notorious for this,as are binary blobs.
> What does the AutodetectParser return in the case of an OCR PDF? Can I "detect"/omit such an OCR PDF?
> -----Original Message-----
> From: Erick Erickson []
> Sent: Friday, 8 May 2015 18:55
> To:
> Subject: Re: determine "big" documents in the index?
> bq: has 30'860'099 terms. Is this "too much"
> Depends on how you indexed it. If you used shingles, then maybe, maybe
> not. If you just do normal text analysis, it's suspicious to say the
> least. There are about 300K words in the English language and you have
> 100X that. So either
> 1> you have a lot of legitimately unique terms, say part numbers,
> SKUs, digits analyzed as text, whatever.
> 2> you have a lot of garbage in your input. OCR is notorious for this,
> as are binary blobs.
> The TermsComponent is your friend; it'll allow you to get an idea of
> what the actual terms are, though it does take a bit of poking around.
> There's no good way I know of to tell which docs are taking up space
> in the index. What I'd probably do is use Tika in a SolrJ client and
> look at the data as I sent it, here's a place to start:
> Best,
> Erick
> On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <> wrote:
>> One of my fields (the "phrase suggestion" field) has 30'860'099
>> terms. Is this "too much"?
>> Another field (the "single word suggestion") has 2'156'218 terms.
>> -----Original Message-----
>> From: Clemens Wyss DEV []
>> Sent: Friday, 8 May 2015 15:54
>> To:
>> Subject: determine "big" documents in the index?
>> Context: Solr/Lucene 5.1
>> Is there a way to determine which documents occupy a lot of "space"
>> in the index? As I don't store any fields that have text, it must be
>> the terms extracted from the documents that occupy the space.
>> So my question is: which documents occupy the most space in the
>> inverted index?
>> Context:
>> I index approx. 7000 PDFs (extracted with Tika) into my index. I
>> suspect that for some PDFs the extracted text is not really text but
>> "binary blobs". In order to verify this (and possibly omit these
>> PDFs) I hope to get some hints from Solr/Lucene ;)
