lucene-java-user mailing list archives

From: Adrien Grand
Subject: Re: Too many unique terms
Date: Wed, 24 Apr 2013 23:04:14 GMT

Hi Manuel,

On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand wrote:
> Hi there,
> Looking at my index (about 1M docs) I see a lot of unique terms, more
> than 8M, which is a significant part of my total term count. These are very
> likely useless terms, binaries or other meaningless numbers that come with
> a few of my docs.

If you are only interested in letters, one option is to change your
analysis chain to use LetterTokenizer. This tokenizer will split on
everything that is not a letter, filtering out numbers and binary data.
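
To make the splitting rule concrete, here is a minimal sketch in plain Java.
It is not Lucene's LetterTokenizer class, only a stand-in that applies the
same rule (a token is a maximal run of letters). The class name
LetterSplitSketch and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene's LetterTokenizer class.
final class LetterSplitSketch {

    // Emits maximal runs of letters; every non-letter character ends the current token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input mixing words, a number and a hex-like string.
        System.out.println(tokenize("order 66481 shipped, id=0x7f3a9c"));
    }
}
```

Under this rule the pure number disappears, but a hex-like string such as
0x7f3a9c still leaves single-letter fragments (x, f, a, c), so you may want
to check how many such short tokens remain in your own data.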

> I am totally fine with deleting them, so these terms would become unsearchable.
> Thinking about it, I see that:
> 1. It is impossible to know a priori whether a term is unique or not, so I
> cannot add them to my stop words.
> 2. I have a performance decrease because my cached "hot spot" chunks (4 KB)
> contain useless data. It's a problem for me as I'm short on memory.
> Q:
> Assuming a constant index, is there a way of deleting all terms that are
> unique from at least the dictionary .tim and .tip files? Do I need to go into
> the source code for this, and if so, which part of it?

If frequencies are indexed, you can pull a TermsEnum, iterate through
the terms dictionary and delete terms that are less frequent than a
given threshold. As you said, this will however prevent your users
from searching for these terms anymore.
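
The selection step can be sketched as follows. This is plain Java, not the
Lucene API: a TreeMap from term to document frequency stands in for the
sorted terms dictionary that a TermsEnum would walk (reading docFreq() for
each term). The class RareTermSketch, its methods and the threshold are
hypothetical names chosen for the example, and actually rewriting an index
without the selected terms is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for the sorted terms dictionary.
final class RareTermSketch {

    // Returns the terms whose document frequency is below minDocFreq, in dictionary order.
    static List<String> termsBelow(TreeMap<String, Integer> docFreqByTerm, int minDocFreq) {
        List<String> rare = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() < minDocFreq) {
                rare.add(entry.getKey());
            }
        }
        return rare;
    }

    // Removes the rare terms from the model dictionary and returns how many were dropped.
    static int prune(TreeMap<String, Integer> docFreqByTerm, int minDocFreq) {
        List<String> rare = termsBelow(docFreqByTerm, minDocFreq);
        docFreqByTerm.keySet().removeAll(rare);
        return rare.size();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dict = new TreeMap<>();
        dict.put("lucene", 120);
        dict.put("index", 45);
        dict.put("0x7f3a9c", 1); // hypothetical junk term seen in a single document
        System.out.println("dropped " + prune(dict, 2) + ", kept " + dict.keySet());
    }
}
```

With a threshold of 2, only terms that occur in a single document are
selected, which matches the "unique terms" described in the question.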

> Will I get a significant query-time performance increase besides the better
> RAM use benefit?

This is hard to answer. Having fewer terms in the terms dictionary
should make search a little faster but I can't tell you by how much.
You should also try to disable features that you don't use. For
example, if you don't need positional information or frequencies,
IndexOptions.DOCS_ONLY will make your postings lists smaller.
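
As a rough illustration of why docs-only postings are smaller, the toy model
below counts how many integers one term's postings list would hold with and
without frequencies and positions. It is plain Java, not Lucene code; real
postings are compressed, so these counts are not byte sizes. The class
PostingsSizeSketch and its method names are invented for the example.

```java
// Illustration only: counts the integers a toy postings list would store for one term.
final class PostingsSizeSketch {

    // positionsPerDoc[i] holds the positions of the term in the i-th matching document.
    static int docsOnly(int[][] positionsPerDoc) {
        return positionsPerDoc.length; // one doc ID per matching document
    }

    static int docsFreqsPositions(int[][] positionsPerDoc) {
        int count = 0;
        for (int[] positions : positionsPerDoc) {
            count += 1;                // doc ID
            count += 1;                // term frequency
            count += positions.length; // one entry per occurrence
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical term occurring in three documents, six times overall.
        int[][] postings = { {3, 17, 42}, {5}, {8, 9} };
        System.out.println(docsOnly(postings) + " vs " + docsFreqsPositions(postings));
    }
}
```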

