lucene-solr-user mailing list archives

From Shawn Heisey
Subject Re: spellcheck index blown away during rebuild
Date Mon, 23 Aug 2010 05:51:05 GMT
On 8/20/2010 8:56 PM, Lance Norskog wrote:
> The first question is about your use cases. How many words are in the
> eventual 3GB spelling index? Do you really need that many?
> Spell-checking is a more controllable UI if you make it from a
> dictionary.

It's built from an index-only field that combines four other fields.  
The data we are indexing is metadata from photos, text articles, and 
videos, with most of it being photos.  On a single shard, the schema 
browser shows 23612208 distinct terms in the catchall field, from 
7305684 documents.  If the spellcheck index gets one entry per distinct 
term, that accounts for its size.

Perhaps I need to make another catchall field that leaves out the "full" 
text field.  I'll have to experiment, because my index is already bigger 
than I want it to be.  I have no budget for throwing more hardware at 
the problem.  We are in the process of rewriting our application so that 
we can reduce our index size, but that is still a few months out.

Aside from the index itself, I'm not sure where I'd get an appropriate 
dictionary for photo metadata that would not require major manual work.  
Is there any easy way to get the full list of distinct terms and their 
counts? I'd imagine that if I could filter out those with only a handful 
of occurrences, the list would be dramatically smaller.  Other filters 
might be useful as well, such as removing terms longer than, say, 15 or 20 
characters.  Normally I'd go to the facet feature for this sort of 
information, but I'm not sure my servers could handle that.
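
The filtering described above can be sketched in a few lines, assuming 
the distinct terms and their counts have already been exported as 
(term, count) pairs by some means.  The function name, the thresholds, 
and the sample data are all hypothetical and are not part of Solr.

```python
from typing import Dict, Iterable, Tuple


def filter_terms(
    term_counts: Iterable[Tuple[str, int]],
    min_count: int = 5,    # hypothetical threshold: drop rare terms
    max_length: int = 20,  # hypothetical threshold: drop very long terms
) -> Dict[str, int]:
    """Keep only terms that occur often enough and are short enough."""
    kept: Dict[str, int] = {}
    for term, count in term_counts:
        if count >= min_count and len(term) <= max_length:
            kept[term] = count
    return kept


# Made-up sample data standing in for an exported term list.
sample = [
    ("sunset", 1200),
    ("sunsett", 2),    # rare, probably a typo
    ("x" * 25, 40),    # longer than max_length
    ("beach", 950),
]
dictionary = filter_terms(sample)
```

The surviving terms could then be written out one per line and used as 
a file-based spelling dictionary instead of the full catchall field.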

> What you're talking about is effectively promoting the spellcheck
> index to a first-class Solr index, instead of an appendage bolted on
> the side of an existing core. Given sharding and distributed search,
> this may be a better design.

Can you elaborate on what "this" refers to above?  Are you saying that 
you think promoting it to a full Solr index is a good idea?  I saw a 
Jira issue with the idea of building the spellcheck index at the same 
time as the rest of the index, and storing it in the same directory.  
This sounds like a very good way to go, especially if the filtering I 
mentioned above were a part of the configuration.

