lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
Date Sun, 16 May 2010 20:48:43 GMT


Robert Muir commented on LUCENE-2413:

bq. May this much faster than CharArraySet

I ran indexing tests a while ago (Reuters) with CharArraySet itself implemented with a DFA,
and it was slightly faster, but not much. I think this is because English words are usually
not very long (average length=5). For other languages this technique might save some CPU time,
but there are some "problems" I imagine:

# building an automaton from a list of words is more expensive, although Dawid Weiss has implemented
an addition to automaton that does this fast.
# in general building the automaton, RunAutomaton, etc. is more "heavy" I would think, but Mike
McCandless hacked away a lot of this heaviness when we converted to UTF-32.
# the CharacterRunAutomaton is not optimized right now: we disabled the classmap[] for chars
because it consumes more RAM. I think if we were to care about performance on char[] we should
make the classmap cover 0x0-0xffff and binary search the rest, or something similar. Currently it
binary searches on each input character.
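
To make the third point concrete, here is a rough sketch of the "table for 0x0-0xffff, binary search for the rest" idea. It is plain Java and not the actual CharacterRunAutomaton code: the class CharClassLookup, its fields, and its methods are hypothetical names made up for illustration, and it assumes the character classes are given as a strictly increasing array of interval start points beginning at 0.

```java
import java.util.Arrays;

/** Hypothetical sketch only; this is not Lucene's CharacterRunAutomaton. */
final class CharClassLookup {
    private static final int TABLE_SIZE = 0x10000; // covers 0x0-0xffff

    private final int[] points;   // strictly increasing interval start points, points[0] == 0
    private final int[] classmap; // precomputed class index for every code point below TABLE_SIZE

    CharClassLookup(int[] sortedPoints) {
        if (sortedPoints.length == 0 || sortedPoints[0] != 0) {
            throw new IllegalArgumentException("points must start at 0");
        }
        for (int i = 1; i < sortedPoints.length; i++) {
            if (sortedPoints[i] <= sortedPoints[i - 1]) {
                throw new IllegalArgumentException("points must be strictly increasing");
            }
        }
        this.points = sortedPoints.clone();
        // The table costs 64K entries per instance; that is the RAM trade-off.
        this.classmap = new int[TABLE_SIZE];
        int cls = 0;
        for (int c = 0; c < TABLE_SIZE; c++) {
            // Advance to the last interval whose start point is <= c.
            while (cls + 1 < points.length && points[cls + 1] <= c) {
                cls++;
            }
            classmap[c] = cls;
        }
    }

    /** Class index of a valid (non-negative) code point. */
    int classOf(int codePoint) {
        if (codePoint < TABLE_SIZE) {
            return classmap[codePoint]; // one array read for any char value
        }
        return binarySearchClass(codePoint); // supplementary code points only
    }

    /** Slow path: binary search over the interval start points. */
    int binarySearchClass(int codePoint) {
        int idx = Arrays.binarySearch(points, codePoint);
        // Not found: idx == -(insertionPoint) - 1, and the class is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

With something like this, every value that fits in a char is resolved with a single array read, and only supplementary code points pay for the binary search. Whether the extra table per automaton is worth the RAM is exactly the open question.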

Somewhat related, a while ago I tested this with CharArraySet as a DFA, and opened this issue:
LUCENE-2227. But obviously this is not the only way, as this example shows filtering on the
DFA itself (and not using CharArraySet at all).

So in general, I have those concerns right now, but maybe in the future once some things are
addressed we could at least make an optional StopFilter impl or something similar.

One thing I like about this filter personally is that rejected terms always get (optionally)
the posInc increased... I do not think our existing KeepWordFilter or LengthFilter do this, but
maybe I am wrong.
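
For anyone unsure what "getting the posInc increased" means, here is a small model of the behaviour in plain Java. It does not use the Lucene TokenStream API: PosIncSketch, Token, and filter are hypothetical names for illustration only, and the enablePosInc flag stands in for the "(optionally)" part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical model of position-increment handling; not a Lucene TokenFilter. */
final class PosIncSketch {

    /** Minimal stand-in for a token: term text plus its position increment. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * Drops rejected terms. When enablePosInc is true, each dropped term adds 1
     * to the position increment of the next surviving term, so the gap stays
     * visible to phrase and span queries.
     */
    static List<Token> filter(List<String> terms, Set<String> rejected, boolean enablePosInc) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (String term : terms) {
            if (rejected.contains(term)) {
                if (enablePosInc) {
                    skipped++;
                }
                continue;
            }
            out.add(new Token(term, 1 + skipped));
            skipped = 0;
        }
        return out;
    }
}
```

For the input "the quick the a fox" with "the" and "a" rejected, "quick" comes out with increment 2 and "fox" with increment 3 when the flag is on; with the flag off both get 1 and the holes disappear.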

> Consolidate all (Solr's & Lucene's) analyzers into modules/analysis
> -------------------------------------------------------------------
>                 Key: LUCENE-2413
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Michael McCandless
>            Assignee: Robert Muir
>             Fix For: 4.0
>         Attachments: LUCENE-2413-charfilter.patch, LUCENE-2413-PFAW+LF.patch, LUCENE-2413_commongrams.patch,
LUCENE-2413_folding.patch, LUCENE-2413_htmlstrip.patch, LUCENE-2413_keep_hyphen_trim.patch,
LUCENE-2413_mockfilter.patch, LUCENE-2413_mockfilter.patch, LUCENE-2413_pattern.patch, LUCENE-2413_porter.patch,
LUCENE-2413_removeDups.patch, LUCENE-2413_synonym.patch, LUCENE-2413_teesink.patch, LUCENE-2413_testanalyzer.patch,
LUCENE-2413_testanalyzer.patch, LUCENE-2413_tests2.patch, LUCENE-2413_wdf.patch
> We've been wanting to do this for quite some time now...  I think, now that Solr/Lucene
are merged, and we're looking at opening an unstable line of development for Solr/Lucene,
now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately version the analyzers
from which version of Solr/Lucene they use, possibly enabling us to remove Version entirely
from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from the analysis
API), but I don't think that issue needs to block this consolidation.
> Once we do this, there is one place where our users can find all the analyzers that Solr/Lucene

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
