lucene-dev mailing list archives

From Wolfgang Hoschek <>
Subject Re: [Performance] Streaming main memory indexing of single strings
Date Wed, 27 Apr 2005 01:47:58 GMT
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to  
along with another contrib - PatternAnalyzer.
For a quick overview without downloading code, there's javadoc for it  
all at 

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?
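
For reference, indexing and querying a single transient string with the MemoryIndex contrib looks roughly like this. This is only a sketch against the contrib API as submitted (and Lucene 1.4's static QueryParser.parse); treat the exact method names and the sample text as assumptions:

```java
// Sketch only: assumes the MemoryIndex contrib's addField/search methods
// as described in its javadoc; adjust to whatever the final contrib exposes.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;

public class MemoryIndexSketch {
    public static void main(String[] args) throws Exception {
        MemoryIndex index = new MemoryIndex();
        // Index a single string on the fly -- no Directory, no IndexWriter:
        index.addField("content",
                "Readings about Salmons and other select Alaska fishing Manuals",
                new StandardAnalyzer());
        // Run an arbitrary Lucene query against it; the result is a relevance
        // score, with 0.0f meaning "no match":
        float score = index.search(
                QueryParser.parse("+salmon~ +fish*", "content", new StandardAnalyzer()));
        System.out.println(score > 0.0f ? "match" : "no match");
    }
}
```

This is the "index+query step" measured in the throughput figures quoted further down the thread: one string in, one query against it, then the index is discarded.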


  * Efficient Lucene analyzer/tokenizer that preferably operates on a String
  * rather than a {@link java.io.Reader}, that can flexibly separate text
  * on a regular expression {@link Pattern}
  * (with behaviour identical to {@link String#split(String)}),
  * and that combines the functionality of
  * {@link org.apache.lucene.analysis.LetterTokenizer},
  * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
  * {@link org.apache.lucene.analysis.WhitespaceTokenizer} and
  * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
  * multi-purpose class.
  * <p>
  * If you are unsure how exactly a regular expression should look,
  * consider prototyping by simply trying various expressions on some test texts
  * via {@link String#split(String)}. Once you are satisfied, give that regex to
  * PatternAnalyzer. Also see <a target="_blank"
  * href="">Java Regular Expression Tutorial</a>.
  * <p>
  * This class can be considerably faster than the "normal" Lucene tokenizers.
  * It can also serve as a building block in a compound Lucene
  * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
  * stemming example:
  * <pre>
  * PatternAnalyzer pat = ...
  * TokenStream tokenStream = new SnowballFilter(
  *     pat.tokenStream("content", "James is running round in the woods"),
  *     "English");
  * </pre>
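
As the javadoc above suggests, a separator regex intended for PatternAnalyzer can be prototyped with plain String#split(String) first, since the splitting behaviour is identical. A minimal self-contained example (the regex and text here are illustrative, not from the contrib):

```java
import java.util.Arrays;

public class SplitPrototype {
    public static void main(String[] args) {
        // Candidate separator regex: one or more non-word characters.
        // Whatever String.split produces here is exactly the token stream
        // PatternAnalyzer would emit (before lowercasing / stop word removal).
        String text = "James is running round in the woods";
        String[] tokens = text.split("\\W+");
        System.out.println(Arrays.toString(tokens));
        // [James, is, running, round, in, the, woods]
    }
}
```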

On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:

> I've now got the contrib code cleaned up, tested and documented into a  
> decent state, ready for your review and comments.
> Consider this a formal contrib (Apache license is attached).
> The relevant files are attached to the following bug ID:
> For a quick overview without downloading code, there's some javadoc at  
> summary.html
> There are several small open issues listed in the javadoc and also  
> inside the code. Thoughts? Comments?
> I've also got small performance patches for various parts of Lucene  
> core (not submitted yet). Taken together they lead to substantially  
> improved performance for MemoryIndex, and most likely also for Lucene  
> in general. Some of them are more involved than others. I'm now  
> figuring out how much performance each of these contributes and how to  
> propose potential integration - stay tuned for some follow-ups to  
> this.
> The code as submitted would certainly benefit a lot from said patches,  
> but they are not required for correct operation. It should work out of  
> the box (currently only on 1.4.3 or lower). Try running
> 	cd lucene-cvs
> 	java org.apache.lucene.index.memory.MemoryIndexTest
> with or without custom arguments to see it in action.
> Before turning to a performance patch discussion I'd at this point  
> rather be most interested in folks giving it a spin, comments on the  
> API, or any other issues.
> Cheers,
> Wolfgang.
> On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
>> On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
>>> On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
>>>> By the way, by now I have a version against 1.4.3 that is 10-100  
>>>> times faster (i.e. 30000 - 200000 index+query steps/sec) than the  
>>>> simplistic RAMDirectory approach, depending on the nature of the  
>>>> input data and query. From some preliminary testing it returns  
>>>> exactly what RAMDirectory returns.
>>> Awesome.  Using the basic StringIndexReader I sent?
>> Yep, it's loosely based on the empty skeleton you sent.
>>> I've been fiddling with it a bit more to get other query types.   
>>> I'll add it to the contrib area when it's a bit more robust.
>> Perhaps we could merge up once I'm ready and put that into the  
>> contrib area? My version now supports tokenization with any analyzer  
>> and it supports any arbitrary Lucene query. I might make the API for  
>> adding terms a little more general, perhaps allowing arbitrary  
>> Document objects if that's what other folks really need...
>>>> As an aside, is there any work going on to potentially support  
>>>> prefix (and infix) wild card queries ala "*fish"?
>>> WildcardQuery supports wildcard characters anywhere in the string.   
>>> QueryParser itself restricts expressions that have leading wildcards  
>>> from being accepted.
>> Any particular reason for this restriction? Is this simply a current  
>> parser limitation or something inherent?
>>> QueryParser supports wildcard characters in the middle of strings no  
>>> problem though.  Are you seeing otherwise?
>> I meant an infix query such as "*fish*"
>> Wolfgang.
>> -----------------------------------------------------------------------
>> Wolfgang Hoschek                  |   email:
>> Distributed Systems Department    |   phone: (415)-533-7610
>> Berkeley Laboratory               |
>> -----------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
