lucene-general mailing list archives

From Erik Hatcher <>
Subject Re: Performance problem
Date Wed, 24 Aug 2005 13:30:12 GMT

On Aug 24, 2005, at 3:32 AM, Wolfgang Täger wrote:

> Dear all,
> we are using Lucene to store 10 million bilingual sentence pairs for
> natural language processing. Each document contains a sentence, its
> translation, and a topical code. We want to select sentences containing
> certain words and compute statistics over the topical codes in order to
> detect translations that depend on the topic (e.g. key => Taste (topic:
> input devices), key => Schlüssel (topic: cryptography)).
> While the search itself completes in a reasonably short time (about
> 500..800 ms), we have a performance problem with actually retrieving the
> documents, using code like:
> for (int i = nrhits-1; i >=0; i--){
>         Document hitDoc = hits.doc(i);
>         String code=hitDoc.get("code");
>         ... statistics
> }
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have
> expected quicker retrieval. BTW, the loop runs in reverse order in the
> hope of accelerating the retrieval.

How many documents are you trying to retrieve?  I think you'll have
much better luck if you walk the documents in ascending Hits order
rather than backwards, as Hits caches documents on the presumption that
you'll move forward through them.  I'd be curious to see how much (or
whether) moving forward through Hits helps.
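
For illustration, the forward walk could be sketched roughly as below. RankedHits is a hypothetical stand-in for Hits (it is not Lucene's class), reduced to a length and a per-hit lookup of the "code" field. CodeStats, countCodes and of are likewise made-up names. The sketch only shows the loop direction and the tally over topical codes; it does not measure or demonstrate any speedup.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a ranked result list such as Hits.
// Only the two operations the loop needs are modelled.
interface RankedHits {
    int length();
    String code(int i); // assumed: value of the "code" field of hit i
}

class CodeStats {
    // Walk the hits in ascending order (0, 1, 2, ...) and count how often
    // each topical code occurs among the first maxHits results.
    static Map<String, Integer> countCodes(RankedHits hits, int maxHits) {
        Map<String, Integer> counts = new TreeMap<>();
        int n = Math.min(hits.length(), maxHits);
        for (int i = 0; i < n; i++) {
            String code = hits.code(i);
            if (code != null) {
                counts.merge(code, 1, Integer::sum);
            }
        }
        return counts;
    }

    // In-memory hits, for trying the loop without an index.
    static RankedHits of(List<String> codes) {
        return new RankedHits() {
            public int length() { return codes.size(); }
            public String code(int i) { return codes.get(i); }
        };
    }
}
```

Against a real index, the body of the loop would be the same two lines as in the quoted code (fetch the document, read its "code" field), with only the loop header changed to count upward from 0.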

