lucene-general mailing list archives

From Wolfgang Täger <>
Subject Re: Lucene - Search Optimization Problem
Date Tue, 26 Feb 2008 07:55:07 GMT
Hi João,

if you need 10,000 or more hits, this might require 10,000 or more disk seeks.
Given the access time of disks, there is probably no way to get
significantly faster using Lucene on the same hardware.
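A back-of-envelope sketch of why seek time dominates (the ~10 ms per random seek is an assumed typical figure for a 2008-era spinning disk, not a measured value):

```python
# Back-of-envelope estimate: if every hit needs one random disk access,
# total time is dominated by seek latency, not transfer speed.
# Assumed figures: ~10 ms per random seek (7200 rpm disk),
# ~0.1 ms effective access when the data is already cached in RAM.

def estimated_seconds(num_hits, seek_ms):
    """Time spent on random accesses alone, in seconds."""
    return num_hits * seek_ms / 1000.0

disk_time = estimated_seconds(10_000, 10.0)  # index on spinning disk
ram_time = estimated_seconds(10_000, 0.1)    # index cached in RAM

print(f"disk: ~{disk_time:.0f} s, RAM-cached: ~{ram_time:.0f} s")
```

This is of course a worst case (not every hit is a separate seek), but it shows why more RAM or flash storage is the lever to pull.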

Either you can organise your data so that it is more local on the hard disk
(which you probably can't), or you need to use memory with lower access time
than hard disks: more RAM for caching, an SSD, or other flash drives.

You may try a cheap 8 GB USB stick with low access time.
Another possibility is to use a suitable OS with at least 8 GB of RAM.
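One way to use that RAM is to load the whole index into memory at startup. A minimal PyLucene sketch, assuming the PyLucene 2.3 bindings are available (`RAMDirectory(Directory)` copies an existing on-disk index into memory):

```python
# Sketch: open the on-disk index through a RAMDirectory so that all
# subsequent searches hit RAM instead of the disk. Assumes PyLucene 2.3;
# the import is guarded so the sketch degrades gracefully without it.
try:
    import lucene
    lucene.initVM(lucene.CLASSPATH)
except ImportError:
    lucene = None  # PyLucene not installed

def open_ram_searcher(index_path):
    """Copy the index at index_path into RAM and return an IndexSearcher."""
    if lucene is None:
        raise RuntimeError("PyLucene is required for this sketch")
    fs_dir = lucene.FSDirectory.getDirectory(index_path)
    ram_dir = lucene.RAMDirectory(fs_dir)  # copies the whole index into memory
    return lucene.IndexSearcher(ram_dir)
```

Note that a 6.6 GB index needs a correspondingly large heap, so this only pays off on a machine with at least the 8 GB of RAM mentioned above.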

If you do so, please share your results.

Best regards,

Wolfgang Täger

"João Rodrigues" <> 
24-02-2008 16:19
Lucene - Search Optimization Problem

Hello all!

I've finally got round to setting up Lucene 2.3.0 on my two production boxes
(Ubuntu 7.10 and Windows XP), after quite some trouble with the JCC
methods. Now I have my application all up and running and... it's damn
slow :( I'm running PyLucene, by the way, and I've already asked on that
list, where I was directed here.

I have a 6.6 GB index with more than 5,000,000 biomedical abstracts.
Each document has two fields: an integer, which I will want to retrieve after
the search (the ID of the document, sort of), and an 80-word, stored,
tokenized string, which will be searched upon. So, I insert the query (say,
foo bar), it first builds a sort of "boolean query" with a format such
as: 'foo' AND 'bar'. Then it parses it and spits out the results.
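The query-building step described above can be sketched in plain Python (a hypothetical helper, not the poster's actual code; the resulting string would then be handed to Lucene's QueryParser):

```python
def build_boolean_query(user_input):
    """Turn whitespace-separated terms into an AND-joined boolean query.

    Hypothetical helper mirroring the step described in the mail:
    'foo bar' -> "'foo' AND 'bar'".
    """
    terms = user_input.split()
    return " AND ".join(f"'{t}'" for t in terms)

print(build_boolean_query("foo bar"))  # -> 'foo' AND 'bar'
```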

Problem is, unlike most of the posts I've read, I don't want the first 10 or
100 results. I want the first 10,000, or even all of them. I've read that a
HitCollector is suited for this task, but my first search on Google got me an
expressive "HitCollector is too slow on PyLucene", so I kind of ruled out
that option. As it is right now, it takes minutes to get me the results I
need. I'll post the code on pastebin and link it for those who feel in the
mood to read a n00b's code and help (see below). I've tracked down the slowness
to the "doc.get("PMID")" method in the Searcher function.
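Calling doc.get() for every hit forces a full document load per result; a common Lucene 2.3 alternative is to read the integer field from the FieldCache, which is populated once per IndexReader. A hedged PyLucene sketch (it assumes the PMID field was indexed as an un-tokenized integer value, since FieldCache reads indexed terms, not stored fields; the helper name is illustrative):

```python
# Sketch: avoid one doc.get() per hit by reading the PMID column from
# Lucene's FieldCache (filled once per IndexReader, then array lookups).
# Assumes PyLucene 2.3 and that "PMID" was indexed un-tokenized so the
# FieldCache can parse it; guarded import for environments without PyLucene.
try:
    import lucene
    lucene.initVM(lucene.CLASSPATH)
except ImportError:
    lucene = None

def collect_pmids(searcher, query):
    """Return the PMIDs of all matching documents via the FieldCache."""
    if lucene is None:
        raise RuntimeError("PyLucene is required for this sketch")
    reader = searcher.getIndexReader()
    pmids = lucene.FieldCache.DEFAULT.getInts(reader, "PMID")
    results = []

    class Collector(lucene.PythonHitCollector):
        def collect(self, doc_id, score):
            results.append(pmids[doc_id])  # array lookup, no document load

    searcher.search(query, Collector())
    return results
```

One caveat: the per-hit Python callback still carries JCC overhead, which may be what the "HitCollector is too slow on PyLucene" complaint refers to; even so, replacing per-hit doc.get() with a cached array lookup is usually the bigger win.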

My question is: how can I make my search faster? My index wasn't optimized
because it was huge and it was built with GCC. By now it is probably
optimized (I left an optimizer running last night), so that is taken care
of. I've considered threading as well, since I'll perform three different
searches per "round". Thing is, I'm pretty green when it comes to
programming (I'm a biologist) and I've never really understood how
threading works. If someone can point me to the right tutorial or
documentation, I'd be glad to hack it up myself. Another option I've
been given was to use an implementation of Lucene written in either C# or
C++. However, <> isn't up to date, and neither is CLucene.

So, if you think you can give out a tip on how to make my script run faster,
I'd thank you more than a lot. It's a shame that my project fails because of
this technical handicap :(

LINKS: -> Main Code --> Searcher Functions

Best regards to you all,

João Rodrigues
