lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From 1world1love <>
Subject retrieve all docs efficiently - just one field
Date Tue, 10 Jun 2008 22:35:33 GMT

Greetings all. I have read many posts concerning similar use cases, but I am
still a little hazy on the best way to achieve what I need to do. Here is
the background:

2 million documents with multiple sections, some sections contain structured
data, some unstructured.

We parse the docs and place the structured stuff in oracle where each
section is a table and one master table to relate them all.

We index the unstructured sections with lucene where each section is a
document (meaning a total of about ~30 million documents) with extra fields
including one for the primary key of the master table and then some meta
fields to describe the section - type, date, etc.

For a common use case, say we have a table called demographics with a number
field that represents age (overly simplistic but gets the point across).

So say we want all people over the age of 50 who may have visited Panama: 

We have our lucene index and we want to search the section text for the word


We want to select from the demographics table where age > 50.

Now I need to intersect the master table IDs from my lucene hits and my
table results. 

I have a java stored procedure that runs the lucene query and creates a
temporary table with a single column where I insert the master id from the
hits of my lucene query. I then can do a join with my structured query

The problem here is obviously the speed of iterating through the hits to
extract the single field that I need.

- I must be able to get a full set of results, though I only need the one id
- We originally went with Oracle text which was simple, but limited and
quite slow for most queries

I have read a little about the hitcollector class and the fieldselector api,
but I am still not sure how they may help me or even if they can.

I have also tooled around with the idea of using termdocs, but the queries
may get a little complex with various ors/ands/nots, though probably not
spans and so forth.

Any suggestions will be greatly apreciated.



View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message