lucene-java-user mailing list archives

From "Eric Scott" <>
Subject Question about querying for files in a zip file
Date Fri, 08 Jun 2007 18:54:26 GMT
This isn't a "How do I index a zip file?" question.  It's a bit more
complicated than that.

We have an index where zip files are broken apart and the contained
files are indexed.  The index also contains a doc for the zip file
itself.  The user has the option of (A) querying for the contained files
that match the query (a vanilla query), or (B) querying for the unique
set of zip files that have contained files that match the query.  My
question is how to *efficiently* accomplish option (B) in Lucene.

In case it helps, here's another way to explain the requirement in a
relational model.  If you had a table of docs with these columns:

    MyDocs table
    docid
    filename
    zipfilename
    Other columns to match on...

then option (B) can be returned with a simple join:

    select distinct zip.docid, zip.other-columns, ...
    from mydocs zip, mydocs contained
    where
        contained.zipfilename = zip.filename
        and contained.docid matches lucene query...

In Lucene, the conceptual, straightforward solution is something like:

    Do a lucene query to get the matching contained docs.
    For each matching doc:
        Look up the zip filename via a field on the doc.
        If the zip file is not part of our zipfile result set yet, then
            Save the zip filename in the result set.
    Run another lucene query to look up the zipfile docids in the
        zipfile result set.
    Read any required fields for each zipfile doc.
    Return the zipfile result set with the required fields.
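
To make the steps above concrete, here is a minimal sketch of the two passes in plain Java. It is not Lucene code: the class name TwoPassZipLookup, both method names, and the in-memory list and map are hypothetical stand-ins. The list plays the role of the zip filename field read from each hit of the first query, and the map plays the role of the second query that resolves zip filenames to zipfile docids.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the two-pass approach; not Lucene API.
class TwoPassZipLookup {

    // Pass 1: collect the distinct zip filenames named by the matching
    // contained docs. Each list element is the zip filename field of one hit.
    static Set<String> distinctZipFilenames(List<String> zipFilenameOfEachHit) {
        // LinkedHashSet drops duplicates and keeps first-seen order.
        return new LinkedHashSet<>(zipFilenameOfEachHit);
    }

    // Pass 2: resolve each zip filename to its zipfile docid.
    // The map is an assumption standing in for the second Lucene query.
    static List<Integer> zipDocIds(Set<String> zipFilenames,
                                   Map<String, Integer> zipDocIdByFilename) {
        List<Integer> ids = new ArrayList<>();
        for (String name : zipFilenames) {
            Integer id = zipDocIdByFilename.get(name);
            if (id != null) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Three contained docs matched; two of them live in a.zip.
        List<String> hits = List.of("a.zip", "b.zip", "a.zip");
        Map<String, Integer> zipDocs = Map.of("a.zip", 10, "b.zip", 20);
        System.out.println(zipDocIds(distinctZipFilenames(hits), zipDocs)); // prints [10, 20]
    }
}
```

The sketch also shows where the cost comes from: pass 1 touches every matching contained doc to read one field, and the set of zip filenames has to be held in memory before pass 2 can run.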

The trouble with this solution is that it is very slow and a memory hog.
Does anyone have any nifty ideas that beat this straightforward
approach?
We would also entertain alternative indexing approaches.  We even
considered concatenating all the text of the contained docs into a doc
indexed as the zipfile, but lucene only indexes part of a large file and
even if that were resolved, proximity searches can return false
positives.
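
The proximity concern can be illustrated with a small plain-Java sketch. It assumes the problem is that, once the contained files are concatenated, two terms from different files can end up adjacent. The class ConcatProximityDemo and its near and concat methods are hypothetical, and near is a simplified distance check, not Lucene's actual slop semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of a cross-file proximity match; not Lucene API.
class ConcatProximityDemo {

    // True if terms a and b occur within maxDistance token positions
    // of each other anywhere in the token list.
    static boolean near(List<String> tokens, String a, String b, int maxDistance) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(j).equals(b) && Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    // Concatenate the tokens of two contained files into one "zipfile" doc.
    static List<String> concat(List<String> first, List<String> second) {
        List<String> all = new ArrayList<>(first);
        all.addAll(second);
        return all;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of("quarterly", "report", "budget");
        List<String> file2 = List.of("cuts", "announced");
        System.out.println(near(file1, "budget", "cuts", 1));                // prints false
        System.out.println(near(file2, "budget", "cuts", 1));                // prints false
        System.out.println(near(concat(file1, file2), "budget", "cuts", 1)); // prints true
    }
}
```

Neither contained file matches on its own, yet the concatenated doc does, because the last token of one file sits next to the first token of the next.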

And FYI, scoring is not an issue on the zip file.  It's purely match or
no-match semantics.


- Eric Scott
