lucene-dev mailing list archives

From "Christopher Morris (JIRA)" <>
Subject [jira] Commented: (LUCENE-1292) Tag Index
Date Fri, 06 Jun 2008 09:18:45 GMT


Christopher Morris commented on LUCENE-1292:

The dynamic index is an ordinary Lucene index, wrapped to resemble a dynamic index.

Each modification to a dynamic term creates a document with two fields: the dynamic term
field, and a field named "PK" followed by the dynamic term field name. The "PK" field
contains the primary key followed by the term text. The dynamic term field contains the
dynamic term text followed by either ADD or DEL, with the term position representing the
primary key. A single document can hold multiple additions and deletions.
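
The document layout described above can be sketched in plain Java, without any Lucene dependency. This is only an illustrative model, not the code under discussion: the class `ModificationDoc`, its members, and the choice of an `int` primary key (so that it can double as a term position) are all assumptions, and the values are concatenated with no separator purely for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one "modification document"; names are illustrative only.
final class ModificationDoc {
    // One token in the dynamic term field: text is term + "ADD" or "DEL",
    // and the position stands in for the primary key.
    static final class Token {
        final String text;
        final int position;
        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    final String field;                                 // dynamic term field name
    final List<Token> fieldTokens = new ArrayList<>();  // contents of the dynamic term field
    final List<String> pkValues = new ArrayList<>();    // contents of the "PK" + field field

    ModificationDoc(String field) {
        this.field = field;
    }

    // Name of the companion field: "PK" followed by the dynamic term field name.
    String pkField() {
        return "PK" + field;
    }

    ModificationDoc add(String term, int primaryKey) {
        return record(term, primaryKey, "ADD");
    }

    ModificationDoc delete(String term, int primaryKey) {
        return record(term, primaryKey, "DEL");
    }

    // A single document may record several additions and deletions.
    private ModificationDoc record(String term, int primaryKey, String op) {
        fieldTokens.add(new Token(term + op, primaryKey));
        pkValues.add(primaryKey + term); // primary key followed by the term text
        return this;
    }
}
```

For example, `new ModificationDoc("folder").add("inbox", 42)` yields a token `inboxADD` at position 42 in field `folder`, and the value `42inbox` in field `PKfolder`.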

indexReader.docFreq() for a dynamic term is the sum of the TermDocs freq() values for the
dynamic term's ADD entries minus the sum for its DEL entries. terms() is the underlying
terms() for all fields whose names do not start with "PK", filtered by whether the dynamic
term still exists (docFreq() > 0). Retrieving the terms for a primary key/field combination
uses the TermEnum over all terms in field ("PK" + field) whose text starts with the primary
key. Terms with an odd docFreq() still exist (they have been added more times than deleted).
TermDocs uses TermPositions for ADD and DEL to seek through the index, toggling each primary
key between existing and not existing.
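
As a rough illustration of the bookkeeping above, here is a plain-Java sketch that takes already-extracted frequencies and positions as input instead of reading them from Lucene. The class and method names are hypothetical, and it assumes a well-formed history in which additions and deletions of a given primary key alternate, so that an odd number of toggles means the key is currently present.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; inputs stand in for values read via TermDocs/TermPositions.
final class DynamicTermMath {
    // docFreq = total ADD occurrences minus total DEL occurrences.
    static int docFreq(int[] addFreqs, int[] delFreqs) {
        int sum = 0;
        for (int f : addFreqs) sum += f;
        for (int f : delFreqs) sum -= f;
        return sum;
    }

    // Every ADD or DEL position toggles its primary key between
    // "exists" and "does not exist"; keys left in the set exist.
    static Set<Integer> liveKeys(List<Integer> addPositions, List<Integer> delPositions) {
        Set<Integer> live = new TreeSet<>();
        for (int pk : addPositions) toggle(live, pk);
        for (int pk : delPositions) toggle(live, pk);
        return live;
    }

    private static void toggle(Set<Integer> live, int pk) {
        // Set.add returns false when the key was already present.
        if (!live.add(pk)) live.remove(pk);
    }
}
```

With ADD positions 1, 2, 2, 3 and DEL positions 2, 3, key 1 is toggled once, key 2 three times and key 3 twice, so keys 1 and 2 remain.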

To test performance I used the Enron corpus (~500,000 docs), which has a folder structure (3503
nodes, max depth ~6). I ran a query (PrefixQuery) for each level in the hierarchy and saved
the results as a dynamic term.

The results for a TermQuery search for the dynamic term compared to the original query varied
from identical to four times slower, in a shark's tooth pattern that repeats every 125 queries.
The shark's tooth pattern does not match folder depth (cause of shark's tooth is currently

I am currently running a similar test for dynamic terms that have been modified repeatedly.
The setup is as above, but all nodes are first set to the results for the first node, then
all but the first are set to the results for the second, and so on. The last node will have
been modified 3503 times. Modifying this amount of data is slow.
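
To get a feel for the size of this workload, the following sketch counts node-level modifications, assuming round i overwrites nodes i through 3503. The class and method names are made up for illustration, and each counted modification may itself touch many primary keys, which the sketch ignores.

```java
// Hypothetical back-of-the-envelope count of node modifications in the test.
final class ModificationCount {
    // Round r (0-based) overwrites every node from r to the last one.
    static long[] modificationsPerNode(int nodes) {
        long[] counts = new long[nodes];
        for (int round = 0; round < nodes; round++) {
            for (int node = round; node < nodes; node++) {
                counts[node]++;
            }
        }
        return counts;
    }

    static long total(long[] counts) {
        long sum = 0;
        for (long c : counts) sum += c;
        return sum;
    }
}
```

Under that assumption the k-th node is modified k times, so for 3503 nodes the total is 3503 * 3504 / 2 = 6,137,256 node modifications.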

I should be able to release the code if you want a direct comparison. The external APIs
are similar: startBulkLoad(), addTerm(term, primary key), deleteTerm(term, primary key), acceptBulkLoad().
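
To make the shape of that API concrete, here is a minimal in-memory stand-in. Only the four method names come from the message; the class name, the `String`/`int` parameter types, the `docFreq` helper, and the semantics (changes are staged between startBulkLoad() and acceptBulkLoad() and become visible only on accept) are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; not the code under discussion.
final class InMemoryDynamicIndex {
    private static final class Change {
        final boolean add;
        final String term;
        final int primaryKey;
        Change(boolean add, String term, int primaryKey) {
            this.add = add;
            this.term = term;
            this.primaryKey = primaryKey;
        }
    }

    private final Map<String, Set<Integer>> committed = new HashMap<>();
    private List<Change> pending; // null while no bulk load is open

    void startBulkLoad() {
        if (pending != null) throw new IllegalStateException("bulk load already open");
        pending = new ArrayList<>();
    }

    void addTerm(String term, int primaryKey) {
        stage(new Change(true, term, primaryKey));
    }

    void deleteTerm(String term, int primaryKey) {
        stage(new Change(false, term, primaryKey));
    }

    // Apply all staged changes in order (assumed semantics).
    void acceptBulkLoad() {
        if (pending == null) throw new IllegalStateException("no bulk load open");
        for (Change c : pending) {
            Set<Integer> keys = committed.computeIfAbsent(c.term, t -> new HashSet<>());
            if (c.add) keys.add(c.primaryKey); else keys.remove(c.primaryKey);
        }
        pending = null;
    }

    // Number of primary keys currently carrying the term.
    int docFreq(String term) {
        Set<Integer> keys = committed.get(term);
        return keys == null ? 0 : keys.size();
    }

    private void stage(Change c) {
        if (pending == null) throw new IllegalStateException("call startBulkLoad() first");
        pending.add(c);
    }
}
```

A typical call sequence would be startBulkLoad(), a series of addTerm/deleteTerm calls, then acceptBulkLoad().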

> Tag Index
> ---------
>                 Key: LUCENE-1292
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
> The problem the tag index solves is slow field cache loading and range queries, and reindexing
> an entire document to update fields that are not tokenized.
> The tag index holds untokenized terms with a docfreq of 1 in a term dictionary like index
> file.  The file also stores the docs per term, similar to LUCENE-1278.  The index also has
> a transaction log and in memory index for realtime updates to the tags.  The transaction log
> is periodically merged into the existing tag term dictionary index file.
> The TagIndexReader extends IndexReader and is unified with a regular index by ParallelReader.
> There is a doc id to terms skip pointer file for the IndexReader.document method.  This file
> contains a pointer for looking up the terms for a document.
> There is a higher level class that encapsulates writing a document with tag fields to
> IndexWriter and TagIndexWriter.  This requires a hook into IndexWriter to coordinate doc ids
> and flushing segments to disk.
> The writer class could be as simple as:
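> {code}
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.DocIdSetIterator;
>
> public class TagIndexWriter {
>   // Tag each document produced by the iterator with the term (body not yet implemented).
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>   // Remove the term from each document produced by the iterator (body not yet implemented).
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}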
