lucene-java-user mailing list archives

From Alex <>
Subject Re: Storing payloads without term-position and frequency
Date Thu, 03 Feb 2011 20:49:10 GMT
Hello Grant,

I am currently storing only the first term instance because I index
each token of an article just once. What I want to achieve is an index for
versioned document collections like Wikipedia (see this paper

In detail, on the first level (Lucene) I create one document per
Wikipedia article, containing all distinct terms of its versions. On the
second level (payloads) I store the frequency information for each
article version and its terms. When I search, I find an article by its
term, and through the term and its payload I get information about the
other versions and how often a token occurred (in my case, with one
instance per term, the payload position is always 1!). So I look at the
first level and pick only the information I need from the second level.
This way I avoid storing information several times, because most
Wikipedia versions are very similar (in their term content).
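
A minimal sketch of how the second level described above could be packed into a payload, using only the Java standard library. The class name VersionFreqPayload, the method names, and the byte layout (a version count followed by one variable-length integer per version) are assumptions made for illustration; the message does not say how the payload bytes are actually laid out.

```java
import java.io.ByteArrayOutputStream;

/**
 * Hypothetical helper (not part of Lucene): packs the per-version
 * frequencies of one term into a compact byte array that could be
 * stored as the payload of the term's single indexed instance.
 */
final class VersionFreqPayload {

    /** Encodes freqs[i] = occurrences of the term in article version i. */
    static byte[] encode(int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, freqs.length);      // number of versions comes first
        for (int f : freqs) {
            writeVInt(out, f);             // small counts take a single byte
        }
        return out.toByteArray();
    }

    /** Decodes a byte array produced by encode(). */
    static int[] decode(byte[] payload) {
        int[] pos = {0};                   // read cursor shared with readVInt
        int[] freqs = new int[readVInt(payload, pos)];
        for (int i = 0; i < freqs.length; i++) {
            freqs[i] = readVInt(payload, pos);
        }
        return freqs;
    }

    // Variable-length int: 7 data bits per byte, low-order bits first,
    // high bit set on every byte except the last.
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    private static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

For example, VersionFreqPayload.encode(new int[] {3, 0, 300, 1}) describes a term that occurs 3 times in version 0, not at all in version 1, 300 times in version 2 and once in version 3. At search time only the payload of the matching term would need to be decoded to read the count for the version of interest.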

This is working so far, and I just want to reduce my index size, but I
don't know how much I can save by disabling term freqs/positions.
I hope I could explain the problem a little. If not, just tell me and
I will try to explain it again. :)

Best regards

PS: I am currently looking for a bedroom in New York, Brooklyn (Park
Slope or near NYU Poly). Maybe somebody rents a room from 15 Feb until
15 April. :)

On Thursday, 03.02.2011, at 12:38 -0500, Grant Ingersoll wrote:
> Payloads only make sense in terms of specific positions in the index, so I don't think
> there is a way to hack Lucene for it.  You could, I suppose, just store the payload for the
> first instance of the term.
> Also, what's the use case you are trying to solve here?  Why store term frequency as
> a payload when Lucene already does it (and it probably does it more efficiently)?
> -Grant
> On Feb 2, 2011, at 2:35 PM, Alex vB wrote:
> > 
> > Hello everybody,
> > 
> > I am currently using Lucene 3.0.2 with payloads. I store extra information
> > in the payloads about the term like frequencies and therefore I don't need
> > frequencies and term positions stored normally by Lucene. I would like to
> > set f.setOmitTermFreqAndPositions(true) but then I am not able to retrieve
> > payloads. Would it be hard to "hack" Lucene for my needs? Also, I now only
> > store one payload per term, if that information makes it easier.
> > 
> > Best regards
> > Alex
> --------------------------
> Grant Ingersoll
