lucene-java-user mailing list archives

From "Wu, Stephen T., Ph.D." <>
Subject Re: Lucene for a linguistic corpus
Date Wed, 09 Jan 2013 15:10:34 GMT
>> For example, in the phrase "A man saw an elephant", "saw" has the following
>> annotations (say its position in the index is 1234):
>> {lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number:
>> singular}
>> I think it would be more effective to insert the parse index into each
>> attribute's posting list entry as a payload and use it at the intersection
>> stage. E.g., we have a posting list for 'pos = Verb' like
>> ...|...|1.1234|...|..., and a posting list for 'number = Singular' like
>> ...|...|2.1234|...|... While processing a query like
>> 'pos = Verb AND number = Singular', the 'x.1234' entries will be accepted at
>> every stage of posting list processing until the intersection stage, where
>> they will be rejected because their parse indexes do not correspond.
We're working on something very similar.
Are there really posting lists like this (e.g., separate lists for pos=Verb,
number=Singular) for things in Payloads?  I think some previous discussion
was saying this kind of posting list is not available.  I couldn't find
anything like that in the documentation about the index format. If there
are, this would be really efficient.
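
To make the quoted proposal concrete, here is a minimal sketch of that intersection step in plain Java. It does not use any Lucene API: the Posting and ParseAwareIntersection classes are hypothetical, and it simply assumes each attribute has its own list of (position, parse index) entries sorted by position and then by parse index. Whether Lucene can expose posting lists in this form is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical posting entry: a token position plus the parse index carried alongside it. */
final class Posting {
    final int position;   // token position in the index, e.g. 1234
    final int parseIndex; // which parse of that token the attribute belongs to

    Posting(int position, int parseIndex) {
        this.position = position;
        this.parseIndex = parseIndex;
    }
}

final class ParseAwareIntersection {
    /**
     * Intersects two lists sorted by (position, parseIndex). A position is kept
     * only if both attributes occur there within the same parse.
     */
    static List<Integer> intersect(List<Posting> a, List<Posting> b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            Posting x = a.get(i), y = b.get(j);
            int cmp = Integer.compare(x.position, y.position);
            if (cmp == 0) {
                cmp = Integer.compare(x.parseIndex, y.parseIndex);
            }
            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                // Same position and same parse: report the position once.
                if (hits.isEmpty() || hits.get(hits.size() - 1) != x.position) {
                    hits.add(x.position);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "saw" at position 1234: parse 1 = {verb, past}, parse 2 = {noun, singular}
        List<Posting> verb = Arrays.asList(new Posting(1234, 1));
        List<Posting> singular = Arrays.asList(new Posting(1234, 2));
        List<Posting> past = Arrays.asList(new Posting(1234, 1));

        System.out.println(intersect(verb, singular)); // parse indexes differ
        System.out.println(intersect(verb, past));     // both belong to parse 1
    }
}
```

With the annotations from the example, 'pos = Verb AND number = Singular' yields no hit at 1234, while 'pos = Verb AND tense = Past' does.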

> You might be able to insert your parses as payloads on a term and then
> implement a scorer extension (override computePayloadFactor) to handle your
> join cases for a given word.  You may also need to extend PayloadQuery or
> PayloadTermQuery.  Note, I don't know how well this will perform.
We've done it this way before, storing a slightly different set of
information in the Payload.  I thought, though, that making use of a Payload
requires you to iterate through all the tokens, whether in the Analyzer
(i.e., in a TokenFilter) or in the Similarity (in an overridden
scorePayload() method).

If I'm right, then filtering this out at intersection time might not be
quite as efficient as you're talking about, but it's definitely a reasonable
way to do it.
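
For comparison, a rough sketch of the payload-only variant, again in plain Java rather than against the Lucene API. The ParsePayload class, its tag constants, and the two-bytes-per-parse layout are all assumptions made up for illustration. The point it shows is that the per-parse check happens only after the payload of each candidate token has been decoded, which is the per-token work mentioned above.

```java
import java.nio.ByteBuffer;

/** Hypothetical payload layout: two bytes per parse, (pos tag, number tag). */
final class ParsePayload {
    // Illustrative tag ids, not taken from any real tag set.
    static final byte POS_VERB = 1;
    static final byte POS_NOUN = 2;
    static final byte NUM_NONE = 0;
    static final byte NUM_SINGULAR = 1;

    /** Packs every parse of one token into a single payload byte array. */
    static byte[] encode(byte[][] parses) {
        ByteBuffer buf = ByteBuffer.allocate(parses.length * 2);
        for (byte[] parse : parses) {
            buf.put(parse[0]).put(parse[1]);
        }
        return buf.array();
    }

    /** True if a single parse in the payload carries both requested tags. */
    static boolean matches(byte[] payload, byte pos, byte number) {
        for (int i = 0; i + 1 < payload.length; i += 2) {
            if (payload[i] == pos && payload[i + 1] == number) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "saw": parse 1 = verb (no number), parse 2 = singular noun
        byte[] saw = encode(new byte[][] {
            {POS_VERB, NUM_NONE},
            {POS_NOUN, NUM_SINGULAR},
        });
        System.out.println(matches(saw, POS_VERB, NUM_SINGULAR)); // tags come from different parses
        System.out.println(matches(saw, POS_NOUN, NUM_SINGULAR)); // both tags in parse 2
    }
}
```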
