lucene-java-user mailing list archives

From Igor Shalyminov <>
Subject Lucene for a linguistic corpus
Date Sat, 05 Jan 2013 12:36:55 GMT

I'm considering Lucene as an engine for linguistic corpus search.

One requirement of this search is that every word is treated as ambiguous, i.e., it has multiple
sets of grammatical annotations (parses). The number of sets is bounded: a word can
have at most 8 parses.
For example, in the phrase "A man saw an elephant", the word "saw" has the following annotations
(let's also say that its position in the index is 1234):

{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}

Normally, we index each annotation as an independent feature (i.e., there are separate posting
lists for "lemma", "pos", "number", etc.). The problem is that for the query "pos = Verb AND
number = Singular" we DON'T want to find position 1234, because the two values appear in different
parses.
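
A minimal sketch of the problem, in plain Python rather than Lucene code. The names `corpus`, `build_naive_index` and `naive_and` are hypothetical and exist only for illustration; the data is the "saw" example above. It shows how per-attribute posting lists that record only positions produce the unwanted hit:

```python
from collections import defaultdict

# Toy corpus: position -> list of parses (data taken from the "saw" example).
corpus = {
    1234: [
        {"lemma": "see", "pos": "verb", "tense": "past"},
        {"lemma": "saw", "pos": "noun", "number": "singular"},
    ],
}

def build_naive_index(corpus):
    """Map each (attribute, value) pair to the set of positions where it
    occurs in ANY parse. The parse it came from is not recorded."""
    index = defaultdict(set)
    for position, parses in corpus.items():
        for parse in parses:
            for attribute, value in parse.items():
                index[(attribute, value)].add(position)
    return index

def naive_and(index, *terms):
    """Intersect posting lists by position only."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_naive_index(corpus)
hits = naive_and(index, ("pos", "verb"), ("number", "singular"))
# Position 1234 is returned although no single parse is both
# a verb and singular: this is the false positive described above.
```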

As a solution, one might consider indexing all annotation subsets (this would increase index
size and query complexity), searching with regexps (but the search would be dead slow),
or indexing parses rather than words (but queries that constrain the distance between words
would break). None of these solutions is acceptable.

I think it would be more effective to store the parse index as a payload in each entry of an
attribute's posting list and use it at the intersection stage. E.g., we have a posting list for
'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular' like ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = Singular', the 'x.1234' entries are
accepted at every stage of posting list processing until the intersection stage, where they
are rejected because their parse indexes do not match.
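
The idea can be sketched as follows, again in plain Python and not against the Lucene API. The names `build_index` and `parse_aware_and` are hypothetical, and position 1300 is invented test data added only so that the query has one genuine match. Each posting entry carries the parse indexes (numbered from 1, as in the '1.1234' / '2.1234' notation) as its payload, and the intersection keeps a position only if all terms share at least one parse index there:

```python
from collections import defaultdict

# Toy corpus: position -> list of parses.
# 1234 is the "saw" example; 1300 is invented data with one matching parse.
corpus = {
    1234: [
        {"lemma": "see", "pos": "verb", "tense": "past"},
        {"lemma": "saw", "pos": "noun", "number": "singular"},
    ],
    1300: [
        {"lemma": "walk", "pos": "verb", "number": "singular"},
    ],
}

def build_index(corpus):
    """Map (attribute, value) -> {position: set of parse indexes}.
    The set of parse indexes plays the role of the payload."""
    index = defaultdict(lambda: defaultdict(set))
    for position, parses in corpus.items():
        for parse_index, parse in enumerate(parses, start=1):
            for attribute, value in parse.items():
                index[(attribute, value)][position].add(parse_index)
    return index

def parse_aware_and(index, *terms):
    """Return positions where ALL terms occur within the same parse."""
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Stage 1: ordinary intersection on positions.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Stage 2: reject candidates whose parse indexes do not overlap.
    hits = []
    for position in sorted(candidates):
        shared = set.intersection(*(p[position] for p in postings))
        if shared:
            hits.append(position)
    return hits

index = build_index(corpus)
hits = parse_aware_and(index, ("pos", "verb"), ("number", "singular"))
```

Here position 1234 survives the position-only stage but is dropped in the second stage, because 'pos = verb' carries parse index 1 and 'number = singular' carries parse index 2.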

I am also new to Lucene, so could you please tell me whether this idea is implementable in Lucene,
and how much effort the implementation would take?

Best Regards,
Igor Shalyminov
