lucene-java-user mailing list archives

From Bernhard Haslhofer <>
Subject multi-term synonym expansion
Date Tue, 06 Jul 2010 13:02:25 GMT

I am currently working on a Lucene module that makes use of controlled SKOS vocabularies
at both index and search time. It should work similarly to Lucene's WordNet contrib module, but
with some extended SKOS-specific functionality (e.g., support for broader and narrower relationships).
Work is still very much in progress; first results are available here:

My custom SKOSAnalyzer already performs synonym expansion based on the labels defined in a
given SKOS model. My problem now is that real-world thesauri often define (possibly
multi-term) synonyms for multi-term words. Here is an example that defines the abbreviation "UN"
as a synonym for "United Nations":

<skos:Concept rdf:about="">
      <skos:prefLabel>United Nations</skos:prefLabel>
      <skos:altLabel>UN</skos:altLabel>
</skos:Concept>

In the end, the analyzer should add the term UN at the right position in the index. Taking
the example above, the sentence "I work for the United Nations" should appear in the index as

2: [work: 2->6]
5: [united nations: 15->29] [un: 15->29]

so that a query "I work for the UN" also matches the document.

What is the best way to implement this? With a TokenFilter I can work through the sentence
token by token (using incrementToken()) and check whether a synonym is available for each
token. But how can I analyze token sequences in a given text? Do I need to implement a custom
tokenizer that recognizes entities based on a given dictionary?
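
To make the question concrete, here is a minimal sketch of the token-sequence matching I have in mind. It is plain Java with no Lucene dependency; the names MultiTermSynonymSketch, Token and analyze are hypothetical and only serve the illustration. It looks ahead over the word list, prefers the longest phrase found in the dictionary, and emits the phrase and its synonyms at the same position with the same offsets. It uses absolute, 1-based positions and does no stop-word removal, which is an assumption on my side.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiTermSynonymSketch {

    /** One index entry: term text, token position, and character offsets. */
    public static final class Token {
        final String term;
        final int position, start, end;

        Token(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return position + ": [" + term + ": " + start + "->" + end + "]";
        }
    }

    /** Joins count consecutive words starting at index from with single spaces. */
    private static String join(List<Token> words, int from, int count) {
        StringBuilder sb = new StringBuilder(words.get(from).term);
        for (int k = 1; k < count; k++) {
            sb.append(' ').append(words.get(from + k).term);
        }
        return sb.toString();
    }

    /**
     * Splits text into lowercased words, then replaces each dictionary phrase
     * by one token and adds its synonyms at the same position.
     * Assumption: dictionary keys are lowercased and single-space separated.
     */
    public static List<Token> analyze(String text, Map<String, List<String>> synonyms) {
        // Step 1: simple word tokenization that keeps character offsets.
        List<Token> words = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            words.add(new Token(m.group().toLowerCase(Locale.ROOT), 0, m.start(), m.end()));
        }

        // The longest dictionary phrase bounds how far we have to look ahead.
        int maxWords = 1;
        for (String key : synonyms.keySet()) {
            maxWords = Math.max(maxWords, key.split(" ").length);
        }

        // Step 2: greedy longest match over the word sequence.
        List<Token> out = new ArrayList<>();
        int position = 1;
        int i = 0;
        while (i < words.size()) {
            int n = Math.min(maxWords, words.size() - i);
            String candidate = join(words, i, n);
            while (n > 1 && !synonyms.containsKey(candidate)) {
                n--;
                candidate = join(words, i, n);
            }
            int start = words.get(i).start;
            int end = words.get(i + n - 1).end;
            out.add(new Token(candidate, position, start, end));
            List<String> expansions = synonyms.getOrDefault(candidate, Collections.<String>emptyList());
            for (String synonym : expansions) {
                out.add(new Token(synonym, position, start, end)); // same position, same offsets
            }
            position++;
            i += n;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("united nations", Collections.singletonList("un"));
        for (Token t : analyze("I work for the United Nations", synonyms)) {
            System.out.println(t);
        }
    }
}
```

For the sentence above, the last two entries printed are "5: [united nations: 15->29]" and "5: [un: 15->29]". Inside a real TokenFilter the same lookahead would presumably have to buffer tokens, for example with captureState() and restoreState(), and emit each synonym with a position increment of 0 instead of an absolute position; I have not verified whether that is the recommended approach.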

I am grateful for any suggestions or advice.

Thank you,


Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna

Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
