lucene-java-user mailing list archives

From Tomasz Elendt <tomasz.ele...@gmail.com>
Subject high precision CompoundWordTokenFilter
Date Mon, 12 Aug 2019 14:57:09 GMT
Hey,

I'm trying to build a high-precision decompounder. I tried both DictionaryCompoundWordTokenFilter
and HyphenationCompoundWordTokenFilter with a German dictionary (the one prepared by Uwe
Schindler [1]) but noticed one class of false positives that worries me. When I try to extract
compounds from "Klavierkonzert" [2] I get ["klavier", "vier", "konzert"]. I understand that
I get "vier" because it's in the dictionary, but it's not right: in this case "klavier"
already covers that span.
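
To make the false positive concrete, here is a minimal sketch of plain substring-against-dictionary
matching. It is not Lucene's actual code: the class name `NaiveSubwordMatcher`, the three-word
dictionary and the `minSubwordSize` argument are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: emits every dictionary entry
// that occurs anywhere inside the word, scanning start offsets left to right.
class NaiveSubwordMatcher {
    static List<String> subwords(String word, Set<String> dictionary, int minSubwordSize) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            for (int end = start + minSubwordSize; end <= word.length(); end++) {
                String candidate = word.substring(start, end);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumption); a real German word list is far larger.
        Set<String> dictionary = Set.of("klavier", "vier", "konzert");
        // prints [klavier, vier, konzert]
        System.out.println(subwords("klavierkonzert", dictionary, 2));
    }
}
```

"vier" is reported because nothing in this scheme checks whether a match is already contained
in another match.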

I checked the documentation and found the `onlyLongestMatch` parameter, thinking that it might solve my
problem, but unfortunately it doesn't work that way. Lucene's `testHyphenationCompoundWordsDELongestMatch`
demonstrates how it works [3]: a "basketball" match excludes a potential "basket" match,
but not "ball" (similarly, "klavier" does not exclude "vier").
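
The behavior described above is consistent with keeping only the longest match per start offset.
The following sketch assumes exactly that; it is a simplified model, not the filter's real
implementation, and `LongestPerStartMatcher` plus the toy dictionaries are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of "longest match per start offset" (an assumption
// about the semantics, not code taken from Lucene).
class LongestPerStartMatcher {
    static List<String> subwords(String word, Set<String> dictionary, int minSubwordSize) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            String longest = null;
            for (int end = start + minSubwordSize; end <= word.length(); end++) {
                String candidate = word.substring(start, end);
                if (dictionary.contains(candidate)) {
                    longest = candidate; // later hits at the same start are longer
                }
            }
            if (longest != null) {
                found.add(longest);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "basket" is dropped (same start as "basketball"), "ball" survives.
        System.out.println(subwords("basketballkurv",
                Set.of("basket", "basketball", "ball", "kurv"), 2));
        // "vier" starts at a different offset than "klavier", so it survives too.
        System.out.println(subwords("klavierkonzert",
                Set.of("klavier", "vier", "konzert"), 2));
    }
}
```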

Things are even more complicated because I'm not even sure what language I'm dealing with: German
is quite frequent in the corpus, but other languages appear as well. (Also, the documents
are really short and sometimes contain words from multiple languages.)

I thought about applying dictionary-based CompoundWordTokenFilters to the corpus, collecting the
top N compound words, manually filtering out false positives, and translating the results into
a SynonymMap for SynonymFilter. Even the top 1k rules like that should significantly improve recall
for my users without sacrificing precision. But I'm thinking that maybe there's a better way.
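
A rough sketch of that workflow, under stated assumptions: compound frequencies have already been
counted, the manual review has already happened, and the rules are written as text lines in the
`a => b, c` style of the Solr synonyms format. `SynonymRuleWriter` and its methods are hypothetical
names, and the counts below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the most frequent compounds and renders
// one synonym rule line per reviewed compound.
class SynonymRuleWriter {
    // Returns the n most frequent compounds (ties broken alphabetically).
    static List<String> topCompounds(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Keeps the original compound and adds its parts on the right-hand side.
    static String toRule(String compound, List<String> parts) {
        return compound + " => " + compound + ", " + String.join(", ", parts);
    }

    public static void main(String[] args) {
        // Illustrative counts only, not real corpus data.
        Map<String, Integer> counts = Map.of("klavierkonzert", 5, "basketball", 9, "kurv", 1);
        System.out.println(topCompounds(counts, 2));
        System.out.println(toRule("klavierkonzert", List.of("klavier", "konzert")));
    }
}
```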

What if I used a CompoundWordTokenFilter that only emitted a sequence of dictionary items
that reconstruct the given word when concatenated? That would solve the "klavier" problem
(there's no "kla" word in the dictionary, but even if there were, "klavier" is longer than
"kla" and "vier" and would take precedence). Would that make sense?
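
The idea can be sketched outside of Lucene as a longest-first search with backtracking: accept a
split only if the dictionary entries cover the whole word with no gaps or overlaps. This is one
possible reading of the proposal, not an existing filter; `ExactCoverDecompounder` and the toy
dictionaries are assumptions. Linking elements such as the German "s" are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: split a word into dictionary entries that
// concatenate back to the word exactly, preferring longer leading parts.
class ExactCoverDecompounder {
    // Returns the parts, or an empty list if no exact cover exists.
    // A word that is itself in the dictionary comes back as a single part.
    static List<String> split(String word, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        boolean[] deadEnd = new boolean[word.length() + 1];
        if (splitFrom(word, 0, dictionary, parts, deadEnd)) {
            return parts;
        }
        return new ArrayList<>();
    }

    private static boolean splitFrom(String word, int start, Set<String> dictionary,
                                     List<String> parts, boolean[] deadEnd) {
        if (start == word.length()) {
            return true; // the whole word is covered
        }
        if (deadEnd[start]) {
            return false; // already known: the rest cannot be covered from here
        }
        // Try the longest candidate first so "klavier" wins over "kla".
        for (int end = word.length(); end > start; end--) {
            String candidate = word.substring(start, end);
            if (!dictionary.contains(candidate)) {
                continue;
            }
            parts.add(candidate);
            if (splitFrom(word, end, dictionary, parts, deadEnd)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        deadEnd[start] = true;
        return false;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("kla", "klavier", "vier", "konzert");
        // prints [klavier, konzert]
        System.out.println(split("klavierkonzert", dictionary));
    }
}
```

With this scheme "vier" can never be emitted for "klavierkonzert" unless "kla" is in the
dictionary and no longer leading part leads to a full cover.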



[1] https://github.com/uschindler/german-decompounder
[2] https://en.wiktionary.org/wiki/Klavierkonzert#German
[3] https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java#L81