lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@kodapan.se>
Subject Blåbærsyltetøy v.s. Räksmörgås
Date Wed, 22 May 2013 12:37:46 GMT
This is a question (or perhaps a line of thought) regarding the mutually intelligible Scandinavian
languages Danish, Norwegian and Swedish.


The Swedish letters å, ä and ö are in fact the same letters as the Danish/Norwegian å, æ and ø.
A Norwegian writing about the Swedish city of Göteborg writes Gøteborg, and a Swede writing
about Svolvær will write Svolvär. This is easy to fix: I can just index synonyms where ä and ö
are replaced by æ and ø, and vice versa.
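
A minimal sketch of that synonym step, in plain Java without any Lucene dependency. The class
and method names are hypothetical, and the letters are written as Unicode escapes so the code
compiles regardless of source encoding. In a real index the variants would be emitted at the
same token position as the original term.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: produces the Swedish and Danish/Norwegian spellings of a term. */
final class ScandinavianSpellings {

    /** Danish/Norwegian ae-ligature and o-slash become Swedish a-umlaut and o-umlaut. */
    static String toSwedish(String term) {
        return term.replace('\u00e6', '\u00e4').replace('\u00c6', '\u00c4')
                   .replace('\u00f8', '\u00f6').replace('\u00d8', '\u00d6');
    }

    /** Swedish a-umlaut and o-umlaut become Danish/Norwegian ae-ligature and o-slash. */
    static String toDanishNorwegian(String term) {
        return term.replace('\u00e4', '\u00e6').replace('\u00c4', '\u00c6')
                   .replace('\u00f6', '\u00f8').replace('\u00d6', '\u00d8');
    }

    /** The original term plus its spelling in the other convention, without duplicates. */
    static Set<String> variants(String term) {
        Set<String> out = new LinkedHashSet<>();
        out.add(term);
        out.add(toSwedish(term));
        out.add(toDanishNorwegian(term));
        return out;
    }
}
```

With this, variants("Göteborg") contains both Göteborg and Gøteborg, and variants("Svolvær")
contains both Svolvær and Svolvär. The letter å is shared by all three languages and needs no
mapping.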

More problematic, at least in my head, is ASCII-folding.

When Swedes lack the letters å, ä and ö on their keyboard, they consistently type a, a, o
instead. Foreigners also tend to use a, a, o.

In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a, o. I've
also seen oo, ao, etc., and permutations of these. I'm not sure about Denmark, but the pattern
is probably the same. I have no idea what letters foreigners might replace them with.


There's a lot of mismatch here. For a start, ASCIIFoldingFilter translates 'ä' to 'a' but 'æ'
to 'ae', even though they are the same letter. The rest is not aligned with what people
actually type either: 'ø', for example, becomes 'o' rather than the more common 'oe'.
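
To make the mismatch concrete, here is a rough sketch that folds one word under three
conventions. It is plain Java and does not call Lucene. The "ASCIIFoldingFilter-like" table
only mirrors the mappings mentioned above, with å to 'a' and ö to 'o' added as assumptions.
The class name, the table names and the lowercase-only input are all assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical comparison of three ways to fold the Scandinavian letters to ASCII. */
final class FoldingStrategies {

    // Mirrors the ASCIIFoldingFilter mappings described in the message (a-umlaut to a,
    // ae-ligature to ae, o-slash to o); a-ring to a and o-umlaut to o are assumed.
    static final Map<Character, String> LUCENE_LIKE = table("a", "a", "ae", "o", "o");

    // What a Swedish typist (and many foreigners) reportedly type: a, a, o.
    static final Map<Character, String> SINGLE_VOWEL = table("a", "a", "a", "o", "o");

    // What a Norwegian typist reportedly types: aa, ae, oe.
    static final Map<Character, String> DIGRAPH = table("aa", "ae", "ae", "oe", "oe");

    static Map<Character, String> table(String aRing, String aUmlaut, String aeLigature,
                                        String oUmlaut, String oSlash) {
        Map<Character, String> m = new HashMap<>();
        m.put('\u00e5', aRing);      // a with ring
        m.put('\u00e4', aUmlaut);    // a with umlaut
        m.put('\u00e6', aeLigature); // ae ligature
        m.put('\u00f6', oUmlaut);    // o with umlaut
        m.put('\u00f8', oSlash);     // o with slash
        return m;
    }

    /** Folds a term that is assumed to be lowercased already. */
    static String fold(String term, Map<Character, String> table) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String replacement = table.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For blåbærsyltetøy the three tables give blabaersyltetoy, blabarsyltetoy and
blaabaersyltetoey respectively, so a query typed in one convention does not match a term
indexed with another.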


I'm considering:

* Forking ASCIIFoldingFilter to support a number of folding strategies, and indexing the
resulting permutations as synonyms.
or
* Using a filter after ASCIIFoldingFilter that collapses every ae, oe, oo and other
combination of double vowels, keeping only the first vowel.
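
The second option could look roughly like the sketch below. It is plain Java operating on an
already folded, lowercased string; in an analysis chain the same logic would sit in a token
filter placed after ASCIIFoldingFilter. The class name is hypothetical, and treating only
a, e, i, o and u as vowels is an assumption (y is left alone here).

```java
/** Hypothetical post-folding step: keeps only the first vowel of each run of vowels. */
final class DoubleVowelCollapser {

    // Assumed vowel set; y is deliberately not included in this sketch.
    private static final String VOWELS = "aeiou";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    /** Drops every vowel that directly follows another vowel in the input. */
    static String collapse(String folded) {
        StringBuilder out = new StringBuilder(folded.length());
        boolean previousWasVowel = false;
        for (int i = 0; i < folded.length(); i++) {
            char c = folded.charAt(i);
            boolean vowel = isVowel(c);
            if (!(vowel && previousWasVowel)) {
                out.append(c);
            }
            previousWasVowel = vowel;
        }
        return out.toString();
    }
}
```

Under these assumptions blaabaersyltetoey, blabaersyltetoy and blabarsyltetoy all collapse to
blabarsyltetoy, and raeksmoergaas and raksmorgas both collapse to raksmorgas. The price is
that genuine double vowels are collapsed too, so unrelated words may end up as the same term.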



Has anyone else thought about this?



			karl

