lucene-solr-user mailing list archives

From climbingrose <>
Subject Re: Accented search
Date Wed, 12 Mar 2008 02:01:40 GMT
Hi Peter,

It looks like a very promising approach for us. I'm going to implement a
custom Tokeniser based on your suggestions and see how it goes. Thank you
all for your comments!
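
To make the idea concrete, here is a minimal sketch of the zero-position-increment approach Peter describes in the quoted message. It is plain Python rather than Lucene code, and it does not use any Lucene or Solr API: the names strip_accents, analyze, build_positions and matches are hypothetical helpers made up for illustration, and whitespace splitting stands in for a real tokeniser. Matching here is a simple "all words present" check, not Lucene's scoring.

```python
import unicodedata


def strip_accents(term):
    """Remove combining marks after canonical decomposition (NFD).

    Letters with no decomposition, such as Vietnamese "đ", pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Return (term, position_increment) pairs for whitespace-separated words."""
    tokens = []
    for word in text.split():
        # Lower-case first, then normalise to NFC so equal strings compare equal.
        term = unicodedata.normalize("NFC", word.lower())
        tokens.append((term, 1))
        stripped = strip_accents(term)
        if stripped != term:
            # Stripped variant shares the original's position: increment of zero.
            tokens.append((stripped, 0))
    return tokens


def build_positions(text):
    """Map each indexed term to the list of positions where it occurs."""
    positions = {}
    position = -1
    for term, increment in analyze(text):
        position += increment
        positions.setdefault(term, []).append(position)
    return positions


def matches(query, document):
    """True if every query word is an indexed term (accents kept in the query)."""
    indexed = build_positions(document)
    words = [unicodedata.normalize("NFC", w.lower()) for w in query.split()]
    return all(word in indexed for word in words)


doc_a = "Lập Trình Viên"
doc_b = "Lap Trinh Vien"
print(build_positions(doc_a))
print(matches("Lập Trình Viên", doc_a), matches("Lập Trình Viên", doc_b))
print(matches("Lap Trinh Vien", doc_a), matches("Lap Trinh Vien", doc_b))
```

In this sketch the accented query matches only Doc A, while the unaccented query matches both Doc A and Doc B. It also reproduces the limitation Peter mentions: a partially accented query such as "éleve" matches neither "élève" nor "eleve", because only the fully accented and fully stripped forms are indexed.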


On Wed, Mar 12, 2008 at 2:37 AM, Binkley, Peter <> wrote:

> We've done this in a pre-Solr Lucene context by using the position
> increment: when a token contains accented characters, you add a stripped
> version of that token with a zero increment, so that for matching purposes
> the original and the stripped version are at the same position. Accents are
> not stripped from queries. The effect is that an accented search matches
> your Doc A, and an unaccented search matches Docs A and B. We do that after
> lower-casing the token.
> There are some limitations: users might start to expect that they can
> freely add accents to restrict their search to accented hits, but if they
> don't match the accents exactly they won't get any hits: e.g. if a word
> contains two accented characters and the user only accents one of them in
> their query, they won't match the accented or the unaccented version.
> Peter
> Peter Binkley
> Digital Initiatives Technology Librarian
> Information Technology Services
> 4-30 Cameron Library
> University of Alberta Libraries
> Edmonton, Alberta
> Canada T6G 2J8
> Phone: (780) 492-3743
> Fax: (780) 492-9243
> e-mail:
> ~ The code is willing, but the data is weak. ~
> -----Original Message-----
> From: climbingrose []
> Sent: Monday, March 10, 2008 10:01 PM
> To:
> Subject: Accented search
> Hi guys,
> I'm running into some problems with accented (UTF-8) languages. I'd love to
> hear some ideas about how to use Solr with those languages. Basically, I
> want to achieve what Google did with UTF-8 languages.
> My requirements include:
> 1) Accent insensitive search and proper highlighting:
>  For example, we have 2 documents:
>  Doc A (title:Lập Trình Viên)
>  Doc B (title:Lap Trinh Vien)
>  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
> Trình Viên" is highlighted.
>  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
> matched.
> 2) Assign proper scores to accented or non-accented searches:
>  if the user enters "Lập Trình Viên", then Doc A should be given a higher
> score than Doc B.
>  if the query is "Lap Trinh Vien", Doc A should be given a higher score.
> Any ideas guys? Thanks in advance!
> --
> Regards,
> Cuong Hoang


Cuong Hoang