lucene-dev mailing list archives

From "Nadav Har'El" <>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Wed, 01 Oct 2008 20:02:40 GMT
On Tue, Sep 30, 2008, Robert Muir wrote about "Re: [jira] Commented: (LUCENE-1406) new Arabic
Analyzer (Apache license)":
> Thanks for clarification. With this method arabic analyzer could lemmatize,
> not stem, using buckwalter dictionary, and things like broken plural will
> work correctly.
> I'm not sure yet if hspell has this type of information, but it would at
> least be a better stem for hebrew as well.

Indeed Hspell also has this information. You can see for example
(but you'll need to be able to read Hebrew to understand what this means).

But one thing to remember is that if you use Hspell, or basically any other
dictionary, you are committing yourself to a particular vocabulary and a
particular spelling of it. If your stemmer comes across a word outside your
vocabulary, or spelled a bit differently, it won't know what to do with it.
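
To make the limitation concrete, here is a minimal sketch of a dictionary-backed lemmatizer. Everything in it is an assumption for illustration: the toy English lexicon and the names `LEXICON`, `lemmatize` and `analyze` are hypothetical, and none of this is Hspell's or Lucene's actual API. A real lexicon would be generated from a dictionary such as Hspell or Buckwalter.

```python
from typing import Dict, Optional

# Hypothetical toy lexicon mapping surface forms to lemmas.
# Irregular plurals stand in for cases a suffix-stripping stemmer gets wrong.
LEXICON: Dict[str, str] = {
    "mice": "mouse",
    "geese": "goose",
    "colour": "colour",
}


def lemmatize(word: str, lexicon: Dict[str, str] = LEXICON) -> Optional[str]:
    """Return the lemma of a known surface form, or None if it is unknown."""
    return lexicon.get(word)


def analyze(word: str, lexicon: Dict[str, str] = LEXICON) -> str:
    """Return the lemma when the word is known, else the word unchanged."""
    lemma = lemmatize(word, lexicon)
    return lemma if lemma is not None else word
```

In this sketch "mice" maps to "mouse", but the variant spelling "color" is not in the lexicon, so `lemmatize` returns `None` and `analyze` can only pass the word through unchanged. It will then not match documents indexed under "colour".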

This problem is particularly visible in Hebrew, because its unvowelled
spelling standard (defined by the Academy of the Hebrew Language) is
not very well known. When I was in school, twenty years ago, it wasn't
even mentioned, let alone taught! As a result, some words have a few spelling
variants in the wild, with each dictionary typically considering one correct
and the others misspellings.

Nadav Har'El                        |    Wednesday, Oct  1 2008, 3 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |The two most common elements in the
                                    |universe are hydrogen and stupidity.
