lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: NGram Language Categorization Source
Date Sat, 20 Aug 2005 20:25:08 GMT
Kevin Burton wrote:
> Hey lucene guys.
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> Anyway.  This new library that I just released should be easy to tie
> into your lucene indexers.  Just use the library on a text (strip the
> HTML) and then create a new field in Lucene called LANG (or something)
> and then create a filter before you search with JUST that language
> code.
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That would help make up for all the hard work I've done
> here (nudge.. nudge)
> I did a full research of the lang categorization space for Java and I
> think this is basically the only library out there.

Erhm... Not to rain on your parade, but Googling for "ngram java" gives 
a lot of hits. and also 
"languageidentifier" in Nutch are two examples of Open Source Java 
implementations. Each can be used with Lucene.

A lot depends on the reference profiles, which in turn depend on the 
quality of your training corpus. In this case your corpus is not the 
best choice, because each text contains a lot of foreign words. It was 
also found that the way you create ngram profiles (e.g. with or without 
surrounding spaces, single length or mixed length) affects language 
identification (LI) performance. For documents with mixed languages it 
was also found that methods that combine ngrams with stopwords work 
better.
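
To make the profile-building choices above concrete, here is a minimal 
sketch in Java. It is not taken from Nutch or from Kevin's library; the 
class name NGramProfile, the method build, and the '_' padding character 
are all assumptions made for illustration. It counts character ngrams 
per word, with or without surrounding padding, for a single length 
(minN == maxN) or a mixed range of lengths.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
class NGramProfile {

    /**
     * Counts character ngrams of lengths minN..maxN for each word in the text.
     * When padWords is true, each word is wrapped in '_' so that word
     * boundaries show up in the ngrams (the "surrounding spaces" variant).
     */
    public static Map<String, Integer> build(String text, int minN, int maxN,
                                             boolean padWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on any run of non-letter characters.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            String token = padWords ? "_" + word + "_" : word;
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= token.length(); i++) {
                    counts.merge(token.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For example, NGramProfile.build("the cat", 3, 3, true) yields the six 
trigrams _th, the, he_, _ca, cat, at_, while the same call with 
padWords set to false yields only the and cat. Short words therefore 
contribute much less to an unpadded profile, which is one reason the 
choice matters.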

Additionally, simple methods based on cosine similarity (or delta 
ranking) don't give correct results for documents with mixed languages. 
In such cases input texts are chunked, and each chunk is analyzed 
separately, and then the scores are combined... etc, etc... millions of 
ways you can do this - and of course no method is perfect. :-)
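
As a rough sketch of the chunking idea, assuming fixed-size character 
chunks, raw trigram counts, and a simple "one vote per chunk" way of 
combining the scores (only one of the many possible ways): the class 
ChunkedLanguageId and its methods are hypothetical names, not an API 
from Nutch or any other library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class ChunkedLanguageId {

    /** Raw character trigram counts (no padding or normalization). */
    public static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two sparse count vectors; 0.0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * Splits the text into chunks of chunkSize characters, scores every chunk
     * against every reference profile, and counts how many chunks each
     * language wins. Chunks that match no profile at all cast no vote.
     */
    public static Map<String, Integer> chunkVotes(
            String text, int chunkSize, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> votes = new HashMap<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(text.length(), start + chunkSize);
            Map<String, Integer> chunkProfile = trigrams(text.substring(start, end));
            String best = null;
            double bestScore = 0.0;
            for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
                double score = cosine(chunkProfile, p.getValue());
                if (score > bestScore) {
                    bestScore = score;
                    best = p.getKey();
                }
            }
            if (best != null) {
                votes.merge(best, 1, Integer::sum);
            }
        }
        return votes;
    }
}
```

A document whose votes are split between two languages can then be 
flagged as mixed instead of being forced into a single label. Summing 
the per-chunk scores, or weighting chunks by length, would be equally 
plausible ways to combine them.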

So, there is still a lot to do in this area, if you come up with some 
unique way of improving LI performance...

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
