lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki>
Subject Re: NGram Language Categorization Source
Date Mon, 22 Aug 2005 07:51:43 GMT
Kevin Burton wrote:

>>A lot depends on the reference profiles (which in turn depend on the
>>quality of your training corpus - in this case, your corpus is not the
>>best choice, because each text contains a lot of foreign words).
> I realize that my corpus isnt' the best.  That's one of the reason's
> I've open source'd it.  The main improvement in ngramcat (my code) is
> that if the result isn't obvious we throw an Exception so
> theoreticallyi we won't see any false positives unless the language
> categorization is WAY off.

That's also how other implementations do it - you need to set an 
arbitrary threshold, and if the profiles score below that threshold then 
an "unknown" value is returned (or null, or Exception).

>>It was
>>also found that the way you create ngram profiles (e.g. with or without
>>surrounding spaces, single length or mixed length) affects the LI
> LI???

LI = Language Identification. Sorry for the confusion.

> I haven't benchmarked it but I'd be interested in any suggestions you have.
>>For documents with mixed languages it was also found that
>>methods, which combine ngrams with stopwords, work better.
> Hm.. interesting.. where?  URL I can reads?

Someone mentioned the Linguini paper, where they found that using "short 
words" features gives similarly good performance as using ngrams.

See also the following papers :

In general, using stop words works only for texts above certain minimum 
length (greater than with n-gram methods), and then gives nearly 100% 

>>So, there is still a lot to do in this area, if you come up with some
>>unique way of improving LI performance...
> Maybe I'm being dense but what is LI performance?

Language Identification performance - in the sense that a given 
identifier "performs" better if it can correctly identify more 
languages, using shorter input text.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message