mahout-user mailing list archives

From Dean Jones <dean.m.jo...@gmail.com>
Subject Re: Naive bayes and character n-grams
Date Thu, 10 Oct 2013 12:15:07 GMT
On 10 October 2013 12:46, Ted Dunning <ted.dunning@gmail.com> wrote:
> For language detection, you are going to have a hard time doing better than
> one of the standard packages for the purpose.  See here:
>
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>

Thanks for the pointer, Ted. I'm a big fan of the Tika project; we
already use it for content extraction. For various reasons, though, we
have rolled our own language detector, mainly because neither of these
packages covers all of the languages we need to identify:
language-detection doesn't do Catalan, and Tika doesn't do Welsh.

Dean.
