mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Naive bayes and character n-grams
Date Thu, 10 Oct 2013 13:16:14 GMT
Cool. Sounds like you are ahead of the game.  

Sent from my iPhone

On Oct 10, 2013, at 13:15, Dean Jones <dean.m.jones@gmail.com> wrote:

> On 10 October 2013 12:46, Ted Dunning <ted.dunning@gmail.com> wrote:
>> For language detection, you are going to have a hard time doing better than
>> one of the standard packages for the purpose.  See here:
>> 
>> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
> 
> Thanks for the pointer, Ted. I'm a big fan of the Tika project; we use
> it for content extraction already. For various reasons, though, we have
> rolled our own language detector (mainly because neither of these packages
> covers all of the languages we need to identify: language-detection
> doesn't do Catalan, and Tika doesn't do Welsh).
> 
> Dean.
