mahout-user mailing list archives

From Dean Jones
Subject Re: Naive bayes and character n-grams
Date Thu, 10 Oct 2013 08:01:32 GMT
Hi Si,

On 10 October 2013 07:59, <> wrote:
> what do you mean by character n-grams? If you mean things like "&ab" or
> "ui2" then given that there are so few characters compared to words is
> there a problem that can't be solved without a look-up table for n<y (where
> y <4ish )
> Or are you looking at y >4 ish because if so then do you run into the
> issue of a sudden space explosion?

Yes, just the tokens in a text broken up into sequences of their constituent
characters. In my initial tests, language detection works well with n=3,
particularly when the head and tail bigrams are included. So I need something
to generate the required sequence files from my training data.
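
For illustration only, here is a minimal Python sketch of the kind of
character n-gram generation described above. It is not Mahout code and does
not write sequence files. The function names are hypothetical, and it assumes
that "head and tail bigrams" means the first two and last two characters of
each token, which the message does not spell out.

```python
def char_ngrams(token, n=3, edge_bigrams=True):
    """Return the character n-grams of a single token, in order."""
    # Slide a window of width n across the token.
    grams = [token[i:i + n] for i in range(len(token) - n + 1)]
    if edge_bigrams and len(token) >= 2:
        # Assumption: the head bigram is the first two characters and
        # the tail bigram is the last two characters of the token.
        grams = [token[:2]] + grams + [token[-2:]]
    return grams


def text_to_ngrams(text, n=3):
    """Lowercase the text, split on whitespace, and collect n-grams per token."""
    grams = []
    for token in text.lower().split():
        grams.extend(char_ngrams(token, n))
    return grams


if __name__ == "__main__":
    print(char_ngrams("hello"))
    print(text_to_ngrams("The cat"))
```

Each training document would then be represented by its list of n-grams
rather than by its word tokens before being handed to the classifier.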
