mahout-user mailing list archives

From Brian Rogoff <>
Subject Feature weighting
Date Sun, 08 Dec 2013 17:09:42 GMT
    I'm training a naive bayes classifier on some structured documents. The
documents have several fields: title, description, body, breadcrumb, etc.
I'd like to weight the tokens from the different fields. For example, let's
say I just use the title and body fields; I'd like tokens from the title to
be weighted three times as much as the body tokens. Right now, when I'm
creating the sequence files I would just concatenate the title string three
times with the body string and let seq2sparse do its work. Is there a
better way?
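
To make the workaround concrete, here is a minimal Python sketch (not Mahout code) of why repeating a field acts as an integer weight on raw term counts. The function names, the whitespace tokenizer, and the sample document are assumptions made up for illustration; how the repeated counts interact with any later weighting step in seq2sparse is not covered here.

```python
from collections import Counter


def weighted_term_counts(fields, weights):
    """Count tokens per document, adding each field's weight per occurrence."""
    counts = Counter()
    for name, text in fields.items():
        w = weights.get(name, 1)  # fields without an explicit weight count once
        for token in text.lower().split():  # naive whitespace tokenizer (assumption)
            counts[token] += w
    return counts


def repeat_concat(fields, weights):
    """The workaround: repeat each field's text `weight` times and concatenate."""
    parts = []
    for name, text in fields.items():
        parts.extend([text] * weights.get(name, 1))
    return " ".join(parts)


# Hypothetical document and weights, for illustration only.
doc = {"title": "Feature weighting", "body": "weighting tokens by field"}
weights = {"title": 3, "body": 1}

by_weight = weighted_term_counts(doc, weights)
by_repeat = Counter(repeat_concat(doc, weights).lower().split())
```

Under these assumptions both routes give the same raw counts, so the concatenation trick is equivalent to an integer per-field weight; it cannot express fractional weights, and it inflates the document length.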

    One possibility would be to preserve some structure in the value field
of the sequence file, perhaps using '|' or ';' to separate fields, and then
pass a special analyzer which understands this syntax to seq2sparse. Is that
a sensible approach? What do others do in this situation? Thanks in advance.
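
As a rough sketch of the parsing logic such an analyzer would need, the Python below splits a '|'-separated value into fields and emits each field's tokens as many times as its weight. It is not a Lucene analyzer and is not something seq2sparse accepts as written; `tokenize_structured`, the field order, and the whitespace tokenizer are all hypothetical, and the sketch assumes the separator never occurs inside a field.

```python
def tokenize_structured(value, field_names, weights, sep="|"):
    """Split a delimited value into fields and repeat tokens by field weight."""
    parts = value.split(sep)
    if len(parts) != len(field_names):
        raise ValueError(
            "expected %d fields, got %d" % (len(field_names), len(parts))
        )
    tokens = []
    for name, text in zip(field_names, parts):
        field_tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        # Emitting a field's tokens `weight` times mimics an integer field weight.
        tokens.extend(field_tokens * weights.get(name, 1))
    return tokens


# Hypothetical value as it might be stored in the sequence file.
value = "Feature weighting|tokens by field"
tokens = tokenize_structured(value, ["title", "body"], {"title": 3})
```

Compared with concatenating the title three times up front, this keeps the stored value compact and lets the weights change without regenerating the sequence files, at the cost of writing and maintaining a custom analyzer.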

-- Brian
