mahout-user mailing list archives

From Frank Scholten <>
Subject Re: Vectorizing arbitrary value types with seq2sparse
Date Fri, 06 May 2011 20:52:59 GMT
Hmm, this seems more complex than I thought. I had a simple approach in
mind, where you could configure your own class that concatenates the
desired fields into one Text value and then let
SequenceFileTokenizerMapper process that value.

But could this give unexpected results? I guess it may produce incorrect
n-grams that span tokens from different fields.
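
To make the concern concrete, here is a minimal Python sketch. It is not
Mahout code: the article fields, the whitespace tokenizer and the
bigrams helper are all assumptions made up for illustration. It compares
the bigrams found in the concatenated value with the bigrams found when
each field is tokenized on its own.

```python
def tokenize(text):
    # Naive lowercase whitespace tokenizer; a stand-in for a real analyzer.
    return text.lower().split()


def bigrams(tokens):
    # Adjacent token pairs joined with a space.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical blog article with several textual fields.
article = {
    "title": "Scaling Mahout",
    "content": "Clustering large datasets",
    "tags": "hadoop",
}

# Approach from the mail: concatenate all fields into one text value.
concatenated = " ".join(article.values())
concat_bigrams = set(bigrams(tokenize(concatenated)))

# Alternative: generate bigrams per field so none cross a field boundary.
per_field_bigrams = set()
for value in article.values():
    per_field_bigrams.update(bigrams(tokenize(value)))

# Bigrams that exist only because two fields were glued together.
spurious = concat_bigrams - per_field_bigrams
print(sorted(spurious))
```

With these made-up values the spurious set holds the pairs that straddle
the title/content and content/tags boundaries. Generating n-grams per
field, or putting a token the n-gram step refuses to cross between
fields, would avoid them.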

On Fri, May 6, 2011 at 10:17 PM, Ted Dunning <> wrote:
> This is definitely desirable but is very different from the current tool.
> My guess is the big difficulty will be describing the vectorization to be
> done.  The hashed representations would make that easier, but still not
> trivial.  Dictionary based methods add multiple dictionary specifications
> and also require that we figure out how to combine vectors by concatenation
> or overlay.
> On Fri, May 6, 2011 at 1:02 PM, Frank Scholten <> wrote:
>> Hi everyone,
>> At the moment seq2sparse can generate vectors from sequence values of
>> type Text. More specifically, SequenceFileTokenizerMapper handles Text
>> values.
>> Would it be useful if seq2sparse could be configured to vectorize
>> value types such as a Blog article with several textual fields like
>> title, content, tags and so on?
>> Or is it easier to create a separate job for this or use Pig or
>> anything like that?
>> Frank
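
On the hashed representation and the "concatenation or overlay" question
in the quoted reply, a rough Python sketch of the two ways of combining
per-field vectors might look like this. It is only an illustration under
assumptions: bucket, hash_field, concatenate and overlay are hypothetical
helpers, not part of Mahout, and the vector size is arbitrary.

```python
import hashlib


def bucket(key, size):
    # Stable hash of a string into [0, size); hashlib keeps it reproducible.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def hash_field(field, text, size):
    # Hashed bag of words for one field. The field name is prefixed to each
    # token so the same word in two fields hashes independently.
    vec = [0.0] * size
    for token in text.lower().split():
        vec[bucket(field + ":" + token, size)] += 1.0
    return vec


def concatenate(vectors):
    # Each field gets its own block of dimensions.
    out = []
    for vec in vectors:
        out.extend(vec)
    return out


def overlay(vectors):
    # All fields share the same dimensions; counts are summed element-wise.
    return [sum(column) for column in zip(*vectors)]


# Hypothetical blog article and an arbitrary per-field vector size.
article = {
    "title": "Scaling Mahout",
    "content": "Clustering large datasets",
    "tags": "hadoop",
}
size = 16

field_vectors = [hash_field(name, text, size) for name, text in article.items()]
concatenated = concatenate(field_vectors)  # length 3 * size
overlaid = overlay(field_vectors)          # length size
```

Concatenation keeps fields separate at the cost of a wider vector, while
overlay keeps the width fixed and relies on the field prefix to keep
collisions between fields rare. No dictionary is needed in either case,
which is presumably why hashing would make the description easier.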
