mahout-user mailing list archives

From Haddad Said
Subject Representing key value dataset into Mahout vector
Date Thu, 10 Jan 2013 06:10:24 GMT

I have a huge data set in CSV format consisting of key-value pairs. The
values are a mixture of integers and short strings (i.e. keywords rather
than lengthy text), and I want to process it with Mahout's clustering
algorithms.

The issue is converting this CSV into vectors that Mahout can consume. I
have been reading "Mahout in Action", and there seem to be two options for
vectorizing: either represent numeric values with Mahout's DenseVector,
RandomAccessSparseVector, or SequentialAccessSparseVector implementations,
or use the Vector Space Model to vectorize text documents.

The data I want to vectorize is not really a text document, but because it
is a huge data set with many different keys and values, it is difficult to
map it to numeric values. What is the best way to vectorize this kind of
data for use in Mahout?
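
To make the mapping problem concrete, here is a minimal sketch in plain Java
of one possible encoding, not a Mahout API and not a recommended answer. It
assumes each record is available as a map of non-null strings. The class
name KeyValueVectorizer and its methods are hypothetical. Integer values get
one dimension per key, and short string values get one dimension per
(key, value) pair with weight 1.0.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, written for illustration only.
public class KeyValueVectorizer {
    // Dictionary from feature name to vector index, grown as new features appear.
    private final Map<String, Integer> dictionary = new HashMap<>();

    // Returns the index of a feature, assigning the next free index if it is new.
    private int indexOf(String feature) {
        Integer idx = dictionary.get(feature);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(feature, idx);
        }
        return idx;
    }

    // Encodes one record as a sparse vector, represented here as index -> weight.
    public Map<Integer, Double> encode(Map<String, String> record) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (Map.Entry<String, String> entry : record.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue().trim();
            try {
                // Integer value: one dimension per key, holding the number itself.
                sparse.put(indexOf(key), (double) Long.parseLong(value));
            } catch (NumberFormatException notAnInteger) {
                // Short string: one dimension per (key, value) pair, set to 1.0.
                sparse.put(indexOf(key + "=" + value), 1.0);
            }
        }
        return sparse;
    }

    // Number of distinct dimensions seen so far.
    public int cardinality() {
        return dictionary.size();
    }

    public static void main(String[] args) {
        KeyValueVectorizer vectorizer = new KeyValueVectorizer();
        Map<String, String> record = new LinkedHashMap<>();
        record.put("age", "42");
        record.put("color", "red");
        System.out.println(vectorizer.encode(record));
    }
}
```

The index and weight pairs could then be copied into a sparse vector such
as RandomAccessSparseVector. The weakness of this sketch is the one
described above: the dictionary has to be built and held in memory, which
gets difficult with many different keys and values.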

Any pointers would be appreciated.

