mahout-user mailing list archives

From Haddad Said <>
Subject Re: Representing key value dataset into Mahout vector
Date Sat, 12 Jan 2013 23:21:17 GMT
Hi Ted

Thanks for the response. I had a quick look at chapter 14, and that part of
the book is about classification, i.e. supervised learning that involves
training. I want to run an unsupervised learning algorithm on the data,
since I don't have any training data. That is why I was looking at clustering.

Actually, from my reading, it seems that Apriori or FP-growth would be the
most useful algorithms for extracting information from this data, but they
do not appear to be implemented in Mahout yet. So I guess the question is:
given data in key-value pairs where both keys and values are strings, which
unsupervised algorithms in Mahout can I use to learn about this data?

Many thanks


On 10 January 2013 07:05, Ted Dunning <> wrote:

> Look at the last third of the book, especially chapter 14.
> One important thing to check is whether your integers represent codes or
> actually represent numbers.  Codes should be encoded as key words.
> Hashed vector encoding should work quite well.
> On Wed, Jan 9, 2013 at 10:10 PM, Haddad Said <> wrote:
> > Hi,
> >
> > I have a CSV data set made up of key-value pairs. The data set is huge,
> > and the values are a mixture of integers and short strings (i.e. key
> > words rather than lengthy text). I want to process it using Mahout's
> > clustering algorithms.
> >
> > The issue is converting this CSV into vectors that Mahout can consume.
> > I have been reading "Mahout in Action" and there seem to be two options
> > for vectorizing: using numeric values with Mahout's DenseVector,
> > RandomAccessSparseVector, and SequentialAccessSparseVector
> > implementations, or using the Vector Space Model to vectorize text
> > documents.
> >
> > The data I want to vectorize is not really a text document, but because
> > it is a huge data set with many different keys and values, it is
> > difficult to map it to numeric values. What is the best way to vectorize
> > this kind of data for use in Mahout?
> >
> > Any pointers would be appreciated.
> >
> > Thanks
> >
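
The hashed vector encoding Ted mentions can be sketched roughly as follows.
This is a minimal, library-free illustration in Python, not Mahout's API:
the function names, the vector size, and the `numeric_keys` parameter are all
hypothetical, and in practice Mahout's own feature encoders would take the
place of this code. It assumes each record is a dictionary of string keys to
string values. Values that are codes or key words get a slot chosen by hashing
"key=value"; values that are real quantities get a slot chosen by hashing the
key alone, weighted by the number.

```python
import hashlib


def hashed_index(name, size):
    """Map a feature name to a slot in [0, size) with a stable hash."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode_record(record, size=64, numeric_keys=frozenset()):
    """Encode one key-value record as a fixed-length list of floats.

    numeric_keys (an assumption of this sketch) names the keys whose
    values are true quantities; every other value is treated as a code.
    """
    vec = [0.0] * size
    for key, value in record.items():
        if key in numeric_keys:
            # A real number: one slot per key, weighted by the value.
            vec[hashed_index(key, size)] += float(value)
        else:
            # A code or key word: one slot per (key, value) pair, weight 1.
            vec[hashed_index(f"{key}={value}", size)] += 1.0
    return vec


row = {"country": "UK", "zip": "90210", "age": "31"}
vector = encode_record(row, size=32, numeric_keys={"age"})
```

Here "zip" is deliberately left out of `numeric_keys`: it is an integer in the
CSV but acts as a code, so it is hashed as a key word. Distinct features can
collide in the same slot; a larger vector makes that less likely at the cost
of more memory.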
