mahout-user mailing list archives

From Suneel Marthi <smar...@apache.org>
Subject Re: Row Similarity
Date Thu, 14 May 2015 18:31:32 GMT
There used to be an online page on mahout.apache.org that Pat Ferrel put
together a few years ago.
Not sure if it's still around, Pat?

If not, I can write up more detailed steps later today and send them your way.
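
In the meantime, here is a rough, small-scale sketch in plain Python of what the three steps quoted below compute: sparse rows built from the CSV, integer row ids with an id-to-name index, then thresholded cosine similarity over all row pairs. It is not Mahout code and does not use SequenceFiles or MapReduce. The function names, the sample data, and the assumption that the CSV has no header row are all made up for illustration.

```python
import csv
import io
import math
from collections import defaultdict
from itertools import combinations


def load_rows(csv_file):
    """Step 1 analogue: read (star_id, wavelength, intensity) triples into sparse rows."""
    rows = defaultdict(dict)
    for star_id, wavelength, intensity in csv.reader(csv_file):
        rows[star_id][float(wavelength)] = float(intensity)
    return rows


def index_rows(rows):
    """Step 2 analogue: assign integer row ids and keep an id -> star_id index."""
    doc_index = dict(enumerate(sorted(rows)))
    matrix = {i: rows[star] for i, star in doc_index.items()}
    return matrix, doc_index


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def row_similarity(matrix, threshold=0.0):
    """Step 3 analogue: similarity for every pair of rows, keeping those at or above the threshold."""
    sims = {}
    for i, j in combinations(sorted(matrix), 2):
        s = cosine(matrix[i], matrix[j])
        if s >= threshold:
            sims[(i, j)] = s
    return sims


# Hypothetical sample data: three stars, no header row.
SAMPLE = """s1,400,1.0
s1,500,2.0
s2,400,2.0
s2,500,4.0
s3,400,1.0
s3,600,1.0
"""

matrix, doc_index = index_rows(load_rows(io.StringIO(SAMPLE)))
similar = row_similarity(matrix, threshold=0.5)
for (i, j), s in sorted(similar.items()):
    print(doc_index[i], doc_index[j], round(s, 3))
```

With this sample only the s1/s2 pair clears the 0.5 threshold, since their spectra are proportional. The all-pairs loop is quadratic in the number of stars, so it only works for small inputs; at scale that pairwise step is what the distributed job is for.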

On Thu, May 14, 2015 at 2:18 PM, Jonathan Seale <jonathanpseale@gmail.com>
wrote:

> Thanks, guys. Can you recommend any resources that show an example of these
> steps? A google search returns very little information. Now I know what to
> do, but I can't find anything that tells me how to do it.
>
>
> On Wed, May 13, 2015 at 11:56 PM, Suneel Marthi <smarthi@apache.org>
> wrote:
>
> > Hi Jonathan,
> >
> > Here's what you need to do to run RowSimilarity on your CSV-formatted data.
> > You would have to use the MapReduce version since the Spark version only
> > supports LLR.
> >
> > 1. Convert CSV to Vectors - use CSVIterator and store the vectors as
> >    SequenceFiles.
> > 2. Run RowIDJob on the SequenceFile output of (1). This should generate a
> >    Matrix of <IntWritable, VectorWritable> and a docIndex of <IntWritable,
> >    Text>.
> > 3. Run RowSimilarityJob on the matrix output from (2), specifying cosine
> >    similarity and a cutoff threshold. This should generate a matrix of
> >    Rows -> Most similar rows with their similarities.
> >
> > On Wed, May 13, 2015 at 11:42 PM, Jonathan Seale <jonathanpseale@gmail.com>
> > wrote:
> >
> > > Thanks, Charlie,
> > >
> > > The data has been through lots of processing, but in an attempt to make it
> > > more Mahout-friendly, I've converted it into a single CSV table with
> > > columns: star_id, wavelength, intensity. My motivation was to make it like
> > > a user_id, item_id, rating table you might see in other Mahout uses.
> > >
> > > As opposed to using my local machine, I've set up an instance on Amazon
> > > with hopes of turning this into a remote service. So the install is whatever
> > > comes with Amazon's default Mahout installation.
> > >
> > > Jonathan
> > >
> > > On Wed, May 13, 2015 at 11:29 PM, Charlie Hack <charles.t.hack@gmail.com>
> > > wrote:
> > >
> > > > Hi Jonathan, how do you have the data stored? The more info about your
> > > > setup, the better.
> > > >
> > > >
> > > > Charlie
> > > >
> > > >
> > > > On Wednesday, May 13, 2015 at 23:16, Jonathan Seale
> > > > <jonathanpseale@gmail.com> wrote:
> > > > Scientists,
> > > >
> > > > I have an astrophysical application for Mahout that I need help with.
> > > >
> > > > I have 1-dimensional stellar spectra for many, many stars. Each spectrum
> > > > consists of a series of intensity values, one per wavelength of light. I
> > > > need to be able to find the cosine similarity between ALL pairs of stars.
> > > > Seems to me this is simply a user-user similarity problem where I have
> > > > stars instead of users, wavelengths instead of items, and intensities
> > > > instead of ratings/clicks.
> > > >
> > > > But I'm having difficulty using Mahout's row similarity package (I'm new
> > > > to this, and these days astronomers code pretty exclusively in Python). I
> > > > know that I have to 1) create a sparse matrix where each row is a star,
> > > > columns are wavelengths, and the values are intensities, and 2) implement
> > > > row similarity. But I'm just not sure how to do it. Does anyone have a
> > > > good resource, or would anyone be willing to help? I could probably offer
> > > > some compensation to anyone who would be willing to provide a little
> > > > focused, personalized assistance.
> > > >
> > > > Thanks,
> > > >
> > > > Jonathan
> > > >
> > >
> >
>
