mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jens Bonerz <jbon...@googlemail.com>
Subject Re: What are the best settings for my clustering task
Date Thu, 03 Oct 2013 14:30:48 GMT
I created a series of scripts to try out streamingkmeans in mahout an
increased the number of clusters to a high amount as suggested by Ted.
Everything seems to work. However, I can't figure out how to access the
actual cluster data at the end of the process.

It just gives me output that I cannot really understand... I would expect
my product_ids being referenced to cluster ids...

Example of the procedure's output:

hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
locally
Input Path: file:MahoutCluster/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
Key: 0: Value: key = 8678, weight = 3.00, vector =
{37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
Key: 1: Value: key = 3118, weight = 14.00, vector =
{19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
Key: 2: Value: key = 2867, weight = 3.00, vector =
{15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
Key: 3: Value: key = 6295, weight = 1.00, vector =
{17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
Key: 4: Value: key = 6725, weight = 4.00, vector =
{10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
Key: 5:......



this is my recipe:

--------------------------------------------------------------------------------------------------------
Step 1
Create a seqfile from my data with Python. Its the product_id (key) and the
short normalized descripti (value) that is written into the sequence file.



--------------------------------------------------------------------------------------------------------
Step 2
create vectors from that data with the following command:

mahout seq2sparse \
   -i productClusterSequenceData/productClusterSequenceData.seq \
   -o productClusterSequenceData/vectors \



--------------------------------------------------------------------------------------------------------
Step 3
Cluster the vectors using streamingkeans with this command:

mahout streamingkmeans \
-i productClusterSequenceData/vectors/tfidf-vectors \
-o MahoutCluster \
--tempDir /tmp \
-ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
-k 10000 -km 500000 \



--------------------------------------------------------------------------------------------------------
Step 4
Export the streamingkmeans cluster data into a textfile (for an extract of
the result see above)

mahout seqdumper \
-i MahoutCluster > similarProducts.txt

What am I missing?





2013/10/3 Ted Dunning <ted.dunning@gmail.com>

> Yes.  That will work.
>
> The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
> \approx 30 so the sketch will have at about 300,000 weighted centroids in
> it.  The final clustering will have to process these centroids to produce
> the desired 5,000 clusters.  Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
>
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jbonerz@googlemail.com>
> wrote:
>
> > thx for your elaborate answer.
> >
> > so if the upper bound on the final number of clusters is unknown in the
> > beginning, what would happen, if I define a very high number that is
> > guaranteed to be > the estimated number of clusters.
> > for example if I set it to 10.000 clusters if an estimate of 5.000 is
> > likely, will that work?
> >
> >
> > 2013/10/2 Ted Dunning <ted.dunning@gmail.com>
> >
> > > The way that the new streaming k-means works is that there is a first
> > > sketch pass which only requires an upper bound on the final number of
> > > clusters you will want.  It adaptively creates more or less clusters
> > > depending on the data and your bound.  This sketch is guaranteed to be
> > > computed within at most one map-reduce pass.  There is a threaded
> version
> > > that runs (fast) on a single machine.  The threaded version is liable
> to
> > be
> > > faster than the map-reduce version for moderate or smaller data sizes.
> > >
> > > That sketch can then be used to do all kinds of things that rely on
> > > Euclidean distance and still get results within a small factor of the
> > same
> > > algorithm applied to all of the data.  Typically this second phase is a
> > > ball k-means algorithm, but it could easily be a dp-means algorithm [1]
> > if
> > > you want a variable number of clusters.  Indeed, you could run many
> > > dp-means passes with different values of lambda on the same sketch.
>  Note
> > > that the sketch is small enough that in-memory clustering is entirely
> > > viable and is very fast.
> > >
> > > For the problem you describe, however, you probably don't need the
> sketch
> > > approach at all and can probably apply ball k-means or dp-means
> directly.
> > >  Running many k-means clusterings with differing values of k should be
> > > entirely feasible as well with such data sizes.
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > >
> > >
> > >
> > >
> > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jbonerz@googlemail.com>
> > > wrote:
> > >
> > > > Isn't the streaming k-means just a different approach to crunch
> through
> > > the
> > > > data? In other words, the result of streaming k-means should be
> > > comparable
> > > > to using k-means in multiple chained map reduce cycles?
> > > >
> > > > I just read a paper about the k-means clustering and its underlying
> > > > algorithm.
> > > >
> > > > According to that paper, k-means relies on a preknown/predefined
> amount
> > > of
> > > > clusters as an input parameter.
> > > >
> > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > >
> > > > In my current scenario however, the number of clusters is unknown at
> > the
> > > > beginning.
> > > >
> > > > Maybe k-means is just not the right algorithm for clustering similar
> > > > products based on their short description text? What else could I
> use?
> > > >
> > > >
> > > >
> > > >
> > > > 2013/10/1 Ted Dunning <ted.dunning@gmail.com>
> > > >
> > > > > At such small sizes, I would guess that the sequential version of
> the
> > > > > streaming k-means or ball k-means would be better options.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> > jbonerz@googlemail.com
> > > > > >wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am currently trying create clusters from a group of 50.000
> > strings
> > > > that
> > > > > > contain product descriptions (around 70-100 characters length
> > each).
> > > > > >
> > > > > > That group of 50.000 consists of roughly 5.000 individual
> products
> > > and
> > > > > ten
> > > > > > varying product descriptions per product. The product
> descriptions
> > > are
> > > > > > already prepared for clustering and contain a normalized brand
> > name,
> > > > > > product
> > > > > > model number, etc.
> > > > > >
> > > > > > What would be a good approach to maximise the amound of found
> > > clusters
> > > > > (the
> > > > > > best possible value would be 5.000 clusters with 10 products
> each)
> > > > > >
> > > > > > I adapted the reuters cluster script to read in my data and
> managed
> > > to
> > > > > > create a first set of clusters. However, I have not managed
to
> > > maximise
> > > > > the
> > > > > > cluster count.
> > > > > >
> > > > > > The question is: what do I need to tweak with regard to the
> > available
> > > > > > mahout
> > > > > > settings, so the clusters are created as precisely as possible?
> > > > > >
> > > > > > Many regards!
> > > > > > Jens
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > >
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > Sent from the Mahout User List mailing list archive at
> Nabble.com.
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > CEO
> > Hightech Marketing Group
> > Cell/Mobile: +49 173 539 3588
> >
> > ____
> >
> > Hightech Marketing Group
> > Frankenstraße 32
> > 50354 Huerth
> > Germany
> > Phone: +49 (0)2233 – 619 2741
> > Fax: +49 (0)2233 – 619 27419
> > Web: www.hightechmg.com
> >
>



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message