mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 15:03:34 GMT
Hmm, some profiling shows the pain is in the distance calculation in
emitPointToNearestCluster.  It seems we only use the optimized distance
calculation for testing convergence, but shouldn't we use it for
calculating the distances to the clusters, too?
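
If so, the fix inside emitPointToNearestCluster might look roughly like
this (just a sketch; getCenterLengthSquared() is a hypothetical
per-cluster cache of center.dot(center), recomputed once per iteration,
not an existing method):

Cluster nearestCluster = null;
double nearestDistance = Double.MAX_VALUE;
for (Cluster cluster : clusters) {
  // Pass the cached |center|^2 so the measure can use the expanded form
  // |c|^2 - 2*c.v + |v|^2 instead of materializing (c - v) per point.
  double d = measure.distance(cluster.getCenterLengthSquared(),
                              cluster.getCenter(), point);
  if (nearestCluster == null || d < nearestDistance) {
    nearestCluster = cluster;
    nearestDistance = d;
  }
}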


On Jul 27, 2009, at 10:19 AM, Grant Ingersoll wrote:

> I can confirm it is taking a while.  I spun up the dataset provided
> and am on the first iteration; the mapper is at 50% and it has been
> running for over an hour.
>
> Not a good sign.  I will try profiling.
>
> On Jul 27, 2009, at 10:07 AM, Jeff Eastman wrote:
>
>> It's been over a year since I ran any tests of KMeans on larger  
>> data sets and there has been a lot of refactoring done in the  
>> interim. I was also using only dense vectors. It is entirely  
>> possible it is now doing something really poorly. I'm surprised  
>> that it is taking such a long time to munch such a small dataset  
>> but it sounds like you can reproduce it on a single machine so  
>> profiling should suggest the root cause. I'm going to be away from  
>> the computer for the next two weeks - a real vacation - so  
>> unfortunately I won't be able to contribute to this effort.
>>
>> Jeff
>>
>> Grant Ingersoll wrote:
>>>
>>> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>>>
>>>> Thanks, Grant. I just updated and noticed the change.
>>>>
>>>> As a side note: do you think someone could run some real tests on
>>>> kMeans, other than the ones already in the project? I bet there are
>>>> other naive (or not so naive) problems like that one. After much
>>>> coding, reading and experimenting with clustering in Mahout over the
>>>> last few weeks, I am inclined to say something may not be fully
>>>> working in kMeans as of now. Or perhaps it just needs some
>>>> refactoring/performance tweaks. Jeff has claimed to have run the job
>>>> over gigabytes of data, using a rather small cluster, in minutes.
>>>> Has anyone tried to accomplish this recently (since the Hadoop
>>>> upgrade to 0.20)? Just use ClusteringUtils to write a file of some
>>>> (arguably not so) significant number of random Vectors (say,
>>>> 800,000+) and let that be the input of a KMeansMRJob
>>>> (testKMeansMRJob() could very well serve this purpose with little
>>>> change). You'll end up with a file of about ~85MB to ~100MB, which
>>>> can easily fit into memory on any modern computer. Now, run the
>>>> whole thing (I've tried both locally and using a three-node cluster
>>>> setup - which, frankly, seemed like a bit too much computing power
>>>> for such a small number of items in the dataset). It'll take forever
>>>> to complete.
>>>>
>>>
>>> I hope to hit this soon.  I've got some Amazon credits I need to  
>>> use and hope to put them towards this.
>>>
>>> As with any project in open source, we need people to kick the  
>>> tires, give feedback (thank you!) and also poke around the code to  
>>> make it better.
>>>
>>> Have you tried your data with some other clustering code, perhaps  
>>> Weka or something like that?
>>>
>>>
>>>> These simple methods can be used to generate any given number of
>>>> random SparseVectors for testing purposes, if anyone is interested:
>>>>
>>>> // needs java.util.{ArrayList, List, Random} plus the Vector and
>>>> // SparseVector imports from whatever Mahout revision you're on
>>>> private static final Random rnd = new Random();
>>>> private static final int CARDINALITY = 1200;
>>>> private static final int MAX_NON_ZEROS = 200;
>>>> private static final int MAX_VECTORS = 850000;
>>>>
>>>> // Builds a named SparseVector with 1..MAX_NON_ZEROS-1 random
>>>> // non-zero entries; index collisions may leave slightly fewer,
>>>> // which is fine for test data.
>>>> private static Vector getRandomVector() {
>>>>   String id = String.valueOf(rnd.nextInt(Integer.MAX_VALUE));
>>>>   Vector v = new SparseVector(id, CARDINALITY);
>>>>   int nonZeros = 1 + rnd.nextInt(MAX_NON_ZEROS - 1); // at least one
>>>>   for (int i = 0; i < nonZeros; i++) {
>>>>     v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>>>   }
>>>>   return v;
>>>> }
>>>>
>>>> private static List<Vector> getVectors() {
>>>>   List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>>>   for (int i = 0; i < MAX_VECTORS; i++) {
>>>>     vectors.add(getRandomVector());
>>>>   }
>>>>   return vectors;
>>>> }
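>>>>
>>>> And a sketch of dumping these straight to a SequenceFile as KMeans
>>>> input (the same <Text, SparseVector> format as my dataset; this
>>>> assumes fs and conf are already set up, and that getName() returns
>>>> the name set in the constructor):
>>>>
>>>> SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf,
>>>>     new Path("input/random.data"), Text.class, SparseVector.class);
>>>> for (Vector v : getVectors()) {
>>>>   // key mirrors the vector's name; the vector itself is the value
>>>>   seqWriter.append(new Text(v.getName()), (SparseVector) v);
>>>> }
>>>> seqWriter.close();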
>>>>
>>>
>>>
>>> I'm not sure why testing with random vectors would be all that
>>> useful, other than showing that it runs.  I wouldn't expect anything
>>> meaningful to come out of it, though.
>>>
>>>
>>>> On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll
>>>> <gsingers@apache.org> wrote:
>>>>> Fixed on MAHOUT-152
>>>>>
>>>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>>>
>>>>>> That does indeed look like a problem.  I'll fix.
>>>>>>
>>>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>>>
>>>>>>> While (still) experiencing performance issues and inspecting
>>>>>>> kMeans code, I found this lying around in
>>>>>>> SquaredEuclideanDistanceMeasure.java:
>>>>>>>
>>>>>>> public double distance(double centroidLengthSquare,
>>>>>>>                        Vector centroid, Vector v) {
>>>>>>>   if (centroid.size() != centroid.size()) {
>>>>>>>     throw new CardinalityException();
>>>>>>>   }
>>>>>>>   ...
>>>>>>> }
>>>>>>>
>>>>>>> I bet someone meant to compare centroid and v sizes and didn't
>>>>>>> notice.
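>>>>>>>
>>>>>>> Presumably it was meant to read:
>>>>>>>
>>>>>>> if (centroid.size() != v.size()) {
>>>>>>>   throw new CardinalityException();
>>>>>>> }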
>>>>>>>
>>>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone <nfantone@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>>>> performance issue, but I found out that writing a Cluster (with a
>>>>>>>> single vector as its center) to a file and then reading it back
>>>>>>>> requires the center to be added as a point; otherwise, you won't
>>>>>>>> be able to retrieve it properly. Therefore, one should do:
>>>>>>>>
>>>>>>>> // Writing
>>>>>>>> String id = "someID";
>>>>>>>> Vector v = new SparseVector();
>>>>>>>> Cluster c = new Cluster(v);
>>>>>>>> c.addPoint(v);  // required, or the center can't be read back
>>>>>>>> seqWriter.append(new Text(id), c);
>>>>>>>>
>>>>>>>> // Reading
>>>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>>>>> while (seqReader.next(key, value)) {
>>>>>>>>   ...
>>>>>>>>   Vector centroid = value.getCenter();
>>>>>>>>   ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I
>>>>>>>> think this shouldn't happen. Then again, it's not that relevant,
>>>>>>>> I guess.
>>>>>>>>
>>>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>>>
>>>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone <nfantone@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I've been using RandomSeedGenerator to generate initial clusters
>>>>>>>>> for kMeans, and while checking its code I stumbled upon this:
>>>>>>>>>
>>>>>>>>>  while (reader.next(key, value)) {
>>>>>>>>>    Cluster newCluster = new Cluster(value);
>>>>>>>>>    newCluster.addPoint(value);
>>>>>>>>>    ....
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> I can see it adds the vector to the newly created cluster, even
>>>>>>>>> though it is being set as its center in the constructor. Wasn't
>>>>>>>>> this corrected in a past revision? I thought it was not necessary
>>>>>>>>> anymore. I'll look into it a little more and see if this has
>>>>>>>>> something to do with the poor performance on my dataset.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone <nfantone@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I
>>>>>>>>>>>>> believe).
>>>>>>>>>>>>
>>>>>>>>>>>> I'll try that.
>>>>>>>>>>
>>>>>>>>>> There was no significant change after modifying the convergence
>>>>>>>>>> value. At least, none was observed during the first three
>>>>>>>>>> iterations, which lasted about the same amount of time as
>>>>>>>>>> before.
>>>>>>>>>>
>>>>>>>>>>>>> Is there any chance your data is publicly shareable?  Come
>>>>>>>>>>>>> to think of it, with the vector representations, as long as
>>>>>>>>>>>>> you don't publish the key (which terms map to which index),
>>>>>>>>>>>>> I would think almost all data is publicly shareable.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm sorry, I don't quite understand what you're asking.
>>>>>>>>>>>> Publicly shareable? As in user permissions to
>>>>>>>>>>>> access/read/write the data?
>>>>>>>>>>>
>>>>>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>>>>>> assuming you can.  Then others could presumably try it out.
>>>>>>>>>>
>>>>>>>>>> My bad. Of course it is:
>>>>>>>>>>
>>>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>>>
>>>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>>>>> SparseVector> logical format.
>>>>>>>>>>
>>>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6-node
>>>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>>>
>>>>>>>>>> I'm running the whole thing with a 20-iteration cap. Every
>>>>>>>>>> iteration - EXCEPT the first one, which, oddly, lasted just two
>>>>>>>>>> minutes - took around 3 hours to complete:
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>>>
>>>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>>>
>>>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>>>
>>>>>>>>>> I generate the initial clusters via the RandomSeedGenerator,
>>>>>>>>>> setting a 'k' value of 200.  This is what I did to initiate the
>>>>>>>>>> process for the first time:
>>>>>>>>>>
>>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar \
>>>>>>>>>>     org.apache.mahout.clustering.kmeans.KMeansDriver \
>>>>>>>>>>     -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
>>>>>>>>>>
>>>>>>>>>>> Where are the iteration jobs spending most of their time (map
>>>>>>>>>>> vs. reduce)?
>>>>>>>>>>
>>>>>>>>>> I'm tempted to say map here, but their times are actually
>>>>>>>>>> rather comparable. Reduce attempts are taking an hour and a
>>>>>>>>>> half to finish (on average), and so are map attempts. Here are
>>>>>>>>>> some representative examples from the web UI:
>>>>>>>>>>
>>>>>>>>>> reduce
>>>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>>>
>>>>>>>>>> map
>>>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>>>
>>>>>>>>>> Perhaps there's something inconvenient in the way I create the
>>>>>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

