mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 14:19:22 GMT
I can confirm it is taking a while. I spun up the dataset provided and am
on the first iteration; the mapper is at 50% and it has been running for
over an hour.

Not a good sign.  I will try profiling.
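
For anyone who wants to do the same: Hadoop 0.20 ships a built-in HPROF
hook that can profile a few task JVMs. A minimal sketch of how one might
turn it on when configuring a job (the class name and the job wiring
around it are hypothetical):

import org.apache.hadoop.mapred.JobConf;

public class ProfiledJobSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setProfileEnabled(true);           // ask the framework to profile some tasks
    conf.setProfileTaskRange(true, "0-1");  // profile the first two map tasks...
    conf.setProfileTaskRange(false, "0-1"); // ...and the first two reduce tasks
    // HPROF options; %s is substituted with the per-task output file
    conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
        + "force=n,thread=y,verbose=n,file=%s");
    // ...submit the job with this conf; profiles land next to the task logs
  }
}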

On Jul 27, 2009, at 10:07 AM, Jeff Eastman wrote:

> It's been over a year since I ran any tests of KMeans on larger data  
> sets and there has been a lot of refactoring done in the interim. I  
> was also using only dense vectors. It is entirely possible it is now  
> doing something really poorly. I'm surprised that it is taking such  
> a long time to munch such a small dataset but it sounds like you can  
> reproduce it on a single machine so profiling should suggest the  
> root cause. I'm going to be away from the computer for the next two  
> weeks - a real vacation - so unfortunately I won't be able to  
> contribute to this effort.
>
> Jeff
>
> Grant Ingersoll wrote:
>>
>> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>>
>>> Thanks, Grant. I just updated and notice the change.
>>>
>>> As a side note: do you think someone could run some real tests on
>>> kMeans, other than the ones already in the project? I bet there are
>>> other naive (or not so naive) problems like that one. After much coding,
>>> reading and experimenting with clustering in Mahout over the last weeks,
>>> I am inclined to say something may not be fully working in kMeans as of
>>> now. Or perhaps it just needs some refactoring/performance tweaks. Jeff
>>> has claimed to run the job over gigabytes of data, using a rather small
>>> cluster, in minutes. Has anyone tried to accomplish this recently (since
>>> the Hadoop upgrade to 0.20)? Just use ClusteringUtils to write a file of
>>> some (arguably not so) significant number of random Vectors (say,
>>> 800,000+) and let that be the input of a KMeansMRJob (testKMeansMRJob()
>>> could very well serve this purpose with little change). You'll end up
>>> with a file of about ~85MB to ~100MB, which can easily fit into memory
>>> on any modern computer. Now, run the whole thing (I've tried both
>>> locally and using a three-node cluster setup, which, frankly, seemed
>>> like a bit too much computing power for such a small number of items in
>>> the dataset). It'll take forever to complete.
>>>
>>
>> I hope to hit this soon.  I've got some Amazon credits I need to  
>> use and hope to put them towards this.
>>
>> As with any project in open source, we need people to kick the  
>> tires, give feedback (thank you!) and also poke around the code to  
>> make it better.
>>
>> Have you tried your data with some other clustering code, perhaps  
>> Weka or something like that?
>>
>>
>>> These simple methods could be used to generate any given number of
>>> random SparseVectors for testing's sake, if anyone is interested:
>>>
>>> import java.util.ArrayList;
>>> import java.util.List;
>>> import java.util.Random;
>>>
>>> import org.apache.mahout.matrix.SparseVector;
>>> import org.apache.mahout.matrix.Vector;
>>>
>>> private static Random rnd = new Random();
>>> private static final int CARDINALITY = 1200;
>>> private static final int MAX_NON_ZEROS = 200;
>>> private static final int MAX_VECTORS = 850000;
>>>
>>> private static Vector getRandomVector() {
>>>     Integer id = rnd.nextInt(Integer.MAX_VALUE);
>>>     Vector v = new SparseVector(id.toString(), CARDINALITY);
>>>     // Draw a non-zero count in [1, MAX_NON_ZEROS); retry on zero.
>>>     int nonZeros = 0;
>>>     while ((nonZeros = rnd.nextInt(MAX_NON_ZEROS)) == 0);
>>>     // Colliding indices may leave fewer than nonZeros entries set.
>>>     for (int i = 0; i < nonZeros; i++) {
>>>         v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>>     }
>>>     return v;
>>> }
>>>
>>> private static List<Vector> getVectors() {
>>>     List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>>     for (int i = 0; i < MAX_VECTORS; i++) {
>>>         vectors.add(getRandomVector());
>>>     }
>>>     return vectors;
>>> }
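>>>
>>> In case it helps, here is a minimal sketch of dumping those vectors into
>>> a SequenceFile to feed the job. The path and key scheme are made up, and
>>> it assumes the current SparseVector can be written directly as a
>>> SequenceFile value (the <Text, SparseVector> sample file mentioned
>>> elsewhere in this thread suggests it can):
>>>
>>> import java.io.IOException;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Text;
>>>
>>> private static void writeVectors(String pathStr) throws IOException {
>>>     Configuration conf = new Configuration();
>>>     FileSystem fs = FileSystem.get(conf);
>>>     Path path = new Path(pathStr); // e.g. "testdata/random-vectors.data"
>>>     SequenceFile.Writer writer =
>>>         SequenceFile.createWriter(fs, conf, path, Text.class, SparseVector.class);
>>>     try {
>>>         for (int i = 0; i < MAX_VECTORS; i++) {
>>>             // The key is just the running index; any unique Text would do.
>>>             writer.append(new Text(String.valueOf(i)), (SparseVector) getRandomVector());
>>>         }
>>>     } finally {
>>>         writer.close();
>>>     }
>>> }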
>>>
>>
>>
>> I'm not sure why testing with random vectors would be all that useful,
>> other than showing that it runs. I wouldn't expect anything useful to
>> come out of it, though.
>>
>>
>>> On Sun, Jul 26, 2009 at 10:30 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>> Fixed on MAHOUT-152
>>>>
>>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>>
>>>>> That does indeed look like a problem.  I'll fix.
>>>>>
>>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>>
>>>>>> While (still) experiencing performance issues and inspecting kMeans
>>>>>> code, I found this lying around in SquaredEuclideanDistanceMeasure.java:
>>>>>>
>>>>>> public double distance(double centroidLengthSquare, Vector centroid,
>>>>>>     Vector v) {
>>>>>>   if (centroid.size() != centroid.size()) {
>>>>>>     throw new CardinalityException();
>>>>>>   }
>>>>>>   ...
>>>>>> }
>>>>>>
>>>>>> I bet someone meant to compare centroid and v sizes and didn't notice.
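>>>>>>
>>>>>> Presumably the intended comparison is between centroid and v, i.e.
>>>>>> (just a sketch of the likely fix; the rest of the method is unchanged):
>>>>>>
>>>>>> if (centroid.size() != v.size()) {
>>>>>>   throw new CardinalityException();
>>>>>> }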
>>>>>>
>>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone <nfantone@gmail.com> wrote:
>>>>>>>
>>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>>> performance issue, but I found out that writing a Cluster (with a
>>>>>>> single vector as its center) to a file and then reading it requires
>>>>>>> the center to be added as a point; otherwise, you won't be able to
>>>>>>> retrieve it properly. Therefore, one should do:
>>>>>>>
>>>>>>> // Writing
>>>>>>> String id = "someID";
>>>>>>> Vector v = new SparseVector();
>>>>>>> Cluster c = new Cluster(v);
>>>>>>> c.addPoint(v);
>>>>>>> seqWriter.append(new Text(id), c);
>>>>>>>
>>>>>>> // Reading
>>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>>>> while (seqReader.next(key, value)) {
>>>>>>>   ...
>>>>>>>   Vector centroid = value.getCenter();
>>>>>>>   ...
>>>>>>> }
>>>>>>>
>>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>>>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>>>>
>>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>>
>>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone <nfantone@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I've been using RandomSeedGenerator to generate initial clusters
>>>>>>>> for kMeans, and while checking its code I stumbled upon this:
>>>>>>>>
>>>>>>>> while (reader.next(key, value)) {
>>>>>>>>   Cluster newCluster = new Cluster(value);
>>>>>>>>   newCluster.addPoint(value);
>>>>>>>>   ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> I can see it adds the vector to the newly created cluster, even
>>>>>>>> though the constructor already sets it as the center. Wasn't this
>>>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>>>> anymore. I'll look into it a little bit more and see if this has
>>>>>>>> something to do with the poor performance on my dataset.
>>>>>>>>
>>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone <nfantone@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>>>>
>>>>>>>>>>> I'll try that.
>>>>>>>>>
>>>>>>>>> There was no significant change after modifying the convergence
>>>>>>>>> value. At least, none was observed during the first three
>>>>>>>>> iterations, which lasted roughly the same amount of time as before.
>>>>>>>>>
>>>>>>>>>>>> Is there any chance your data is publicly shareable? Come to
>>>>>>>>>>>> think of it, with the vector representations, as long as you
>>>>>>>>>>>> don't publish the key (which terms map to which index), I would
>>>>>>>>>>>> think almost all data is publicly shareable.
>>>>>>>>>>>
>>>>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>>>>>> shareable? As in user permissions to access/read/write the data?
>>>>>>>>>>
>>>>>>>>>> As in post a copy of the SequenceFile somewhere for
download,
>>>>>>>>>> assuming you
>>>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>>>
>>>>>>>>> My bad. Of course it is:
>>>>>>>>>
>>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>>
>>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in
>>>>>>>>> <Text, SparseVector> logical format.
>>>>>>>>>
>>>>>>>>>> That does seem like an awfully long time for 62 MB on a 6-node
>>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>>
>>>>>>>>> I'm running the whole thing with a 20-iteration cap. Every
>>>>>>>>> iteration (EXCEPT the first one, which, oddly, lasted just two
>>>>>>>>> minutes) took around 3 hours to complete:
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>>
>>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>>
>>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>>
>>>>>>>>> I generate the initial clusters via the RandomSeedGenerator,
>>>>>>>>> setting a 'k' value of 200. This is what I did to initiate the
>>>>>>>>> process for the first time:
>>>>>>>>>
>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
>>>>>>>>>
>>>>>>>>>> Where are the iteration jobs spending most of their time (map
>>>>>>>>>> vs. reduce)?
>>>>>>>>>
>>>>>>>>> I'm tempted to say map here, but the time spent in each is rather
>>>>>>>>> comparable, actually. Reduce attempts are taking an hour and a
>>>>>>>>> half to end (on average), and so are map attempts. Here are some
>>>>>>>>> representative examples from the web UI:
>>>>>>>>>
>>>>>>>>> reduce
>>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>>
>>>>>>>>> map
>>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>>
>>>>>>>>> Perhaps there's something wrong in the way I create the
>>>>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search

