mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 14:07:04 GMT
It's been over a year since I ran any tests of KMeans on larger data 
sets and there has been a lot of refactoring done in the interim. I was 
also using only dense vectors. It is entirely possible it is now doing 
something really poorly. I'm surprised that it is taking such a long 
time to munch such a small dataset, but it sounds like you can reproduce 
it on a single machine, so profiling should suggest the root cause. I'm 
going to be away from the computer for the next two weeks - a real 
vacation - so unfortunately I won't be able to contribute to this effort.
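
For anyone picking this up, a minimal sketch of enabling Hadoop's
built-in hprof task profiling (assuming the 0.20-era JobConf knobs),
which should narrow down where a slow iteration spends its time:

  JobConf conf = new JobConf(KMeansDriver.class);
  conf.setProfileEnabled(true);             // mapred.task.profile=true
  conf.setProfileTaskRange(true, "0-1");    // sample the first two map tasks
  conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,"
      + "force=n,thread=y,verbose=n,file=%s");  // %s becomes a per-task file

The hprof output can then be inspected alongside the task logs.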

Jeff

Grant Ingersoll wrote:
>
> On Jul 27, 2009, at 12:00 AM, nfantone wrote:
>
>> Thanks, Grant. I just updated and noticed the change.
>>
>> As a side note: do you think someone could run some real tests on
>> kMeans, in particular, other than the ones already in the project? I
>> bet there are other naive (or not so naive) problems like that. After
>> much coding, reading and experimenting in the last weeks with
>> clustering in Mahout, I am inclined to say something may not fully
>> work with kMeans, as of now. Or perhaps it just needs some
>> refactoring/performance tweaks. Jeff has claimed to run the job over
>> gigabytes of data, using a rather small cluster, in minutes. Has
>> anyone tried to accomplish this recently (since the Hadoop upgrade to
>> 0.20)? Just use ClusteringUtils to write a file of some (arguably not
>> so) significant number of random Vectors (say, 800,000+) and let that
>> be the input of a KMeansMRJob (testKMeansMRJob() could very well serve
>> this purpose with little change). You'll end up with a file of roughly
>> 85MB to 100MB, which can easily fit into memory on any modern
>> computer. Now, run the whole thing (I've tried both locally and using
>> a three-node cluster setup, which, frankly, seemed like a bit too much
>> computing power for such a small number of items in the dataset). It'll
>> take forever to complete.
>>
>
> I hope to hit this soon.  I've got some Amazon credits I need to use 
> and hope to put them towards this.
>
> As with any project in open source, we need people to kick the tires, 
> give feedback (thank you!) and also poke around the code to make it 
> better.
>
> Have you tried your data with some other clustering code, perhaps Weka 
> or something like that?
>
>
>> These simple methods could be used to generate any number of random
>> SparseVectors for testing's sake, if anyone is interested:
>>
>>  private static final Random rnd = new Random();
>>  private static final int CARDINALITY = 1200;
>>  private static final int MAX_NON_ZEROS = 200;
>>  private static final int MAX_VECTORS = 850000;
>>
>>  private static Vector getRandomVector() {
>>     // Name the vector with a random id; SparseVector(String, int)
>>     // takes a name and a cardinality.
>>     String id = String.valueOf(rnd.nextInt(Integer.MAX_VALUE));
>>     Vector v = new SparseVector(id, CARDINALITY);
>>     // Pick how many non-zero entries this vector gets: 1..MAX_NON_ZEROS-1.
>>     int nonZeros = 1 + rnd.nextInt(MAX_NON_ZEROS - 1);
>>     for (int i = 0; i < nonZeros; i++) {
>>         v.setQuick(rnd.nextInt(CARDINALITY), rnd.nextDouble());
>>     }
>>     return v;
>>  }
>>
>>  private static List<Vector> getVectors() {
>>      List<Vector> vectors = new ArrayList<Vector>(MAX_VECTORS);
>>      for (int i = 0; i < MAX_VECTORS; i++) {
>>          vectors.add(getRandomVector());
>>      }
>>      return vectors;
>>  }
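>>
>> A minimal sketch (assuming the 0.2-era API, where Mahout vectors are
>> themselves Writable and carry a name) of pushing these vectors into a
>> SequenceFile that KMeansDriver can take as input; the path name here
>> is made up:
>>
>>  // Assumes the usual org.apache.hadoop.{conf,fs,io} imports.
>>  Configuration conf = new Configuration();
>>  FileSystem fs = FileSystem.get(conf);
>>  Path path = new Path("input/random-vectors.data");
>>  SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
>>      Text.class, SparseVector.class);
>>  try {
>>      for (Vector v : getVectors()) {
>>          // Key each vector by the random id assigned in getRandomVector().
>>          writer.append(new Text(v.getName()), (SparseVector) v);
>>      }
>>  } finally {
>>      writer.close();
>>  }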
>>
>
>
> I'm not sure why testing with random vectors would be all that useful, 
> other than showing that it runs.  I wouldn't expect anything meaningful 
> to come out of it, though.
>
>
>> On Sun, Jul 26, 2009 at 10:30 PM, Grant 
>> Ingersoll<gsingers@apache.org> wrote:
>>> Fixed on MAHOUT-152
>>>
>>> On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:
>>>
>>>> That does indeed look like a problem.  I'll fix.
>>>>
>>>> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>>>>
>>>>> While (still) experiencing performance issues and inspecting the
>>>>> kMeans code, I found this lying around in
>>>>> SquaredEuclideanDistanceMeasure.java:
>>>>>
>>>>> public double distance(double centroidLengthSquare, Vector centroid,
>>>>> Vector v) {
>>>>>  if (centroid.size() != centroid.size()) {
>>>>>    throw new CardinalityException();
>>>>>  }
>>>>>  ...
>>>>>  }
>>>>>
>>>>> I bet someone meant to compare centroid and v sizes and didn't
>>>>> notice.
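>>>>>
>>>>> Presumably the intended check compares the two arguments, something
>>>>> like:
>>>>>
>>>>>  if (centroid.size() != v.size()) {
>>>>>    throw new CardinalityException();
>>>>>  }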
>>>>>
>>>>>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>>
>>>>>> Well, as it turned out, it didn't have anything to do with my
>>>>>> performance issue, but I found out that writing a Cluster (with a
>>>>>> single vector as its center) to a file and then reading it back
>>>>>> requires the center to be added as a point; otherwise, you won't be
>>>>>> able to retrieve it properly. Therefore, one should do:
>>>>>>
>>>>>> // Writing (seqWriter: a SequenceFile.Writer with Text keys and
>>>>>> // Cluster values)
>>>>>> String id = "someID";
>>>>>> Vector v = new SparseVector();
>>>>>> Cluster c = new Cluster(v);
>>>>>> c.addPoint(v); // the center must also be added as a point
>>>>>> seqWriter.append(new Text(id), c);
>>>>>>
>>>>>> // Reading (seqReader: a SequenceFile.Reader over the same file)
>>>>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>>>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>>>>> while (seqReader.next(key, value)) {
>>>>>>   ...
>>>>>>   Vector centroid = value.getCenter();
>>>>>>   ...
>>>>>> }
>>>>>>
>>>>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>>>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>>>>
>>>>>> Sorry for bringing different subjects to the same thread.
>>>>>>
>>>>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nfantone@gmail.com> wrote:
>>>>>>>
>>>>>>> I've been using RandomSeedGenerator to generate initial clusters
>>>>>>> for kMeans and while checking its code I stumbled upon this:
>>>>>>>
>>>>>>>    while (reader.next(key, value)) {
>>>>>>>      Cluster newCluster = new Cluster(value);
>>>>>>>      newCluster.addPoint(value);
>>>>>>>      ....
>>>>>>>    }
>>>>>>>
>>>>>>> I can see it adds the vector to the newly created cluster, even though
>>>>>>> it is setting it as its center in the constructor. Wasn't this
>>>>>>> corrected in a past revision? I thought this was not necessary
>>>>>>> anymore. I'll look into it a little bit more and see if this has
>>>>>>> something to do with the lack of performance on my dataset.
>>>>>>>
>>>>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>>>>
>>>>>>>>>> I'll try that.
>>>>>>>>
>>>>>>>> There was no significant change after modifying the convergence
>>>>>>>> value. At least, none was observed during the first three
>>>>>>>> iterations, which lasted roughly the same amount of time as before.
>>>>>>>>
>>>>>>>>>>> Is there any chance your data is publicly shareable? Come to
>>>>>>>>>>> think of it, with the vector representations, as long as you
>>>>>>>>>>> don't publish the key (which terms map to which index), I would
>>>>>>>>>>> think almost all data is publicly shareable.
>>>>>>>>>>
>>>>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>>>>> shareable? As in user permissions to access/read/write the data?
>>>>>>>>>
>>>>>>>>> As in post a copy of the SequenceFile somewhere for download,
>>>>>>>>> assuming you
>>>>>>>>> can.  Then others could presumably try it out.
>>>>>>>>
>>>>>>>> My bad. Of course it is:
>>>>>>>>
>>>>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>>>>
>>>>>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>>>>>> SparseVector> logical format.
>>>>>>>>
>>>>>>>>> That does seem like an awfully long time for 62MB on a 6-node
>>>>>>>>> cluster. How many iterations are running?
>>>>>>>>
>>>>>>>> I'm running the whole thing with a cap of 20 iterations. Every
>>>>>>>> iteration (EXCEPT the first one, which, oddly, lasted just two
>>>>>>>> minutes) took around 3 hours to complete:
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0001
>>>>>>>> Finished in: 1mins, 42sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0004
>>>>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>>>>
>>>>>>>> Hadoop job_200907221734_0005
>>>>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>>>>
>>>>>>>>> How did you generate your initial clusters?
>>>>>>>>
>>>>>>>> I generate the initial clusters via the RandomSeedGenerator,
>>>>>>>> setting a 'k' value of 200. This is what I did to initiate the
>>>>>>>> process for the first time:
>>>>>>>>
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>> input/user.data
>>>>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data
>>>>>>>> init/user.data
>>>>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar \
>>>>>>>>   org.apache.mahout.clustering.kmeans.KMeansDriver \
>>>>>>>>   -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
>>>>>>>>
>>>>>>>>> Where are the iteration jobs spending most of their time (map vs.
>>>>>>>>> reduce)?
>>>>>>>>
>>>>>>>> I'm tempted to say map here, but the time spent in each is rather
>>>>>>>> comparable, actually. Reduce attempts are taking an hour and a half
>>>>>>>> to finish (on average), and so are map attempts. Here are some
>>>>>>>> representative examples from the web UI:
>>>>>>>>
>>>>>>>> reduce
>>>>>>>> attempt_200907221734_0002_r_000006_0
>>>>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>>>>
>>>>>>>> map
>>>>>>>> attempt_200907221734_0002_m_000000_0
>>>>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>>>>
>>>>>>>> Perhaps there's something wrong with the way I create the
>>>>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>>>> using
>>>> Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
>>> using
>>> Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) 
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>

