mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 01:30:00 GMT
Fixed on MAHOUT-152

On Jul 26, 2009, at 9:19 PM, Grant Ingersoll wrote:

> That does indeed look like a problem.  I'll fix.
>
> On Jul 26, 2009, at 2:37 PM, nfantone wrote:
>
>> While (still) experiencing performance issues and inspecting kMeans
>> code, I found this lying around SquaredEuclideanDistanceMeasure.java:
>>
>> public double distance(double centroidLengthSquare, Vector centroid,
>> Vector v) {
>>   if (centroid.size() != centroid.size()) {
>>     throw new CardinalityException();
>>   }
>>   ...
>>  }
>>
>> I bet someone meant to compare centroid and v sizes and didn't
>> notice.
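
For reference, the check quoted above presumably should compare the sizes of the two different vectors. Here is a minimal sketch of that, using plain double[] arrays in place of Mahout's Vector. The class name, the exception type, and the expansion used for the distance are assumptions for illustration, not the code committed for MAHOUT-152:

```java
// Illustrative sketch only: double[] stands in for Mahout's Vector,
// and SquaredEuclideanSketch is a hypothetical name.
final class SquaredEuclideanSketch {

    /** Squared Euclidean distance that reuses a precomputed ||centroid||^2. */
    static double distance(double centroidLengthSquare, double[] centroid, double[] v) {
        // Compare the two different vectors; the quoted code compared
        // centroid with itself, so the check could never fail.
        if (centroid.length != v.length) {
            throw new IllegalArgumentException(
                "Cardinality mismatch: " + centroid.length + " vs " + v.length);
        }
        double dot = 0.0;
        double vLengthSquare = 0.0;
        for (int i = 0; i < v.length; i++) {
            dot += centroid[i] * v[i];
            vLengthSquare += v[i] * v[i];
        }
        // ||c - v||^2 = ||c||^2 - 2 (c . v) + ||v||^2
        return centroidLengthSquare - 2.0 * dot + vLengthSquare;
    }

    public static void main(String[] args) {
        double[] c = {1.0, 2.0};
        double[] v = {4.0, 6.0};
        // ||c||^2 = 1 + 4 = 5 is passed in precomputed.
        System.out.println(distance(5.0, c, v));
    }
}
```

With the original self-comparison, a vector of the wrong cardinality would slip past the guard and only fail (or silently give a wrong result) later in the distance computation.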
>>
>> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nfantone@gmail.com> wrote:
>>> Well, as it turned out, it didn't have anything to do with my
>>> performance issue, but I found out that writing a Cluster (with a
>>> single vector as its center) to a file and then reading it back
>>> requires the center to be added as a point; otherwise, you won't be
>>> able to retrieve it correctly. Therefore, one should do:
>>>
>>> // Writing
>>> String id = "someID";
>>> Vector v = new SparseVector();
>>> Cluster c = new Cluster(v);
>>> c.addPoint(v);
>>> seqWriter.append(new Text(id), c);
>>>
>>> // Reading
>>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>>> while (seqReader.next(key, value)) {
>>> ...
>>> Vector centroid = value.getCenter();
>>> ...
>>> }
>>>
>>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
>>> this shouldn't happen. Then again, it's not that relevant, I guess.
>>>
>>> Sorry for bringing different subjects to the same thread.
>>>
>>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nfantone@gmail.com> wrote:
>>>> I've been using RandomSeedGenerator to generate initial clusters for
>>>> kMeans and while checking its code I stumbled upon this:
>>>>
>>>>     while (reader.next(key, value)) {
>>>>       Cluster newCluster = new Cluster(value);
>>>>       newCluster.addPoint(value);
>>>>       ....
>>>>     }
>>>>
>>>> I can see it adds the vector to the newly created cluster, even though
>>>> it is already setting it as its center in the constructor. Wasn't this
>>>> corrected in a past revision? I thought this was not necessary
>>>> anymore. I'll look into it a little bit more and see if this has
>>>> something to do with the poor performance on my dataset.
>>>>
>>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>>
>>>>>>> I'll try that.
>>>>>
>>>>> There was no significant change when modifying the convergence value.
>>>>> At least, none was observed during the first three iterations, which
>>>>> lasted about the same amount of time as before.
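
That is consistent with how a convergence threshold usually behaves in k-means: it decides when to stop iterating, not how much work each iteration does. Here is a minimal sketch, assuming the check compares how far each center moved against the -d value. The class and method names are hypothetical, not Mahout's API:

```java
// Illustrative sketch only: double[] stands in for Mahout's Vector.
final class ConvergenceSketch {

    /** Plain Euclidean distance between two vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** True when no center moved farther than delta in the last iteration. */
    static boolean allConverged(double[][] oldCenters, double[][] newCenters, double delta) {
        for (int i = 0; i < oldCenters.length; i++) {
            if (euclidean(oldCenters[i], newCenters[i]) > delta) {
                return false; // at least one center is still moving
            }
        }
        return true;
    }
}
```

Under that assumption, a larger delta can only reduce the number of iterations that run before the job stops; the cost of any single iteration stays the same.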
>>>>>
>>>>>>>> Is there any chance your data is publicly shareable?  Come to
>>>>>>>> think of it, with the vector representations, as long as you
>>>>>>>> don't publish the key (which terms map to which index), I would
>>>>>>>> think most all data is publicly shareable.
>>>>>>>
>>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>>
>>>>>> As in post a copy of the SequenceFile somewhere for download, assuming
>>>>>> you can.  Then others could presumably try it out.
>>>>>
>>>>> My bad. Of course it is:
>>>>>
>>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>>
>>>>> That's the ~62 MB SequenceFile sample I've been using, in <Text,
>>>>> SparseVector> logical format.
>>>>>
>>>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>>>> cluster. How many iterations are running?
>>>>>
>>>>> I'm running the whole thing with a cap of 20 iterations. Every
>>>>> iteration, EXCEPT the first one, which oddly lasted just two minutes,
>>>>> took around 3 hours to complete:
>>>>>
>>>>> Hadoop job_200907221734_0001
>>>>> Finished in: 1mins, 42sec
>>>>>
>>>>> Hadoop job_200907221734_0004
>>>>> Finished in: 2hrs, 34mins, 3sec
>>>>>
>>>>> Hadoop job_200907221734_0005
>>>>> Finished in: 2hrs, 59mins, 34sec
>>>>>
>>>>>> How did you generate your initial clusters?
>>>>>
>>>>> I generate the initial clusters via the RandomSeedGenerator, setting a
>>>>> 'k' value of 200.  This is what I did to initiate the process for the
>>>>> first time:
>>>>>
>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar \
>>>>>   org.apache.mahout.clustering.kmeans.KMeansDriver \
>>>>>   -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
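
As a rough way to reason about per-iteration cost with -k 200: in standard k-means, the assignment step compares every input vector with each of the k centers, so one iteration costs about (number of vectors) x 200 distance computations. Here is a minimal sketch of that step, with plain double[] standing in for Mahout's vectors and hypothetical names throughout; it is not Mahout's mapper code:

```java
// Illustrative sketch only: shows why assignment cost scales with k.
final class AssignmentCostSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to the point; one distance call per center. */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = squaredDistance(point, centers[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    /** Distance computations needed to assign every point once. */
    static long comparisonsPerIteration(long numPoints, int k) {
        return numPoints * k;
    }
}
```

This only counts distance calls; with sparse vectors the cost of each call also depends on how the distance measure iterates over the entries.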
>>>>>
>>>>>> Where are the iteration jobs spending most of their time (map vs.
>>>>>> reduce)?
>>>>>
>>>>> I'm tempted to say map here, but the time spent in each is rather
>>>>> comparable, actually. Reduce attempts take about an hour and a half
>>>>> to finish on average, and so do map attempts. Here are some
>>>>> representative examples from the web UI:
>>>>>
>>>>> reduce
>>>>> attempt_200907221734_0002_r_000006_0
>>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>>
>>>>> map
>>>>> attempt_200907221734_0002_m_000000_0
>>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>>
>>>>> Perhaps there's some problem with the way I create the
>>>>> SequenceFile? I could share the Java code as well, if required.
>>>>>
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>


