mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering from DB
Date Mon, 27 Jul 2009 01:19:57 GMT
That does indeed look like a problem.  I'll fix.
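
The intended check is presumably against v's cardinality rather than the
centroid's own; a sketch of the likely one-line fix, mirroring the snippet
quoted below:

  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }
    ...
  }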

On Jul 26, 2009, at 2:37 PM, nfantone wrote:

> While (still) experiencing performance issues and inspecting kMeans
> code, I found this lying around in SquaredEuclideanDistanceMeasure.java:
>
>  public double distance(double centroidLengthSquare, Vector centroid,
> Vector v) {
>    if (centroid.size() != centroid.size()) {
>      throw new CardinalityException();
>    }
>    ...
>   }
>
> I bet someone meant to compare centroid and v sizes and didn't notice.
>
> On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nfantone@gmail.com> wrote:
>> Well, as it turned out, it didn't have anything to do with my
>> performance issue, but I found out that writing a Cluster (with a
>> single vector as its center) to a file and then reading it back
>> requires the center to be added as a point as well; otherwise, you
>> won't be able to retrieve it as expected. Therefore, one should do:
>>
>> // Writing
>> String id = "someID";
>> Vector v = new SparseVector();
>> Cluster c = new Cluster(v);
>> c.addPoint(v);
>> seqWriter.append(new Text(id), c);
>>
>> // Reading
>> Writable key = (Writable) seqReader.getKeyClass().newInstance();
>> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
>> while (seqReader.next(key, value)) {
>>     ...
>>     Vector centroid = value.getCenter();
>>     ...
>> }
>>
>> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I don't
>> think this extra step should be necessary. Then again, it's not that
>> relevant, I guess.
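
For completeness, a minimal sketch of how the seqWriter/seqReader used
above could be created, assuming Hadoop's plain SequenceFile API (the
path below is just a placeholder):

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("clusters/part-00000");  // placeholder location

  SequenceFile.Writer seqWriter =
      new SequenceFile.Writer(fs, conf, path, Text.class, Cluster.class);
  // ... append the (id, cluster) pairs as shown above, then ...
  seqWriter.close();

  SequenceFile.Reader seqReader = new SequenceFile.Reader(fs, path, conf);
  // ... iterate with seqReader.next(key, value) as shown above, then ...
  seqReader.close();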
>>
>> Sorry for bringing different subjects to the same thread.
>>
>> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nfantone@gmail.com> wrote:
>>> I've been using RandomSeedGenerator to generate initial clusters for
>>> kMeans and while checking its code I stumbled upon this:
>>>
>>>      while (reader.next(key, value)) {
>>>        Cluster newCluster = new Cluster(value);
>>>        newCluster.addPoint(value);
>>>        ....
>>>      }
>>>
>>> I can see it adds the vector to the newly created cluster, even
>>> though it is already set as its center in the constructor. Wasn't
>>> this corrected in a past revision? I thought this was not necessary
>>> anymore. I'll look into it a little bit more and see if this has
>>> something to do with the poor performance I'm seeing with my dataset.
>>>
>>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>>
>>>>>> I'll try that.
>>>>
>>>> There was no significant change after modifying the convergence
>>>> value. At least, none was observed during the first three iterations,
>>>> which lasted roughly the same amount of time as before.
>>>>
>>>>>>> Is there any chance your data is publicly shareable?  Come to
>>>>>>> think of it, with the vector representations, as long as you
>>>>>>> don't publish the key (which terms map to which index), I would
>>>>>>> think most all data is publicly shareable.
>>>>>>
>>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>>
>>>>> As in post a copy of the SequenceFile somewhere for download,  
>>>>> assuming you
>>>>> can.  Then others could presumably try it out.
>>>>
>>>> My bad. Of course it is:
>>>>
>>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>>
>>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>>> SparseVector> logical format.
>>>>
>>>>> That does seem like an awfully long time for 62 MB on a 6-node
>>>>> cluster. How many iterations are running?
>>>>
>>>> I'm running the whole thing with a cap of 20 iterations. Every
>>>> iteration (except the first one, which, oddly, lasted just two
>>>> minutes) took around three hours to complete:
>>>>
>>>> Hadoop job_200907221734_0001
>>>> Finished in: 1mins, 42sec
>>>>
>>>> Hadoop job_200907221734_0004
>>>> Finished in: 2hrs, 34mins, 3sec
>>>>
>>>> Hadoop job_200907221734_0005
>>>> Finished in: 2hrs, 59mins, 34sec
>>>>
>>>>> How did you generate your initial clusters?
>>>>
>>>> I generate the initial clusters via the RandomSeedGenerator with a
>>>> 'k' value of 200. This is what I did to initiate the process for the
>>>> first time:
>>>>
>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>>> ./bin/hadoop jar ~/mahout-core-0.2.jar org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c init -o output -r 32 -d 0.01 -k 200
>>>>
>>>>> Where are the iteration jobs spending most of their time (map  
>>>>> vs. reduce)
>>>>
>>>> I'm tempted to say map here, but their times are rather comparable,
>>>> actually. Reduce attempts are taking about an hour and a half to
>>>> finish on average, and so are map attempts. Here are some
>>>> representative examples from the web UI:
>>>>
>>>> reduce
>>>> attempt_200907221734_0002_r_000006_0
>>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>>
>>>> map
>>>> attempt_200907221734_0002_m_000000_0
>>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>>
>>>> Perhaps there's something wrong with the way I create the
>>>> SequenceFile? I could share the Java code as well, if required.
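
For comparison, a bare-bones way to build such a <Text, SparseVector>
file would look roughly like the sketch below (the id, cardinality and
weights are placeholders, not your actual code):

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Writer writer = new SequenceFile.Writer(
      fs, conf, new Path("input/user.data"), Text.class, SparseVector.class);

  SparseVector vector = new SparseVector(10000);   // placeholder cardinality
  vector.setQuick(42, 1.0);                        // placeholder index/weight
  writer.append(new Text("someUserId"), vector);   // placeholder id
  writer.close();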
>>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

