mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Sun, 26 Jul 2009 18:37:35 GMT
While (still) experiencing performance issues and inspecting the kMeans
code, I found this lying around in SquaredEuclideanDistanceMeasure.java:

  public double distance(double centroidLengthSquare, Vector centroid,
Vector v) {
    if (centroid.size() != centroid.size()) {
      throw new CardinalityException();
    }
    ...
   }

I bet someone meant to compare the sizes of centroid and v and didn't notice.
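
The intended check is presumably (just my guess at the fix):

    if (centroid.size() != v.size()) {
      throw new CardinalityException();
    }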

On Fri, Jul 24, 2009 at 12:38 PM, nfantone<nfantone@gmail.com> wrote:
> Well, as it turned out, it didn't have anything to do with my
> performance issue, but I did find out that writing a Cluster (with a
> single vector as its center) to a file and then reading it back
> requires the center to be added as a point; otherwise, you won't be
> able to retrieve it properly. Therefore, one should do:
>
> // Writing
> String id = "someID";
> Vector v = new SparseVector();
> Cluster c = new Cluster(v);
> c.addPoint(v);
> seqWriter.append(new Text(id), c);
>
> // Reading
> Writable key = (Writable) seqReader.getKeyClass().newInstance();
> Cluster value = (Cluster) seqReader.getValueClass().newInstance();
> while (seqReader.next(key, value)) {
>   ...
>   Vector centroid = value.getCenter();
>   ...
> }
>
> This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I don't
> think this should be necessary, but then again, it's not that
> relevant, I guess.
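>
> For completeness, here's roughly how seqWriter and seqReader are set up
> (a minimal sketch using the plain Hadoop SequenceFile API; the path is
> made up):
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
> Path path = new Path("clusters/part-00000"); // hypothetical file
> SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf,
>     path, Text.class, Cluster.class);
> ...
> SequenceFile.Reader seqReader = new SequenceFile.Reader(fs, path, conf);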
>
> Sorry for bringing different subjects to the same thread.
>
> On Fri, Jul 24, 2009 at 9:14 AM, nfantone<nfantone@gmail.com> wrote:
>> I've been using RandomSeedGenerator to generate initial clusters for
>> kMeans and while checking its code I stumbled upon this:
>>
>>      while (reader.next(key, value)) {
>>        Cluster newCluster = new Cluster(value);
>>        newCluster.addPoint(value);
>>        ....
>>      }
>>
>> I can see it adds the vector to the newly created cluster, even though
>> the constructor already sets it as the center. Wasn't this corrected
>> in a past revision? I thought this was no longer necessary. I'll look
>> into it a bit more and see whether it has something to do with the
>> poor performance on my dataset.
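>>
>> In other words, I expected something along these lines to suffice
>> (assuming the constructor alone is enough to set the center):
>>
>> while (reader.next(key, value)) {
>>   Cluster newCluster = new Cluster(value); // value becomes the center
>>   ...
>> }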
>>
>> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<nfantone@gmail.com> wrote:
>>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>>
>>>>> I'll try that.
>>>
>>> There was no significant change after modifying the convergence value.
>>> At least, none was observed during the first three iterations, which
>>> lasted more or less the same amount of time as before.
>>>
>>>>>> Is there any chance your data is publicly shareable?  Come to think
>>>>>> of it, with the vector representations, as long as you don't publish
>>>>>> the key (which terms map to which index), I would think most all data
>>>>>> is publicly shareable.
>>>>>
>>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>>> shareable? As in user-permissions to access/read/write the data?
>>>>
>>>> As in post a copy of the SequenceFile somewhere for download, assuming you
>>>> can.  Then others could presumably try it out.
>>>
>>> My bad. Of course it is:
>>>
>>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>>
>>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>>> SparseVector> logical format.
>>>
>>>> That does seem like an awfully long time for 62 MB on a 6 node
>>>> cluster. How many iterations are running?
>>>
>>> I'm running the whole thing with a 20-iteration cap. Every iteration
>>> (EXCEPT the first one which, oddly, lasted just two minutes) took
>>> around 3 hours to complete:
>>>
>>> Hadoop job_200907221734_0001
>>> Finished in: 1mins, 42sec
>>>
>>> Hadoop job_200907221734_0004
>>> Finished in: 2hrs, 34mins, 3sec
>>>
>>> Hadoop job_200907221734_0005
>>> Finished in: 2hrs, 59mins, 34sec
>>>
>>>> How did you generate your initial clusters?
>>>
>>> I generate the initial clusters via the RandomSeedGenerator, setting a
>>> 'k' value of 200. This is what I did to initiate the process for the
>>> first time:
>>>
>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>>> init -o output -r 32 -d 0.01 -k 200
>>>
>>>> Where are the iteration jobs spending most of their time (map vs. reduce)?
>>>
>>> I'm tempted to say map here, but their running times are actually
>>> rather comparable. Reduce attempts are taking an hour and a half to
>>> finish (on average), and so are map attempts. Here are some
>>> representative examples from the web UI:
>>>
>>> reduce
>>> attempt_200907221734_0002_r_000006_0
>>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>>
>>> map
>>> attempt_200907221734_0002_m_000000_0
>>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>>
>>> Perhaps there's something amiss in the way I create the
>>> SequenceFile? I could share the Java code as well, if required.
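>>>
>>> In the meantime, the gist of it is something like this (a rough sketch
>>> from memory; the identifiers and cardinality are made up):
>>>
>>> SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>>>     new Path("user.data"), Text.class, SparseVector.class);
>>> for (String userId : userIds) {
>>>   Vector v = new SparseVector(CARDINALITY);
>>>   v.set(featureIndex, featureValue); // one entry per non-zero feature
>>>   writer.append(new Text(userId), v);
>>> }
>>> writer.close();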
>>>
>>
>
