mahout-user mailing list archives

From Andrew Musselman <andrew.mussel...@gmail.com>
Subject Re: Preserve contents of keys after running k-means
Date Sat, 06 Jul 2013 00:23:38 GMT
Seems like this ought to be a high-priority feature.  For the moment we are doing the dumb
thing but I'll take a look and see about a patch.

Thanks
Andrew

On Jul 5, 2013, at 2:53 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Andrew,
> 
> I was being somewhat stupid.  You are talking about a parallel program.
> There is no single counter.
> 
> The row number is what I was referring to.  Each process will have
> consecutive row numbers starting at 0.  These rows will correspond to a
> sequence of rows in the original data.  If you can cause each process to
> record these id's as they go by, you have the thing you need.
> 
> I haven't looked at this code in several years, however, so my suggestions
> may well be quite far from reasonable.
> 
> 
> 
> On Fri, Jul 5, 2013 at 2:34 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
> 
>> Ted, I'm having a tough time finding the "internal ids" you mentioned.
>> Where are they output?
>> 
>> Thanks
>> 
>> 
>> On Fri, Jul 5, 2013 at 2:10 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>> 
>>> :)
>>> 
>>> Aha, we were only looking in the points directory, not inside the
>>> clustered points directory.  So if I understand, you're suggesting that we
>>> use the key at the beginning of the clustered points as a one-to-one map.
>>> The number of unique keys in the output doesn't seem to line up with that
>>> in the input.
>>> 
>>> We may do our dumb idea for now until we get a better handle on how the
>>> output is written.
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> 
>>>> Andrew,
>>>> 
>>>> That is a pretty clever solution.
>>>> 
>>>> I think that you can get by with a simpler solution by noting how the
>>>> internal id's are assigned (sequentially, I think).
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
>>>> 
>>>>> So how are people working around this without patching 0.7?
>>>>> Downgrading to 0.6?
>>>>> 
>>>>> We're on a cluster where we don't have admin rights to patch Mahout.
>>>>> 
>>>>> Our dumb idea now is to hash the concatenated values of each vector and
>>>>> pair that up with our original ids, then run another process on the points
>>>>> results to hash the results, then join up on hash value to pull id together
>>>>> with cluster #.
>>>>> 
>>>>> Anyone have a nicer solution to this at hand?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jul 5, 2013 at 1:02 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>> This feature was available prior to Mahout 0.7 (clustering had support for
>>>>>> Named Vectors) and was broken later. While this may not be fixed in the
>>>>>> soon to be Mahout 0.8, there is a JIRA that's open for this -
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1030 that's been targeted
>>>>>> for 0.9. Please feel free to submit a patch if you would like to take a
>>>>>> shot at it.
>>>>>> 
>>>>>> Suneel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ________________________________
>>>>>> From: Andrew Musselman <andrew.musselman@gmail.com>
>>>>>> To: user@mahout.apache.org
>>>>>> Sent: Friday, July 5, 2013 3:05 PM
>>>>>> Subject: Preserve contents of keys after running k-means
>>>>>> 
>>>>>> 
>>>>>> Hi list
>>>>>> 
>>>>>> We are trying to do some k-means clustering and are wondering if there's an
>>>>>> easy way to preserve the contents of the keys for the input records.
>>>>>> 
>>>>>> E.g.
>>>>>> 
>>>>>> 12345: (0,3,79,80)
>>>>>> 98765: (1,4,98,90)
>>>>>> 
>>>>>> where the vectors being clustered are the tuples and the keys are some id.
>>>>>> 
>>>>>> When we run clusterdump with pointsDir specified we have the vectors but
>>>>>> not the keys.  We're looking at NamedVector as a path to this solution, as
>>>>>> well as looking at a mapping file between ordered integers and the ids in
>>>>>> order.
>>>>>> 
>>>>>> Thanks for any advice.
>>>>>> 
>>>>>> Best
>>>>>> Andrew
>> 
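
For reference, the hash-and-join workaround described in this thread (hash each vector's values, pair the hash with the original id, hash the clustered points the same way, then join on the hash) could be sketched roughly as below. This is only an illustration in plain Python, not Mahout code: the function names vector_hash and join_ids_to_clusters are hypothetical, and it assumes the input records and the clustered points have already been read into (id, vector) and (cluster, vector) pairs, and that the vector values come back from clustering unchanged.

```python
import hashlib


def vector_hash(values):
    """Hash a vector by joining its values into one canonical string.

    Values are converted to float first so that 3 and 3.0 hash alike.
    """
    canonical = ",".join(repr(float(v)) for v in values)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def join_ids_to_clusters(inputs, clustered):
    """Map each original id to a cluster by joining on the vector hash.

    inputs:    iterable of (record_id, vector) from the original data
    clustered: iterable of (cluster_id, vector) from the clustered points
    """
    # Several ids may share an identical vector, so keep a list per hash.
    ids_by_hash = {}
    for record_id, vector in inputs:
        ids_by_hash.setdefault(vector_hash(vector), []).append(record_id)

    id_to_cluster = {}
    for cluster_id, vector in clustered:
        for record_id in ids_by_hash.get(vector_hash(vector), []):
            id_to_cluster[record_id] = cluster_id
    return id_to_cluster


if __name__ == "__main__":
    # Example ids and vectors from the original question; the cluster
    # numbers are made up for illustration.
    inputs = [("12345", (0, 3, 79, 80)), ("98765", (1, 4, 98, 90))]
    clustered = [(1, (1.0, 4.0, 98.0, 90.0)), (0, (0.0, 3.0, 79.0, 80.0))]
    print(join_ids_to_clusters(inputs, clustered))
```

Records with identical vectors share a hash, which is harmless here only on the assumption that identical vectors end up in the same cluster. If the clustering step rescales or otherwise alters the vector values, the hashes will no longer match and the join will come back empty.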
