mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: RecommenderJob and NaN
Date Wed, 12 Oct 2011 18:30:53 GMT
Both local and on EC2

On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:

> Hi Grant,
> 
> Just curious, are you running this locally or distributed?
> 
> I'd run into a similar issue, though in a completely different algorithm (Jimmy Lin's PageRank implementation) due to the use of a static variable.
> 
> When running locally, this wasn't getting cleared between loops, and thus I got wonky results.
> 
> The same thing would have happened with JVM reuse enabled.
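
A minimal sketch of the static-variable problem described above, in plain Java rather than Hadoop code. The class names `LeakyTask` and `CleanTask` are hypothetical and are not taken from the PageRank implementation or from Mahout; the point is only that a static field outlives any single task instance when successive runs share one JVM, as they do in local mode or with JVM reuse.

```java
public class StaticStateSketch {

    /** Hypothetical task whose accumulator is static, so it is shared by every instance in the JVM. */
    static class LeakyTask {
        static double total = 0.0; // never reset between runs

        double run(double[] values) {
            for (double v : values) {
                total += v;
            }
            return total;
        }
    }

    /** Same task with per-instance state: each new instance starts from zero. */
    static class CleanTask {
        private double total = 0.0;

        double run(double[] values) {
            for (double v : values) {
                total += v;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        double[] input = {1.0, 2.0, 3.0};

        // Two "iterations" in the same JVM, each with a fresh task instance.
        double leakyFirst = new LeakyTask().run(input);
        double leakySecond = new LeakyTask().run(input);
        double cleanFirst = new CleanTask().run(input);
        double cleanSecond = new CleanTask().run(input);

        System.out.println("static field:   " + leakyFirst + " then " + leakySecond);
        System.out.println("instance field: " + cleanFirst + " then " + cleanSecond);
    }
}
```

With the static field, the second run starts from the first run's total, so the two results differ even though the input is identical. With the instance field, both runs return the same value.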
> 
> -- Ken
> 
> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> 
>> Digging some more:
>> 
>> In AggregateAndRecommend, around line 143, I have, for userId 0, a simColumn of:
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>> 
>> Which then becomes the numerator and the denom.
>> 
>> Looping, my next simCol is:
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>> 
>> and then
>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>> 
>> ...
>> 
>> Each time, those are getting added into the numerators/denoms value, such that by the time we are done looping (line 161), we have:
>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> 
>> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>> 
>> Not sure how to interpret this, as I haven't dug into the math here yet or figured out where those NaNs are coming from originally.
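
A minimal sketch of why a single NaN per column is enough to turn every accumulated value into NaN. This is plain Java with `java.util.Map`, not the Mahout vector code in AggregateAndRecommend. The class and method names are hypothetical, and it assumes five columns in which each of the five items is NaN exactly once (the thread shows three such columns followed by "..."). The `skipNaN` flag is there only to contrast the two behaviours; it is not a claim about what the correct fix in Mahout is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NanPropagationSketch {

    // Item ids and similarity value copied from the simColumn dumps in the thread.
    static final int[] ITEMS = {22966, 81901, 263375, 263374, 263376};
    static final double SIMILARITY = 0.9566912651062012;

    /** Builds one column in which exactly one item (nanItem) has a NaN similarity. */
    static Map<Integer, Double> column(int nanItem) {
        Map<Integer, Double> col = new LinkedHashMap<>();
        for (int item : ITEMS) {
            col.put(item, item == nanItem ? Double.NaN : SIMILARITY);
        }
        return col;
    }

    /** Adds a column into the running totals; NaN entries are skipped only if skipNaN is true. */
    static void accumulate(Map<Integer, Double> totals, Map<Integer, Double> column, boolean skipNaN) {
        for (Map.Entry<Integer, Double> e : column.entrySet()) {
            double v = e.getValue();
            if (skipNaN && Double.isNaN(v)) {
                continue;
            }
            totals.merge(e.getKey(), v, Double::sum);
        }
    }

    /** Sums five columns; each item is the NaN entry in exactly one of them (an assumption). */
    static Map<Integer, Double> sumColumns(boolean skipNaN) {
        Map<Integer, Double> totals = new LinkedHashMap<>();
        for (int nanItem : ITEMS) {
            accumulate(totals, column(nanItem), skipNaN);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println("plain sum:    " + sumColumns(false));
        System.out.println("NaN skipped:  " + sumColumns(true));
    }
}
```

Because any sum involving NaN is NaN, the plain sum ends up NaN for every item once each item has been the NaN entry in at least one column, which matches the numerators and denoms shown above. Skipping NaN entries leaves a finite total for each item.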
>> 
>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>> 
>>> 
>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>> 
>>>> 
>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>> 
>>>>> Where is the NaN coming up -- what has this value?
>>>> 
>>>> simColumn seems to be the originator in the Aggregate step. For instance, my current breakpoint shows:
>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>> 
>>>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.
>>>> 
>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>> <code>
>>>> /* remove self similarity */
>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>> </code>
>>> 
>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>> 
>>>> 
>>>> 
>>>> 
>>>>> It should be propagated in some cases but not others. I'm not aware of
>>>>> any changes here.
>>>> 
>>>> yeah, me neither.  This is all related to MAHOUT-798.
>>>> 
>>>>> 
>>>>> Generally small data sets will have this problem of not being able to
>>>>> compute much of anything useful, so NaN might be right here.
>>>>> But you say it was different recently, which seems to rule that out.
>>>> 
>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's just that's a whole lot harder to debug.
>>>> 
>>>>> 
>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not getting any recommendations due to NaNs being calculated in the AggregateAndRecommend step. I'm not quite sure what is going on as it seems like this was working as little as two weeks ago (post Sebastian's big change to RecJob), but I don't see a whole lot of changes in that part of the code.
>>>>>> 
>>>>>> The data is user IDs mapping to email thread IDs. My input data is simply a triple of user id, thread id, 1 (meaning that user participated in that thread). It seems like I will have a lot of good values in the inputs to the AggregateAndRecommend step, except one id will be NaN and this then seems to get added in and makes everything NaN (I realize this is a very naive understanding). I sense that I should be looking upstream in the process for a fix, but I am not sure where that is.
>>>>>> 
>>>>>> Any ideas where I should be looking to eliminate these NaNs? If you want to try this with a small data set, you can get it here: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout (but note the companion article is not published yet.)
>>>>>> 
>>>>>> Thanks,
>>>>>> Grant
>>>> 
>>>> 
>>> 
>>> --------------------------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com
>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>> 
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> 
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com

