mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: RecommenderJob and NaN
Date Thu, 13 Oct 2011 06:37:15 GMT
RecommenderJob? The unit tests run it all the time.
There should not be any glitches with static variables -- don't think
there are any.

On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <goksron@gmail.com> wrote:
> Is this job working well for anyone now?
> When was the last time this job worked for someone?
>
> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> Both local and on EC2
>>
>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>
>> > Hi Grant,
>> >
>> > Just curious, are you running this locally or distributed?
>> >
>> > I'd run into a similar issue, though in a completely different algorithm
>> > (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>> >
>> > When running locally, this wasn't getting cleared between loops, and thus
>> > I got wonky results.
>> >
>> > The same thing would have happened with JVM reuse enabled.
>> >
>> > -- Ken
>> >
>> > On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>> >
>> >> Digging some more:
>> >>
>> >> In AggregateAndRecommend, around line 143, I have, for userId 0, a
>> >> simColumn of:
>> >>
>> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>> >>
>> >> Which then becomes the numerator and the denom.
>> >>
>> >> Looping, my next simCol is:
>> >>
>> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>> >>
>> >> and then
>> >>
>> >> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>> >>
>> >> ...
>> >>
>> >> Each time, those are getting added into the numerators/denoms values,
>> >> such that by the time we are done looping (line 161), we have:
>> >> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> >> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>> >>
>> >> numberOfSimilarItemsUsed:
>> >> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>> >>
>> >> Not sure how to interpret this, as I haven't dug into the math here
>> >> yet or figured out where those NaNs are coming from originally.
>> >>
>> >> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>> >>
>> >>>
>> >>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>> >>>
>> >>>>
>> >>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> >>>>
>> >>>>> Where is the NaN coming up -- what has this value?
>> >>>>
>> >>>> simColumn seems to be the originator in the Aggregate step.  For
>> >>>> instance, my current breakpoint shows:
>> >>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> >>>>
>> >>>> I can also see some in the PartialMultiplyMapper via the
>> >>>> similarityMatrixColumn.
>> >>>>
>> >>>> Is that set by SimilarityMatrixRowWrapperMapper?
>> >>>> <code>
>> >>>> /* remove self similarity */
>> >>>> similarityMatrixRow.set(key.get(), Double.NaN);
>> >>>> </code>
>> >>>
>> >>> Ah, but that is just taking care of itself, so maybe not the issue.
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>> It should be propagated in some cases but not others. I'm not aware of
>> >>>>> any changes here.
>> >>>>
>> >>>> yeah, me neither.  This is all related to MAHOUT-798.
>> >>>>
>> >>>>>
>> >>>>> Generally small data sets will have this problem of not being able to
>> >>>>> compute much of anything useful, so NaN might be right here.
>> >>>>> But you say it was different recently, which seems to rule that out.
>> >>>>
>> >>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>> >>>> but that's a whole lot harder to debug.
>> >>>>
>> >>>>>
>> >>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>> >>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>> >>>>>> getting any recommendations due to NaNs being calculated in the
>> >>>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it
>> >>>>>> seems like this was working as little as two weeks ago (post Sebastian's
>> >>>>>> big change to RecJob), but I don't see a whole lot of changes in that
>> >>>>>> part of the code.
>> >>>>>>
>> >>>>>> The data is user ids mapping to email thread ids.  My input data is
>> >>>>>> simply a triple of user id, thread id, 1 (meaning that user participated
>> >>>>>> in that thread).  It seems like I will have a lot of good values in the
>> >>>>>> inputs to the AggregateAndRecommend step, except one id will be NaN, and
>> >>>>>> this then seems to get added in and makes everything NaN (I realize this
>> >>>>>> is a very naive understanding).  I sense that I should be looking
>> >>>>>> upstream in the process for a fix, but I am not sure where that is.
>> >>>>>>
>> >>>>>> Any ideas where I should be looking to eliminate these NaNs?  If you
>> >>>>>> want to try this with a small data set, you can get it here:
>> >>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
>> >>>>>> (but note the companion article is not published yet.)
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Grant
>> >>>>
>> >>>>
>> >>>
>> >>> --------------------------------------------
>> >>> Grant Ingersoll
>> >>> http://www.lucidimagination.com
>> >>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> >>>
>> >>
>> >> --------------------------------------------
>> >> Grant Ingersoll
>> >> http://www.lucidimagination.com
>> >> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>> >>
>> >
>> > --------------------------
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://bixolabs.com
>> > custom big data solutions & training
>> > Hadoop, Cascading, Mahout & Solr
>> >
>> >
>> >
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>
>>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
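
For anyone following the numerators/denoms observation quoted above, here is a
minimal standalone sketch of why a single NaN per simColumn is enough to turn
every accumulated entry into NaN. It is not Mahout's AggregateAndRecommend
code: the class name, the helper methods, and the assumption that there are
five columns with the NaN in a different position each time are all made up
for illustration (the thread shows three such columns followed by "..."). The
sumSkippingNaN variant only shows the contrast; it is not a claim about how
RecommenderJob handles, or should handle, these entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Mahout.
class NanPropagationSketch {

    // Similarity value and item ids as quoted in the thread.
    static final double SIM = 0.9566912651062012;
    static final long[] ITEM_IDS = {22966L, 81901L, 263375L, 263374L, 263376L};

    /** Builds one column in which exactly one item (nanId) carries NaN. */
    static Map<Long, Double> column(long nanId) {
        Map<Long, Double> col = new LinkedHashMap<>();
        for (long id : ITEM_IDS) {
            col.put(id, id == nanId ? Double.NaN : SIM);
        }
        return col;
    }

    /** Assumed input: one column per item, NaN in a different slot each time. */
    static List<Map<Long, Double>> sampleColumns() {
        List<Map<Long, Double>> cols = new ArrayList<>();
        for (long id : ITEM_IDS) {
            cols.add(column(id));
        }
        return cols;
    }

    /** Entry-wise sum with no NaN check: x + NaN is NaN, so the NaN sticks. */
    static Map<Long, Double> sum(List<Map<Long, Double>> columns) {
        Map<Long, Double> totals = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return totals;
    }

    /** Entry-wise sum that ignores NaN entries instead of adding them. */
    static Map<Long, Double> sumSkippingNaN(List<Map<Long, Double>> columns) {
        Map<Long, Double> totals = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    totals.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<Long, Double>> cols = sampleColumns();
        System.out.println("plain sum:    " + sum(cols));
        System.out.println("skipping NaN: " + sumSkippingNaN(cols));
    }
}
```

Because every item id receives a NaN from exactly one column, the plain sum
ends up NaN for all five ids, which matches the shape of the numerators and
denoms maps quoted above.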

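The static-variable pitfall Ken describes in the quoted text (state that is
not cleared between loops when running locally, or with JVM reuse enabled) can
be sketched in a few lines. This is a made-up example, not code from Mahout or
from the PageRank implementation he mentions; LeakyStep and CleanStep are
hypothetical names, and Sean's reply above says he does not think
RecommenderJob has any such statics.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of static state leaking across runs in one JVM.
class StaticStateSketch {

    /** Accumulates into a static field, so every run in this JVM shares it. */
    static class LeakyStep {
        private static double total = 0.0; // never reset between runs

        double run(List<Double> values) {
            for (double v : values) {
                total += v;
            }
            return total;
        }
    }

    /** Same logic with per-instance state, so each run starts from zero. */
    static class CleanStep {
        private double total = 0.0;

        double run(List<Double> values) {
            for (double v : values) {
                total += v;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        List<Double> input = Arrays.asList(1.0, 2.0);

        // Two "iterations" in the same JVM, as in a local run or with JVM reuse.
        System.out.println("leaky, run 1: " + new LeakyStep().run(input));
        System.out.println("leaky, run 2: " + new LeakyStep().run(input));

        System.out.println("clean, run 1: " + new CleanStep().run(input));
        System.out.println("clean, run 2: " + new CleanStep().run(input));
    }
}
```

When each task gets a fresh JVM, the static field starts from its initial
value every time, which is why this kind of bug tends to show up only in local
mode or with JVM reuse turned on.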