mahout-user mailing list archives

From Gabriel Webster <>
Subject Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?
Date Fri, 22 Oct 2010 06:45:13 GMT
OK no worries then, thanks for explaining.  I was thinking of using the 
class as part of transforming ratings to z-scores as a preprocessing 
step (or a wrapping recommender), and although the accuracy issue 
wouldn't come up (since there are only up to a few thousand ratings per 
user), the possible speed-up over millions of users might make it 
worthwhile from a value point of view (value = amount of improvement / 
amount of work to implement).  But since it's just averaging, it's 
trivial to do it myself in a different class.
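Since it really is just averaging, a total-and-count version of the idea from this thread can be sketched in a few lines. (The class and method names below are illustrative only, not Mahout's actual API.)

```java
// Sketch of the total-and-count design proposed in this thread:
// store the running total and the count, divide only on read.
public class TotalCountRunningAverage {

    private int count;
    private double total;

    public void addDatum(double datum) {
        total += datum;  // one addition per datum
        count++;         // one increment per datum
    }

    public double getAverage() {
        // Division is deferred to read time, so no rounding error
        // accumulates across adds; cost is one divide per read.
        return count == 0 ? Double.NaN : total / count;
    }

    public static void main(String[] args) {
        TotalCountRunningAverage avg = new TotalCountRunningAverage();
        avg.addDatum(1.0);
        avg.addDatum(2.0);
        avg.addDatum(3.0);
        System.out.println(avg.getAverage()); // prints 2.0
    }
}
```

This keeps the same memory footprint Sean describes (one int, one double) but moves the division from every write to every read, which is why it only pays off in write-heavy use.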

A documentation note at the top of the class explaining the motivation 
behind the implementation might be useful to future code explorers, if 
it's not too much bother.  You could probably largely just copy and 
paste your post.

On 10/21/10 5:26 PM, Sean Owen wrote:
> This class is used mostly in slope-one. The priorities are memory usage,
> then speed, then accuracy in that context.
> Because of memory concerns, I don't want to store more than an int and
> double. So yes you can store a total and count. So we're left trading speed
> for accuracy. Your change is a bit more accurate but slower at reads (must
> divide to get the average).
> For slope one, which is read-heavy, and where accuracy is not a big deal,
> this isn't a worthwhile tradeoff I think.
> What's the use case you have in mind?
> On Thu, Oct 21, 2010 at 10:03 AM, gabeweb<>  wrote:
>> I'm a bit concerned by the implementation of FullRunningAverage.  It's
>> probably not a big deal, but because it performs division each time a new
>> datum is added, there will be a floating point error that will increase
>> over
>> time, and if you average together millions of data points, this error may
>> become significant.  I know that Java doubles have really huge precision,
>> but still.
>> In addition, the implementation seems rather inefficient in that it
>> performs
>> 6 math operations every time a datum is added.  Since adding to a running
>> average is likely to happen much more often than accessing the average,
>> wouldn't it be a lot more efficient to maintain the *total* and count
>> internally, rather than the current average and the count, and then when
>> someone requests the average, calculate the actual average on the fly by
>> dividing the total by the count?  Then it would just be two addition
>> operations to add a datum.  This would simultaneously eliminate the
>> accumulation of floating point errors.
>> If people agree, I'd be happy to fix this and submit a patch.  It's a
>> simple
>> thing, but it's sort of bothering me :)  Plus it will allow me to get
>> familiar with the process.
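For contrast, the read-free behavior Sean's reply depends on comes from storing the current average and updating it incrementally. The sketch below uses one standard incremental form; it is illustrative only and not necessarily Mahout's exact arithmetic.

```java
// Sketch of the store-the-average design: each add updates the
// average in place, so reads cost nothing extra.
public class IncrementalRunningAverage {

    private int count;
    private double average;

    public void addDatum(double datum) {
        count++;
        // Each add pays a subtraction, a division, and an addition,
        // and each datum's rounding error is folded into the stored
        // average -- the accuracy concern raised in this thread.
        average += (datum - average) / count;
    }

    public double getAverage() {
        return average; // reads are free: no division needed
    }

    public static void main(String[] args) {
        IncrementalRunningAverage avg = new IncrementalRunningAverage();
        avg.addDatum(1.0);
        avg.addDatum(2.0);
        avg.addDatum(3.0);
        System.out.println(avg.getAverage()); // prints 2.0
    }
}
```

Seen side by side, the tradeoff in the thread is clear: this form is cheaper per read and dearer per add, the total-and-count form is the reverse, so the right choice depends on the read/write ratio of the caller (read-heavy for slope-one, write-heavy for the z-score preprocessing above).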
>> --
>> View this message in context:
>> Sent from the Mahout User List mailing list archive at