yeah... unreused reusable primitives are of little help, but a Mahout big
data equivalent of the R summary function would be handy to have. The fact is,
we already have the reusable bits anyway. It is common to want column-wise
summaries of big matrices. Useful summaries include:
a) moment-based statistics like average and standard deviation
b) rank-based statistics like min, max, and the 1st, 5th, 25th, 50th, 75th,
95th, and 99th percentiles
c) counts of positive, negative and all entries
d) for word or text-like data, the total number of unique items with
frequency greater than 0, 1, 5 and the top 5-10 most common items.
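
A minimal sketch of (a), (c), and the min/max part of (b) for a single column,
in plain Java. The class name ColumnSummary and its methods are made up for
illustration and are not an existing Mahout or Hadoop API. It assumes a
one-pass Welford update for mean and variance, plus a merge step so partial
summaries from different splits could be combined in a combiner or reducer.
Percentiles and top-N items are left out because they need sorting or an
approximate sketch.

```java
/** Hypothetical one-pass summary of a single matrix column. */
final class ColumnSummary {
    private long count;      // all entries seen
    private long positives;  // entries > 0
    private long negatives;  // entries < 0
    private double mean;     // running mean (Welford update)
    private double m2;       // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Fold one value into the summary using constant memory. */
    void add(double x) {
        count++;
        if (x > 0) positives++;
        if (x < 0) negatives++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    /** Combine two partial summaries, e.g. from two input splits. */
    ColumnSummary merge(ColumnSummary other) {
        ColumnSummary r = new ColumnSummary();
        r.count = count + other.count;
        r.positives = positives + other.positives;
        r.negatives = negatives + other.negatives;
        r.min = Math.min(min, other.min);
        r.max = Math.max(max, other.max);
        if (r.count > 0) {
            // Pairwise update of mean and M2 for two disjoint samples.
            double delta = other.mean - mean;
            r.mean = mean + delta * other.count / r.count;
            r.m2 = m2 + other.m2
                    + delta * delta * ((double) count * other.count / r.count);
        }
        return r;
    }

    long count() { return count; }
    long positives() { return positives; }
    long negatives() { return negatives; }
    double mean() { return mean; }
    double min() { return min; }
    double max() { return max; }

    /** Sample variance; NaN when fewer than two entries have been seen. */
    double variance() { return count > 1 ? m2 / (count - 1) : Double.NaN; }

    double standardDeviation() { return Math.sqrt(variance()); }
}
```

A mapper would keep one such object per column and call add() for each entry;
merge() is what makes the summary safe to compute per split and combine later.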
On Fri, May 6, 2011 at 6:58 AM, Sean Owen <srowen@gmail.com> wrote:
> Hadoop has something like this:
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html
>
> I find there's a very strong and unfortunate tension between
> reusability and performance in some cases. Having a discrete stage to
> compute something like this is good; if it can be computed inline in a
> prior stage and output on the side, that's a big performance savings.
>
> I also find myself tempted to construct a bunch of M/R primitives. For
> now I am trying to restrict my thinking to refactoring pieces that can
> come out easily, and that are used already in at least one place.
>
> I suppose I mean: if you want to write primitive X and can't find one
> good use for it yet in Mahout, I'd hold off, but otherwise would
> surely add it and use it.
>
>
> On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> > MAHOUT-688 has a M/R job to calculate std. deviation for document
> > frequencies so that it can prune noisy words. I'm thinking of making it a
> > bit more generic and adding a stats package to org.apache.mahout.math.hadoop
> > that contains this and other basic stats calculations (mean, variance, sum
> > of squares, etc.) that operate in M/R.
> >
> > Is that useful or am I reinventing the wheel here or wasting time?
> > Seems like such a beast should already exist, but a quick search didn't
> > turn up much.
> >
> > Grant
>
