mahout-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Re: distributed RandomSampler job?
Date Tue, 09 Aug 2011 04:20:24 GMT
Hi Ted,

Thanks for the response. I'll implement it, open a ticket, and post a patch
once I'm satisfied with the outcome.

Cheers,
Tim

On Mon, Aug 8, 2011 at 1:34 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> There is no such job now, but it should be relatively easy to build.  The
> simplest method is to have each mapper produce a full-sized sample, which is
> sent to a single reducer that draws the final sample from them.  The output
> of each mapper needs to include a count of items retained and items
> considered in order for this to work correctly.
>
> This is similar in many respects to sending everything to one reducer, but
> it cuts down on the amount of data that the reducer has to handle.
>
> On Mon, Aug 8, 2011 at 11:47 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>
> > Is there a distributed Mahout job to produce a random sample for a large
> > collection of vectors stored in HDFS? For example, if I wanted only 2M
> > vectors randomly selected from the ASF mail archive vectors (~6M total), is
> > there a Mahout job to do this (I'm using trunk 0.6-SNAPSHOT)? If not, can
> > this be done in a distributed manner using multiple reducers or would I
> > have to send all vectors to 1 reducer and then use RandomSampler in the
> > single reducer?
> >
> > Cheers,
> > Tim
> >
>
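
For reference, the two-stage scheme described in the quoted reply could be sketched roughly as follows. This is only an illustration in plain Python, not Mahout or Hadoop code: the function names reservoir_sample and merge_samples are hypothetical, the "mappers" are ordinary function calls, and the merge step (picking a partition in proportion to the number of items it still represents) is one assumed way to use the "items considered" counts so that the final sample stays uniform.

```python
import random


def reservoir_sample(items, k, rng):
    """Mapper side (hypothetical): keep a uniform sample of up to k items
    and report how many items were considered."""
    sample = []
    seen = 0
    for item in items:
        seen += 1
        if len(sample) < k:
            sample.append(item)
        else:
            # Replace a retained item with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = item
    return sample, seen


def merge_samples(partials, k, rng):
    """Reducer side (hypothetical): combine (sample, seen) pairs into one
    sample of up to k items drawn uniformly from the union of all inputs."""
    pools = [list(sample) for sample, _ in partials]
    remaining = [seen for _, seen in partials]
    for pool in pools:
        rng.shuffle(pool)  # so pop() yields a random retained item
    total = sum(remaining)
    out = []
    while len(out) < k and total > 0:
        # Choose a partition with probability proportional to the number
        # of its original items that have not yet been accounted for.
        r = rng.randrange(total)
        for i, n in enumerate(remaining):
            if r < n:
                break
            r -= n
        out.append(pools[i].pop())
        remaining[i] -= 1
        total -= 1
    return out
```

With this shape, each mapper ships at most k items plus one count to the reducer, which is where the reduction in reducer-side data comes from. A call might look like merge_samples([reservoir_sample(split, k, rng) for split in splits], k, rng), where splits stands in for the per-mapper inputs.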
