mahout-user mailing list archives

From Sean Owen <>
Subject Re: Interesting MapReduce variant: MapFreeduce
Date Sun, 15 May 2011 18:23:04 GMT
Yeah, as I understand it, it has to stream data to and from the worker, as
the sandbox allows no access to the file system or network (other than the
originating host). On the plus side, that limits the damage this can do to
a user's PC.

And yes, this strikes me as one of the key issues with the model. It works
OK for smallish jobs, or for those that are more CPU-bound than I/O-bound.
I think this grew out of a distributed computing technology built to handle
BOINC-style physics simulations, indeed.
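
A back-of-the-envelope sketch of that trade-off, in Python. It is
illustrative only: the function name, its parameters, and every number in
the example are assumptions made up for this sketch, not measurements or
APIs from MapFreeduce or Mahout. It compares the time to ship data out of
central storage with the time to process it for a single pass.

```python
def transfer_vs_compute(dataset_gb, offsite_fraction, uplink_mbps,
                        cpu_sec_per_gb, workers):
    """Return (transfer_seconds, compute_seconds) for one pass over the data.

    Hypothetical model: all off-site data leaves through one shared uplink,
    and compute is split evenly across identical workers.
    """
    # Data that must leave central storage to reach remote workers.
    shipped_gb = dataset_gb * offsite_fraction
    # 1 GB = 8000 megabits (decimal units).
    transfer_seconds = shipped_gb * 8000 / uplink_mbps
    # Total CPU work divided evenly among the workers.
    compute_seconds = dataset_gb * cpu_sec_per_gb / workers
    return transfer_seconds, compute_seconds


# Assumed example: 100 GB dataset, 90% of it processed on workstations,
# a 1 Gbit/s uplink, 60 CPU-seconds of work per GB, 100 workers.
transfer_s, compute_s = transfer_vs_compute(100, 0.9, 1000, 60, 100)
print("network-bound" if transfer_s > compute_s else "CPU-bound")
```

Under these assumed numbers the job spends far longer moving data than
computing on it; raising cpu_sec_per_gb (a more CPU-intensive job) is what
tips the balance the other way.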

It's not going to be a good model for a lot of problems, but it's cool
enough to warrant thinking about what it might be good for. If you can
afford long-running jobs that throttle network usage and all that, it could
be a cheap-o way for a small organization to do something interesting with

On Sun, May 15, 2011 at 6:30 PM, Jeremy Lewi <> wrote:

> Thanks for the link Sean.
> Whenever I looked into recovering wasted compute cycles (e.g. by letting
> a job scheduler like Sun Grid Engine fire off jobs during downtime), we
> found that the hassle of administering such a heterogeneous environment
> wasn't worth it. Maybe running as an applet under Hadoop, with the
> virtual environment that implies, will make that easier.
> If you're running in an applet without HDFS, doesn't that mean you're
> "moving both data and computation to the machine" as opposed to "moving
> computation to the data"? Would this be a big issue for Mahout? For
> example, if you're running k-means and 90% of your machines are
> workstations that would otherwise be idle, then wouldn't you need to
> transfer roughly 90% of your dataset to the various clients (i.e. each
> client might only receive a small fraction, but 90% still needs to be
> shipped out of your central storage)? It seems like network bottlenecks
> could easily swamp the benefits of using workstation cycles.
> J
> On Sun, 2011-05-15 at 18:09 +0100, Sean Owen wrote:
> > Hi all, in my travels I've come across a small interesting startup that I
> > thought might be of interest to the user@ audience. It's MapFreeduce (
> >, and they're spinning an interesting twist on
> > MapReduce. They've constructed a simplified MapReduce API, one for which
> > workers are able to run as Java applets in the browser sandbox.
> >
> > It's interesting for two reasons, I can tell you, after playing with it
> > myself. One, I think it's interesting as it asks whether a simpler
> > version of MapReduce than what you get in Hadoop is viable. That is --
> > it's not Hadoop. Can you do something interesting without, say, direct
> > access to HDFS? Combiners? Custom InputFormats? And two, since it can
> > fairly automatically turn office PCs with a browser into a safe
> > background MR worker, it might let organizational skunk-works create a
> > cluster for cheap out of truly unused cycles to do something interesting.
> >
> > I managed to reconstruct parts of the recommender pipeline on this
> > framework without too much modification. It is possible to 'port' some
> > parts of Mahout to this framework, if not all. MapReduce fans will
> > probably enjoy taking a look at what they can get away with in a browser
> > sandbox.
> >
> > From a conversation with their founder I know they'd really like feedback
> > and testers. Here's their pitch and plea for beta users in their own
> > words. (I have no affiliation with or interest in the company.)
> >
> >
> > *" is a Washington DC-based startup making Big Data
> > accessible to everyone. Our software service enables users to quickly and
> > easily build a mapreduce cluster from the spare CPU-cycles of available
> > computers without installing or configuring any software. To add a node
> > to your MapFreeduce cluster and increase its power, you simply click on a
> > link from any idle computer. You can scale your cluster to thousands of
> > nodes to perform computation- and data-intensive tasks such as web
> > indexing, data mining, business analytics, data warehousing, machine
> > learning, financial analysis, scientific simulation, and bioinformatics
> > research. MapFreeduce allows you to focus on crunching your data without
> > having to worry about either the cost and complexity of setting up a
> > traditional hardware cluster or the perpetual fees charged per hour and
> > per node by common cloud providers.
> >
> > We are looking for individuals that would be interested in joining our
> > free, private beta test and/or providing feedback to our service."*
