spark-user mailing list archives

From Matan Safriel <>
Subject Re: Running a task over a single input
Date Wed, 28 Jan 2015 13:44:54 GMT

So I assume I can safely run a function *F* of mine within the Spark driver
program, without dispatching it to the cluster (?). That way I keep a single
piece of code for *both* a real cluster run over big data and for occasional
small on-demand runs over a single input, with both scenarios using the same
application-specific configuration of my business logic. Is that correct?

Can I still write its output the same way Spark actions allow for a real
distributed task?

Would I see it as a task in the driver's monitoring UI
(http://<driver-node>:4040)?

Thanks for the newb support.
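In code terms, here is a minimal sketch of what I'm imagining (names are placeholders — `process` stands in for my business logic, and the batch helper assumes an existing SparkContext `sc` with illustrative paths):

```python
# Hypothetical sketch: one function `process` shared by both scenarios.

def process(datum: str) -> str:
    # Placeholder business logic; substitute the real computation here.
    return datum.upper()

def run_batch(sc, input_path: str, output_path: str) -> None:
    # Distributed scenario: a normal Spark job over a large dataset.
    # `sc` is an existing SparkContext; the paths are illustrative.
    sc.textFile(input_path).map(process).saveAsTextFile(output_path)

def run_single(datum: str) -> str:
    # Single-input scenario: just call the function on the driver.
    # No RDD is created, no task is scheduled, nothing appears in the UI.
    # The alternative -- sc.parallelize([datum]).map(process).collect() --
    # would only ship the object to an executor to call the same function.
    return process(datum)

print(run_single("one new datum"))  # -> ONE NEW DATUM
```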


On Wed, Jan 28, 2015 at 12:19 PM, Sean Owen <> wrote:

> Processing one object isn't a distributed operation, and doesn't
> really involve Spark. Just invoke your function on your object in the
> driver; there's no magic at all to that.
> You can make an RDD of one object and invoke a distributed Spark
> operation on it, but assuming you mean you have it on the driver,
> that's wasteful. It just copies the object to another machine to
> invoke the function.
> On Wed, Jan 28, 2015 at 10:14 AM, Matan Safriel <>
> wrote:
> > Hi,
> >
> > How would I run a given function in Spark, over a single input object?
> > Would I first add the input to the file system, then somehow invoke the
> > Spark function on just that input? or should I rather twist the Spark
> > streaming api for it?
> >
> > Assume I'd like to run a piece of computation that normally runs over a
> > large dataset over just one newly added datum. I'm a bit reticent about
> > adapting my code to Spark without knowing the limits of this scenario.
> >
> > Many thanks!
> > Matan
