spark-user mailing list archives

From Jey Kottalam <...@cs.berkeley.edu>
Subject Re: RDD pipe example. Is this a bug or a feature?
Date Sat, 20 Sep 2014 00:21:53 GMT
Your proposed use of rdd.pipe("foo") to communicate with an external
process seems fine. The "foo" program should read its input from
stdin, perform its computations, and write its results back to stdout.
Note that "foo" will be run on the workers, invoked once per
partition, and the result will be an RDD[String] containing an entry
for each line of output from your program.
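For illustration (this sketch is not from the original thread), a minimal pipe target is just a program that filters stdin to stdout; the "piped:" prefix is an arbitrary choice:

```shell
#!/bin/sh
# Minimal pipe target: Spark starts one copy of this program per
# partition, feeds the partition's elements on stdin (one per line),
# and each line printed to stdout becomes one element of the
# resulting RDD[String].
while read line; do
  echo "piped: $line"
done
```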

-Jey

On Fri, Sep 19, 2014 at 3:59 PM, Andy Davidson
<Andy@santacruzintegration.com> wrote:
> Hi Jey
>
> Many thanks for the code example. Here is what I really want to do. I want
> to use Spark Streaming and Python. Unfortunately PySpark does not support
> streaming yet. It was suggested that the way to work around this was to use
> an RDD pipe. The example below was a little experiment.
>
> You can think of my system as following the standard Unix shell pipe
> design:
>
> Stream of data -> Spark -> downstream system not implemented in Spark
>
> After seeing your example code I now understand how the stdin and stdout get
> configured.
>
> It seems like pipe() does not work the way I want. I guess I could open a
> socket and write to the downstream process.
>
> Any suggestions would be greatly appreciated
>
> Thanks Andy
>
> From: Jey Kottalam <jey@cs.berkeley.edu>
> Reply-To: <jey@cs.berkeley.edu>
> Date: Friday, September 19, 2014 at 12:35 PM
> To: Andrew Davidson <Andy@SantaCruzIntegration.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: RDD pipe example. Is this a bug or a feature?
>
> Hi Andy,
>
> That's a feature -- you'll have to print out the return value from
> collect() if you want the contents to show up on stdout.
>
> Probably something like this:
>
> for (String line : rdd.pipe(pwd + "/src/main/bin/RDDPipe.sh").collect())
>     System.out.println(line);
>
>
> Hope that helps,
> -Jey
>
> On Fri, Sep 19, 2014 at 11:21 AM, Andy Davidson
> <Andy@santacruzintegration.com> wrote:
>
> Hi
>
> I wrote a little Java job to try to figure out how RDD pipe works.
> Below is my test shell script. If I turn debugging on in the script, I see
> output in my console. If debugging is turned off in the shell script, I do
> not see anything in my console. Is this a bug or a feature?
>
> I am running the job locally on a Mac
>
> Thanks
>
> Andy
>
>
> Here is my Java
>
>          rdd.pipe(pwd + "/src/main/bin/RDDPipe.sh").collect();
>
>
>
> #!/bin/sh
> #
> # Use this shell script to figure out how spark RDD pipe() works
> #
>
> set -x # turns shell debugging on
> #set +x # turns shell debugging off
>
> while read x ;
> do
> echo RDDPipe.sh $x ;
> done
>
>
>
> Here is the output if debugging is turned on:
>
> $ !grep
> grep RDDPipe run.sh.out
> + echo RDDPipe.sh 0
> + echo RDDPipe.sh 0
> + echo RDDPipe.sh 2
> + echo RDDPipe.sh 0
> + echo RDDPipe.sh 3
> + echo RDDPipe.sh 0
> + echo RDDPipe.sh 0
> $
>
>
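
Andy's idea above of writing to the downstream process could also be handled inside the pipe target itself: echo each line to stdout (so pipe() still yields an RDD[String]) and simultaneously forward it to a sink. This is a rough sketch, not from the thread; the sink here is a file for illustration, and on a real system it might be a socket reached by replacing `tee -a` with something like `nc <host> <port>`:

```shell
#!/bin/sh
# Hypothetical pipe target that both returns lines to Spark and
# forwards them downstream. Stdout is what Spark collects; tee -a
# duplicates each line into the (illustrative) downstream sink.
while read line; do
  echo "$line"
done | tee -a /tmp/downstream.log
```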

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

