spark-user mailing list archives

From: "Chawla,Sumit " <sumitkcha...@gmail.com>
Subject: Output Side Effects for different chain of operations
Date: Tue, 13 Dec 2016 16:31:16 GMT
Hi All

I have a workflow with several steps in my program. Let's say these are
steps A, B, C, and D. Step B produces some temp files on each executor node.
How can I add another step, E, that consumes these files?

I understand the easiest choice is to copy all these temp files to a
shared location, and then step E can create another RDD from them and work
on that. But I am trying to avoid this copy. I was wondering if there is
any way I can queue up these files for E as they are generated on the
executors. Is it possible to create a dummy RDD at the start of the
program, and then push these files into this RDD from each executor?
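
To make the idea concrete, here is a minimal sketch of one way the temp
files could be handed to E without copying them: step B emits a small
manifest of (hostname, path) pairs instead of the data, and step E reads
the files named in it. The function names, the file prefix, and the
one-file-per-partition layout are all hypothetical, chosen only for
illustration. The functions are plain Python and are meant to be passed to
mapPartitions and flatMap.

```python
import os
import socket
import tempfile
from typing import Iterable, Iterator, Tuple


def write_partition_to_temp(records: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Hypothetical step B: write one partition to a local temp file and
    yield a (hostname, path) manifest entry instead of the data itself."""
    fd, path = tempfile.mkstemp(prefix="stepB-", suffix=".txt")
    with os.fdopen(fd, "w") as out:
        for record in records:
            out.write(f"{record}\n")
    yield socket.gethostname(), path


def read_temp_file(entry: Tuple[str, str]) -> Iterator[str]:
    """Hypothetical step E: read the file named by a manifest entry.
    Fails loudly if the task is not running on the host that wrote it."""
    host, path = entry
    if host != socket.gethostname():
        raise RuntimeError(f"{path} lives on {host}, not on this node")
    with open(path) as src:
        for line in src:
            yield line.rstrip("\n")


# With Spark this would be wired up roughly as (rdd_a is an assumed input RDD):
#   manifest = rdd_a.mapPartitions(write_partition_to_temp)
#   rdd_e = manifest.flatMap(read_temp_file)
```

When the two calls are chained directly like this, with no shuffle in
between, both functions run in the same task, so the file is read on the
node that wrote it. Once a shuffle or a collect sits between B and E, that
locality is no longer guaranteed, which is why the sketch checks the
hostname before opening the file.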

I take my inspiration from the concept of Side Outputs in Google Dataflow:

https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn



Regards
Sumit Chawla
