spark-dev mailing list archives

From "Chawla,Sumit " <sumitkcha...@gmail.com>
Subject Re: Output Side Effects for different chain of operations
Date Thu, 15 Dec 2016 21:44:14 GMT
I am already creating these files on the slaves.  How can I create an RDD from
the files on these slaves?
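
One way to approach this (a sketch only, not a verified Spark recipe) is to have step B's tasks return the `(hostname, local_path)` pairs for the files they wrote, collect those pairs on the driver, and group them by host so a later step can try to run where the files already live. The names `rdd_b` and `write_files_and_report_paths` below are hypothetical; only the driver-side bookkeeping is shown:

```python
from collections import defaultdict

def group_paths_by_host(host_path_pairs):
    """Group (hostname, local_path) pairs reported by step B's tasks,
    so a later step can try to schedule work where the files already live."""
    by_host = defaultdict(list)
    for host, path in host_path_pairs:
        by_host[host].append(path)
    return dict(by_host)

# Driver-side sketch (the Spark calls are illustrative, not verified):
# pairs = rdd_b.mapPartitions(write_files_and_report_paths).collect()
# placement = group_paths_by_host(pairs)
```

With the placement map in hand, one could build a new RDD whose partitions carry preferred locations, though Spark gives no hard guarantee that tasks run on the preferred host.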

Regards
Sumit Chawla


On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin <rxin@databricks.com> wrote:

> You can just write some files out directly (and idempotently) in your
> map/mapPartitions functions. It is just a function that you can run
> arbitrary code after all.
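
A sketch of the suggestion above, in plain Python standing in for the executor-side function (paths and names are illustrative): writing each partition's output to a temp file and then renaming it to a deterministic path makes a retried task overwrite its earlier output rather than duplicate it, which is what keeps the side effect idempotent.

```python
import os
import tempfile

def write_partition_idempotently(partition_id, records, out_dir):
    """Write one partition's records to a deterministic local path.

    Writing to a temp file first and then renaming means a retried
    task replaces its previous output instead of duplicating it.
    """
    final_path = os.path.join(out_dir, "part-%05d" % partition_id)
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write(str(rec) + "\n")
    os.replace(tmp_path, final_path)  # atomic rename on POSIX
    return final_path

# Inside Spark this would run per partition, e.g. (illustrative only):
# rdd.mapPartitionsWithIndex(
#     lambda i, it: [write_partition_idempotently(i, it, "/tmp/step_b")])
```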
>
>
> On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <sumitkchawla@gmail.com>
> wrote:
>
>> Any suggestions on this one?
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkchawla@gmail.com>
>> wrote:
>>
>>> Hi All
>>>
>>> I have a workflow with different steps in my program.  Let's say these are
>>> steps A, B, C, D.  Step B produces some temp files on each executor node.
>>> How can I add another step E which consumes these files?
>>>
>>> I understand the easiest choice is to copy all these temp files to a
>>> shared location, and then step E can create another RDD from it and work on
>>> that.  But I am trying to avoid this copy.  I was wondering if there is any
>>> way I can queue up these files for E as they are getting generated on the
>>> executors.  Is there any possibility of creating a dummy RDD at the start
>>> of the program, and then pushing these files into this RDD from each executor?
>>>
>>> I take my inspiration from the concept of Side Outputs in Google
>>> Dataflow:
>>>
>>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
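
The side-outputs idea referenced above can be mimicked within a single pass over a partition's records by routing each record to a named output. A minimal, Spark-free sketch (the function and tag names are illustrative, not any Spark or Dataflow API):

```python
def route_to_side_outputs(records, classify):
    """Send each record to a named output in one pass, loosely mimicking
    Dataflow's side outputs at the level of a single partition."""
    outputs = {}
    for rec in records:
        outputs.setdefault(classify(rec), []).append(rec)
    return outputs

# route_to_side_outputs([1, 2, 3, 4],
#                       lambda x: "even" if x % 2 == 0 else "odd")
```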
>>>
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>
