spark-user mailing list archives

From David Rosenstrauch <daro...@gmail.com>
Subject Re: How to hold some data in memory while processing rows in a DataFrame?
Date Tue, 23 Jan 2018 18:52:20 GMT
Thanks, but broadcast variables won't achieve what I'm looking to do.  I'm
not just trying to share a one-time set of data across the cluster.
Rather, I'm trying to set up a small cache of info that's constantly being
updated based on the records in the dataframe.
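For concreteness, a minimal PySpark sketch of the kind of cache I mean,
held in a dict inside a mapPartitions function (the column names here are
made up, and this state only lives within a single partition):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1), ("a", 2), ("b", 3)], ["user_id", "value"])

    def process_partition(rows):
        cache = {}  # local to this partition, updated as records stream through
        for row in rows:
            prev = cache.get(row.user_id)   # read what an earlier record stored
            cache[row.user_id] = row.value  # update the cache for later records
            yield (row.user_id, row.value, prev)

    result = df.rdd.mapPartitions(process_partition).toDF(
        ["user_id", "value", "prev_value"])
    result.show()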

DR

On Mon, Jan 22, 2018 at 10:41 PM, naresh Goud <nareshgoud.dulam@gmail.com>
wrote:

> If I understand your requirement correctly:
> use broadcast variables to replicate the small amount of data you want
> to reuse across all nodes.
>
>
>
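For reference, a minimal sketch of the broadcast-variable approach (the
lookup data here is hypothetical).  The broadcast value is read-only on
the executors, which is why it doesn't fit my case:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    lookup = spark.sparkContext.broadcast({"a": "Alice", "b": "Bob"})

    @udf(returnType=StringType())
    def resolve_name(key):
        # reads the replicated copy; writes here would not propagate anywhere
        return lookup.value.get(key, "unknown")

    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["key"])
    df.withColumn("name", resolve_name("key")).show()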
> On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch <darose3@gmail.com>
> wrote:
>
>> This seems like an easy thing to do, but I've been banging my head
>> against the wall for hours trying to get it to work.
>>
>> I'm processing a Spark dataframe (in Python).  As I'm processing it, I
>> want to hold some data from one record in local variables in memory, and
>> then use those values later while processing a subsequent record.  But I
>> can't see any way to do this.
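If the values I need from the previous record are defined by an ordering,
a window function could carry them forward with no mutable state at all.
A minimal sketch, assuming a hypothetical ordering column 'ts':

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 10.0), (2, 12.5), (3, 11.0)], ["ts", "reading"])

    w = Window.orderBy("ts")  # global ordering; partitionBy() scales better
    df.withColumn("prev_reading", lag("reading").over(w)).show()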
>>
>> I tried using:
>>
>> dataframe.select(a_custom_udf_function('some_column'))
>>
>> ... and then reading/writing to local variables in the udf function, but
>> I can't get this to work properly.
>>
>> My next guess would be to use dataframe.foreach(a_custom_function) and
>> try to save data to local variables in there, but I have a suspicion that
>> may not work either.
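One supported way to write anything from inside foreach is an accumulator,
but tasks can only add to it and only the driver can read the result, so
it wouldn't serve as a read/write cache across records.  A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    total = spark.sparkContext.accumulator(0)

    df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

    def tally(row):
        total.add(row.value)  # tasks add; only the driver can read the value

    df.foreach(tally)
    print(total.value)  # 6, read back on the driver after the action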
>>
>>
>> What's the correct way to do something like this in Spark?  In Hadoop I
>> would just go ahead and declare local variables, and read and write to them
>> in my map function as I like.  (Although with the knowledge that a) the
>> same map function would get repeatedly called for records with many
>> different keys, and b) there would be many different instances of my code
>> spread across many machines, and so each map function running on an
>> instance would only see a subset of the records.)  But in Spark it seems to
>> be extraordinarily difficult to create local variables that can be read
>> from / written to across different records in the dataframe.
>>
>> Perhaps there's something obvious I'm missing here?  If so, any help
>> would be greatly appreciated!
>>
>> Thanks,
>>
>> DR
>>
>>
