spark-user mailing list archives

From David Rosenstrauch <>
Subject Re: How to hold some data in memory while processing rows in a DataFrame?
Date Tue, 23 Jan 2018 18:52:20 GMT
Thanks, but broadcast variables won't achieve what I'm looking to do.  I'm
not trying to just share a one-time set of data across the cluster.
Rather, I'm trying to set up a small cache of info that's constantly being
updated based on the records in the dataframe.
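
To make the requirement concrete, here is a minimal sketch of the kind of
running state I mean. It is illustrative only: the function name
process_partition and the two-column (key, value) row layout are assumptions
made for this example, not code from my actual job. It keeps a small cache
while iterating over the rows of a single partition and pairs each row with
the last value seen for the same key.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def process_partition(
    rows: Iterable[Tuple[str, int]],
) -> Iterator[Tuple[str, int, Optional[int]]]:
    """Yield (key, value, previous value seen for that key) for each row.

    The cache is an ordinary local dict, so it only remembers rows from the
    partition (iterator) it is called on. Hypothetical example layout:
    every row is a (key, value) pair.
    """
    cache: Dict[str, int] = {}
    for key, value in rows:
        previous = cache.get(key)  # None if this key has not been seen yet
        cache[key] = value         # update the cache from the current row
        yield key, value, previous


if __name__ == "__main__":
    sample = [("a", 1), ("b", 2), ("a", 3)]
    for result in process_partition(sample):
        print(result)
```

In PySpark, a function of this shape could presumably be passed to
df.rdd.mapPartitions(process_partition) on a two-column DataFrame, with the
caveats that each partition gets its own independent cache and that row
order within a partition is not guaranteed unless it is imposed explicitly,
for example with repartition followed by sortWithinPartitions.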


On Mon, Jan 22, 2018 at 10:41 PM, naresh Goud <> wrote:

> If I understand your requirement correctly, use broadcast variables to
> replicate the small amount of data you want to reuse across all nodes.
> On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch <>
> wrote:
>> This seems like an easy thing to do, but I've been banging my head
>> against the wall for hours trying to get it to work.
>> I'm processing a spark dataframe (in python).  What I want to do is, as
>> I'm processing it I want to hold some data from one record in some local
>> variables in memory, and then use those values later while I'm processing a
>> subsequent record.  But I can't see any way to do this.
>> I tried using:
>> ... and then reading/writing to local variables in the udf function, but
>> I can't get this to work properly.
>> My next guess would be to use dataframe.foreach(a_custom_function) and
>> try to save data to local variables in there, but I have a suspicion that
>> may not work either.
>> What's the correct way to do something like this in Spark?  In Hadoop I
>> would just go ahead and declare local variables, and read and write to them
>> in my map function as I like.  (Although with the knowledge that a) the
>> same map function would get repeatedly called for records with many
>> different keys, and b) there would be many different instances of my code
>> spread across many machines, and so each map function running on an
>> instance would only see a subset of the records.)  But in Spark it seems to
>> be extraordinarily difficult to create local variables that can be read
>> from / written to across different records in the dataframe.
>> Perhaps there's something obvious I'm missing here?  If so, any help
>> would be greatly appreciated!
>> Thanks,
>> DR
