spark-user mailing list archives

From naresh Goud <>
Subject Re: How to hold some data in memory while processing rows in a DataFrame?
Date Tue, 23 Jan 2018 03:41:02 GMT
If I understand your requirement correctly: use a broadcast variable to
replicate the small amount of data you want to reuse across all nodes.

On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch <> wrote:

> This seems like an easy thing to do, but I've been banging my head against
> the wall for hours trying to get it to work.
> I'm processing a Spark dataframe (in Python).  What I want to do is hold
> some data from one record in local variables in memory as I process it, and
> then use those values later while processing a subsequent record.  But I
> can't see any way to do this.
> I tried using:
> ... and then reading/writing to local variables in the udf function, but I
> can't get this to work properly.
> My next guess would be to use dataframe.foreach(a_custom_function) and try
> to save data to local variables in there, but I have a suspicion that may
> not work either.
> What's the correct way to do something like this in Spark?  In Hadoop I
> would just go ahead and declare local variables, and read and write to them
> in my map function as I like.  (Although with the knowledge that a) the
> same map function would get repeatedly called for records with many
> different keys, and b) there would be many different instances of my code
> spread across many machines, and so each map function running on an
> instance would only see a subset of the records.)  But in Spark it seems to
> be extraordinarily difficult to create local variables that can be read
> from / written to across different records in the dataframe.
> Perhaps there's something obvious I'm missing here?  If so, any help would
> be greatly appreciated!
> Thanks,
> DR
