spark-user mailing list archives

From Roger Marin <ro...@rogersmarin.com>
Subject What is the best approach to perform concurrent updates from different jobs to a in memory dataframe registered as a temp table?
Date Mon, 29 Feb 2016 10:42:35 GMT
Hi all,

I have multiple (>100) jobs running concurrently (all sharing the same
HiveContext), each appending new rows to the same DataFrame, which is
registered as a temp table.

Currently I am using unionAll and registering that dataframe again as a
temp table in each job:

Given an existing dataframe registered as the temp table "test":

//Create dataframe with new rows to append
val newRows = hiveContext.createDataFrame(rows, schema)

//Retrieve existing dataframe and append the new dataframe via unionAll
val updatedDF = hiveContext.table("test").unionAll(newRows)

//uncache existing dataframe
hiveContext.uncacheTable("test")

//Register the updated DF as a temp table
updatedDF.registerTempTable("test")

//Cache the updated dataframe
hiveContext.table("test").cache()
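
Put together, the steps above amount to roughly the following helper, which
each job calls with its own rows. This is only a sketch: the object and
method names (TempTableAppend, appendToTest) are made up for illustration,
and it assumes Spark 1.x, that `rows` is an RDD[Row], and that "test" is
already registered as a temp table.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper around the steps listed above (names are illustrative).
object TempTableAppend {

  // Assumption: `rows` is an RDD[Row] matching `schema`, and the temp table
  // "test" already exists in the shared HiveContext.
  def appendToTest(hiveContext: HiveContext,
                   rows: RDD[Row],
                   schema: StructType): Unit = {
    // Create a DataFrame with the new rows to append
    val newRows = hiveContext.createDataFrame(rows, schema)

    // Retrieve the existing DataFrame and append the new rows via unionAll
    val updatedDF = hiveContext.table("test").unionAll(newRows)

    // Uncache the existing DataFrame
    hiveContext.uncacheTable("test")

    // Register the updated DataFrame under the same temp table name
    updatedDF.registerTempTable("test")

    // Mark the updated DataFrame for caching; cache() is lazy, so the data
    // is only materialized by the next action that reads the table
    hiveContext.table("test").cache()
  }
}
```

Every job invokes TempTableAppend.appendToTest(hiveContext, rows, schema)
against the same shared HiveContext, with no coordination between jobs.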

I am finding that this approach depletes memory very quickly, since each
call to ".cache" in each of the jobs creates a new cached entry in memory
for what is logically the same DataFrame.

Does anyone know if there's a better solution to the above?

Thanks,
Roger
