spark-user mailing list archives

From Toby Douglass
Subject initial basic question from new user
Date Thu, 12 Jun 2014 09:24:46 GMT

I am investigating Spark with a view to performing reporting on a large
data set, which receives additional data in the form of log files on an
hourly basis.

Because the data set is large, we may create a range of aggregate tables
to reduce the volume of data which has to be processed.

Having spent a little while reading up about Spark, my thought was that I
could create an RDD which holds an aggregate, persist it to disk, have
reporting queries run against that RDD and, when new data arrives, convert
the new log file into an aggregate and add that to the aggregate RDD.
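
To make the intended workflow concrete, here is a minimal sketch in plain
Python rather than Spark. It uses no Spark API; it only illustrates the
pattern of keeping an aggregate on disk and merging each hourly log into it.
The function names, the JSON file format and the assumption that the first
whitespace-separated field of a log line is the grouping key are all
hypothetical, chosen purely for illustration.

```python
import json
import os
from collections import Counter


def aggregate_log(lines):
    """Reduce raw log lines to per-key counts (one hourly aggregate).

    Assumption: the first whitespace-separated field is the grouping key.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[line.split()[0]] += 1
    return counts


def load_agg(path):
    """Load the stored aggregate, or an empty one if none exists yet."""
    if not os.path.exists(path):
        return Counter()
    with open(path) as f:
        return Counter(json.load(f))


def save_agg(agg, path):
    """Write the aggregate to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(agg, f)
    os.replace(tmp, path)


def ingest_hourly_log(lines, path):
    """Aggregate one new log and add it to the stored aggregate."""
    agg = load_agg(path)
    agg.update(aggregate_log(lines))  # Counter.update adds counts per key
    save_agg(agg, path)
    return agg
```

In this sketch a reporting query would call load_agg and read from the
result, while each hourly arrival would call ingest_hourly_log. The open
question is whether Spark offers an equivalent of that load step for an RDD
which an earlier job has persisted.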

However, I am now getting the impression that RDDs cannot be persisted
across jobs. I can generate an RDD and I can persist it, but I can see no
way for a later job to load a persisted RDD (and I am beginning to think it
will have been garbage-collected anyway at the end of the first job). Is
this correct?
