spark-user mailing list archives

From Toby Douglass <>
Subject Re: initial basic question from new user
Date Thu, 12 Jun 2014 11:35:46 GMT
On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas <> wrote:

> The goal of rdd.persist is to create a cached rdd that breaks the DAG
> lineage. Therefore, computations *in the same job* that use that RDD can
> re-use that intermediate result, but it's not meant to survive between job
> runs.

As I understand it, Spark is designed for interactive querying, in the
sense that the caching of intermediate results eliminates the need to
recompute those results.

However, if intermediate results last only for the duration of a job (e.g. a
Python script), how exactly is interactive querying actually performed?  A
script is not an interactive medium.  Is the shell the only medium for
interactive querying?

Consider a common use case: a web site which offers reporting over a large
data set.  Users issue arbitrary queries.  A few queries (differing only in
their arguments) dominate the query load, so we thought to create intermediate
RDDs to service those queries; only those RDDs, an order of magnitude or more
smaller, would then need to be processed.  Where this is not possible, we can
only use Spark for reporting by issuing each query over the whole data set -
i.e. Spark is just like Impala, just like Presto, just like [nnn].  The
enormous benefit of RDDs - the entire point of Spark, and so profoundly useful
here - is not available.  What a huge and unexpected loss!  Spark seemingly
renders itself ordinary.  It is for this reason I am surprised to find this
functionality is not available.

> If you need to ad-hoc persist to files, you can can save RDDs using
> rdd.saveAsObjectFile(...) [1] and load them afterwards using
> sparkContext.objectFile(...)

I've been using this site for docs;

Here we find, through the top-of-the-page menus, the link "API Docs" ->
"Python API", which brings us to;

This page does not show the function saveAsObjectFile().

I find now from your link here;

What appears to be a second and more complete set of the same
documentation, using a different web-interface to boot.

It appears at least that there are two sets of documentation for the same
APIs, where one set is out of date and the other is not, and the out-of-date
set is the one linked to from the main site?

Given that our aggregate sizes will exceed memory, we expect to cache them to
disk, so save-as-object (assuming there are no out-of-the-ordinary performance
issues) may solve the problem, but I was hoping to store data in a
column-oriented format.  However, I think this in general is not possible -
Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based
RDD format.

> If you want to preserve the RDDs in memory between job runs, you should
> look at the Spark-JobServer [3]

I view this with some trepidation.  It took two man-days to get Spark running
(and I've spent another man-day now trying to get a map/reduce to run; I'm
getting there, but not there yet) - the bring-up/config experience for
end-users is not tested or accurately documented (although, to be clear, it is
no better and no worse than is normal for open source; Spark is not
exceptional).  Having to bring up another open-source project is a significant
barrier to entry; it's always such a headache.

The save-to-disk function you mentioned earlier will allow intermediate
RDDs to go to disk, but we do in fact have a use case where in-memory would
be useful; it might allow us to ditch Cassandra, which would be wonderful,
since it would reduce the system count by one.

I have to say, having to install JobServer to achieve this one end seems an
extraordinarily heavyweight solution - a whole new application, when all that
is wished for is that Spark persist RDDs across jobs; so small a feature seems
to open the door to so much functionality.
