spark-user mailing list archives

From Michael David Pedersen
Subject Re: Efficient filtering on Spark SQL dataframes with ordered keys
Date Mon, 31 Oct 2016 14:16:11 GMT
Hi Mich,

Thank you again for your reply.

> As I see you are caching the table already sorted
> val keyValRDDSorted = keyValRDD.sortByKey().cache
> and the next stage is you are creating multiple tempTables (different
> ranges) that cache a subset of rows already cached in RDD. The data stored
> in tempTable is in Hive columnar format (I assume that means ORC format)

But the thing is that I don't explicitly cache the tempTables, and I don't
really want to because I'll only run a single query on each tempTable. So I
expect the SQL query processor to operate directly on the underlying
key-value RDD, and my concern is that this may be inefficient.
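
To make the setup concrete, here is a minimal sketch of the pattern I am describing. It assumes the Spark 1.6-style SQLContext API and a local master. The sample data, the table name range_2_3, the column names k and v, and the query are hypothetical stand-ins, not my real job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TempTableSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("temp-table-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in for the real key-value data.
    val keyValRDD = sc.parallelize(Seq((3L, "c"), (1L, "a"), (2L, "b"), (4L, "d")))

    // Only the sorted RDD is cached explicitly.
    val keyValRDDSorted = keyValRDD.sortByKey().cache()

    // Select one key range and expose it to SQL. Nothing is cached at this
    // step: the DataFrame is just a plan on top of the cached RDD.
    val rangeDF = keyValRDDSorted
      .filter { case (k, _) => k >= 2L && k <= 3L }
      .toDF("k", "v")
    rangeDF.registerTempTable("range_2_3")

    // Reports whether Spark SQL itself holds this table in its in-memory cache.
    println(sqlContext.isCached("range_2_3"))

    // The single query that is run against this temp table.
    sqlContext.sql("SELECT k, v FROM range_2_3 WHERE v <> 'b'").collect().foreach(println)

    // Remove the temp table from the catalog once the query has completed.
    sqlContext.dropTempTable("range_2_3")
    sc.stop()
  }
}
```

In this sketch the range filter runs on the RDD before the DataFrame is created, so the SQL query only sees rows in the selected range. Whether Spark can avoid scanning the rest of the sorted RDD is exactly the part I am unsure about.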

> Well that is all you can do.

Ok, thanks - that's really what I wanted to get confirmation of.

> Bear in mind that these tempTables are immutable and I do not know any way
> of dropping tempTable to free more memory.

I'm assuming there won't be any (significant) memory overhead of
registering the temp tables as long as I don't explicitly cache them. Am I
wrong? In any case I'll be calling sqlContext.dropTempTable once the query
has completed, which according to the documentation should also free up
