spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael David Pedersen <>
Subject Efficient filtering on Spark SQL dataframes with ordered keys
Date Mon, 31 Oct 2016 10:06:52 GMT

I've got a Spark SQL dataframe containing a "key" column. The queries I
want to run start by filtering on the key range. My question in outline: is
it possible to sort the dataset by key so as to do efficient key range
filters, before subsequently running a more complex SQL query?

I'm aware that such efficient filtering is possible for key-value RDDs,
i.e. RDDs over Tuple2, using PairRDDFunctions. My workflow currently looks
as follows:

// Create a dataframe
val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
val keyValRDD = (r: Row) => (r.getAs[String]("key"), r) )

// Sort by key - and cache.
val keyValRDDSorted = keyValRDD.sortByKey().cache

// Define a function to run SQL query on a range.
def queryRange(lower: String, upper: String, sql: String, tableName:
String) = {
    val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
    val rangeDF = sqlContext.createDataFrame({ _._2 },

// Invoke multiple times.

This works, and is efficient in that only the partitions containing the
relevant key range are processed. However, I understand that Spark SQL uses
an optimised storage format as compared to plain RDDs. The above workflow
can't take advantage of this, as it is the key-value RDD that is cached.

So, my specific question: Is there a more efficient way of achieving the
desired result?

Any pointers would be much appreciated.

Many thanks,

PS: This question was also asked on StackOverflow -

View raw message