spark-dev mailing list archives

From "Evan R. Sparks" <>
Subject Re: Scan Sharing in Spark
Date Tue, 05 May 2015 16:23:42 GMT
Scan sharing can indeed be a useful optimization in Spark, because you
amortize not only the time spent scanning over the data, but also the time
spent in task launch and scheduling overheads.

Here's a trivial example in Scala. I'm not aware of a place in SparkSQL
where this is used - I'd imagine that most development effort is focused
on single-query optimization right now.

//This function takes a sequence of functions of type A => B and returns a
//function of type A => Seq[B], where each item in the result corresponds to
//one of the input functions applied to the argument.
def combineFunctions[A, B](fns: Seq[A => B]): A => Seq[B] = {
  def combf(a: A): Seq[B] = fns.map(f => f(a))
  combf
}

def plusOne(x: Int) = x + 1
def timesFive(x: Int) = x * 5

val sharedF = combineFunctions(Seq[Int => Int](plusOne, timesFive))

val data = sc.parallelize(Array(1,2,3,4,5,6,7))

//Apply this combine function to each of your data elements.
val res = data.map(sharedF).collect()


The result will look something like this.

res5: Array[Seq[Int]] = Array(List(2, 5), List(3, 10), List(4, 15),
List(5, 20), List(6, 25), List(7, 30), List(8, 35))
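The same trick extends beyond map: two aggregation "jobs" can share a single scan by folding a pair of accumulators in one pass. A minimal sketch in plain Scala, where a hypothetical sum and even-count stand in for the two jobs in the question below (on an RDD, `aggregate` with a paired accumulator takes the same shape):

```scala
// One pass over the data computes both results, instead of two scans.
// Accumulator is a pair: (running sum, count of even elements).
val data = Seq(1, 2, 3, 4, 5, 6, 7)

val (sum, evens) = data.foldLeft((0, 0)) { case ((s, e), x) =>
  (s + x, if (x % 2 == 0) e + 1 else e)
}
// sum = 28, evens = 3
```

With an RDD you would express the same idea as `rdd.aggregate((0, 0))(seqOp, combOp)`, paying for one read rather than one per job.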

On Tue, May 5, 2015 at 8:53 AM, Quang-Nhat HOANG-XUAN <
> wrote:

> Hi everyone,
> I have two Spark jobs inside a Spark application, which read from the same
> input file.
> They are executed in 2 threads.
> Right now, I cache the input file into memory before executing these two
> jobs.
> Is there another way to share the same input with only one read?
> I know there is something called Multiple Query Optimization, but I don't
> know whether it is applicable to Spark (or SparkSQL).
> Thank you.
> Quang-Nhat
