spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "assaf.mendelson" <assaf.mendel...@rsa.com>
Subject repeated unioning of dataframes take worse than O(N^2) time
Date Thu, 29 Dec 2016 17:21:23 GMT
Hi,

I have been playing around with doing union between a large number of dataframes and saw that
the performance of the actual union (not the action) is worse than O(N^2). Since a union basically
defines a lineage (i.e. current + union with of other as a child) this should be almost instantaneous,
however in practice this can be very costly.

I was wondering why this is and if there is a way to fix this.

A sample test:
def testUnion(n: Int): Long = {
  val dataframes = for {
    x <- 0 until n
  } yield spark.range(1000)

  val t0 = System.currentTimeMillis()
  val allDF = dataframes.reduceLeft(_.union(_))
  val t1 = System.currentTimeMillis()
  val totalTime = t1 - t0
  println(s"$totalTime miliseconds")
  totalTime
}

scala> testUnion(100)
193 miliseconds
res5: Long = 193

scala> testUnion(200)
759 miliseconds
res1: Long = 759

scala> testUnion(500)
4438 miliseconds
res2: Long = 4438

scala> testUnion(1000)
18441 miliseconds
res6: Long = 18441

scala> testUnion(2000)
88498 miliseconds
res7: Long = 88498

scala> testUnion(5000)
822305 miliseconds
res8: Long = 822305






--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Mime
View raw message