spark-user mailing list archives

From: Kevin Paul <>
Subject: Some of the statistics function in SparkSQL is very slow
Date: Sat, 01 Nov 2014 00:30:50 GMT
Hi all, some of the statistics functions that I tried in HiveContext are
very slow, notably percentile and var_samp. The symptom is the same as the
one I described in my previous email (forwarded below). When I call
schemaRDD.collect on the resulting RDD, the shuffle size is around 1000GB.
Is there anything else I could do to speed this up?

Kevin Paul
---------- Forwarded message ----------
From: Kevin Paul <>
Date: Sat, Oct 25, 2014 at 8:48 PM
Subject: HiveSQL percentile is query slow
To: user <>

Hi all, I tried to run the following SQL command in HiveContext with
my table loaded into memory:
  SELECT percentile(myColumn, array(0.1, 0.5)) FROM myTable

The query took more than 5 minutes to complete, whereas a query like
  SELECT min(myColumn), max(myColumn) FROM myTable
took only around 10 seconds to run.

My Spark version is 1.2.0-SNAPSHOT, the cluster has 10 slaves, the
dataset is 10 GB, and I'm running in yarn-client mode.
The query ran in two stages:
 1. mapPartitions at Exchanged.scala:86, with a duration of 9 s
 2. collect at SparkPlan.scala:85, with a duration of 5.3 min
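
To illustrate the contrast between the two queries, here is a minimal plain-Python sketch (not Spark code, and not Hive's actual percentile implementation). It assumes a simple linear-interpolation percentile; the function names and the sample data are hypothetical. It shows that min/max can be reduced to two numbers per partition before being combined, while an exact percentile needs every value gathered and ordered in one place, which may be one reason the second stage dominates the run time.

```python
import math


def streaming_min_max(partitions):
    """Each partition emits only its own (min, max); the pairs are then combined."""
    partials = [(min(p), max(p)) for p in partitions if p]
    return min(lo for lo, _ in partials), max(hi for _, hi in partials)


def exact_percentiles(partitions, fractions):
    """An exact percentile needs the globally sorted values, so all values are gathered first."""
    values = sorted(v for p in partitions for v in p)
    n = len(values)
    results = []
    for f in fractions:
        # Fractional rank in the sorted list, interpolated between its two neighbours
        # (an assumed definition, chosen only for illustration).
        pos = f * (n - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        results.append(values[lo] + (values[hi] - values[lo]) * (pos - lo))
    return results


# Hypothetical data split across three partitions.
partitions = [[5, 1, 9], [3, 7], [2, 8, 4, 6, 10]]
print(streaming_min_max(partitions))
print(exact_percentiles(partitions, [0.1, 0.5]))
```

In this sketch the min/max path moves two numbers per partition, whereas the percentile path moves every value, so its cost grows with the size of the dataset rather than with the number of partitions.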

I have attached the summary metrics for the collect task.

Kevin Paul
