spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Pirz <>
Subject Repartitioning external table in Spark sql
Date Wed, 19 Aug 2015 05:50:18 GMT
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes.

Using Spark sql and Hive Context, I am trying to run a simple scan query on
an existing Hive table (which is an external table consisting of rows in
text files stored in HDFS - it is NOT parquet, ORC or any other richer

DataFrame res = hiveCtx.sql("SELECT * FROM lineitem WHERE L_LINENUMBER <

What I observe is the performance of this full scan in Spark is not
comparable with Hive (it is almost 4 times slower). Checking the resource
usage, what I see is workers/executors do not do parallel scans but they
scan on a per-node basis (first executors from the worker(s) on node 1 do
reading from disk, while other two nodes are not doing I/O and just receive
data from the first node and through network, then 2nd node does the scan
and then the third one).
I also realized that if I load this data file directly from my spark
context (using textFile() ) and run count() on that (not using spark sql)
then I can get a better performance by increasing number of partitions. I
am just trying to do the same thing (increasing number of partitions in the
beginning) in Spark sql as:

var tab ="lineitem");

but none of repartition() or coalesce() methods actually work - they do not
return an error, but if I check

var p = tab.rdd.partitions.size;

before and after calling any of them, it returns the same number of

I am just wondering how I can change the number of partitions for a Hive
external table, in Spark Sql.

Any help/suggestion would be appreciated.

View raw message