spark-dev mailing list archives

From 163 <>
Subject How to tune the performance of Tpch query5 within Spark
Date Fri, 14 Jul 2017 09:46:48 GMT
I rewrote TPC-H query 5 using the DataFrame API:
val forders = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/orders")
  .where("o_orderdate < '1995-01-01' and o_orderdate >= '1994-01-01'")
  .select("o_custkey", "o_orderkey")
val flineitem = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/lineitem")
val fcustomer = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/customer")
val fsupplier = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/supplier")
val fregion = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/region")
  .where("r_name = 'ASIA'")
  .select($"r_regionkey")
val fnation = spark.read.parquet("hdfs://dell127:20500/SparkParquetDoubleTimestamp100G/nation")
val decrease = udf { (x: Double, y: Double) => x * (1 - y) }
val res = flineitem.join(forders, $"l_orderkey" === forders("o_orderkey"))
  .join(fcustomer, $"o_custkey" === fcustomer("c_custkey"))
  .join(fsupplier, $"l_suppkey" === fsupplier("s_suppkey") && $"c_nationkey" === fsupplier("s_nationkey"))
  .join(fnation, $"s_nationkey" === fnation("n_nationkey"))
  .join(fregion, $"n_regionkey" === fregion("r_regionkey"))
  .select($"n_name", decrease($"l_extendedprice", $"l_discount").as("value"))

My environment is one master (the HDFS NameNode) and four workers (HDFS DataNodes), each with 40 cores
and 128 GB of memory. The TPC-H 100 GB dataset is stored on HDFS in Parquet format.
The query executed in about 1.5 minutes, and I found that the reads of these six tables run sequentially.
How can I make them run in parallel?
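To illustrate the pattern I am after (not Spark-specific code): kicking off the six table loads concurrently with Scala Futures instead of one after another. The load function below is a hypothetical stand-in for whatever per-table work is done (e.g. reading and caching a DataFrame), just to sketch the idea:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelLoad {
  // Hypothetical stand-in for loading one table; in Spark this might be
  // something like spark.read.parquet(path).cache() followed by an action.
  def load(table: String): String = {
    Thread.sleep(100) // simulate I/O latency
    s"loaded $table"
  }

  // Start every load at once and wait for all of them to finish.
  // Future.sequence preserves the input order of the results.
  def loadAll(tables: Seq[String]): Seq[String] =
    Await.result(Future.sequence(tables.map(t => Future(load(t)))), 60.seconds)
}

// Example: the six TPC-H tables loaded concurrently.
// ParallelLoad.loadAll(Seq("orders", "lineitem", "customer", "supplier", "region", "nation"))
```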
I have already set data locality, tuned spark.default.parallelism and spark.serializer, and switched to the G1 garbage collector,
but the runtime has still not gone down.
Is there any other advice for tuning this workload?
Thank you.

Wenting He
