spark-user mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Re: Using HQL is terribly slow: Potential Performance Issue
Date Thu, 10 Jul 2014 21:10:54 GMT
By the way, I also tried hql("select * from m").count. It is terribly slow
too.
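
To make the comparison easier to reproduce, here is a minimal sketch of how
the counts could be timed in the same spark-shell session. The `timed` helper
is hypothetical (it is not part of Spark or of the original benchmark), and the
commented-out calls assume `sc` and `import hiveContext._` are set up as in
the quoted message below.

```scala
// Hypothetical helper (an assumption, not from the original benchmark):
// runs a block once and returns its result plus elapsed wall-clock seconds.
def timed[A](label: String)(block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block // the by-name block is evaluated here, inside the timer
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label%s took $seconds%.1f s")
  (result, seconds)
}

// In spark-shell, with `sc` predefined and `hiveContext._` imported,
// the three jobs from this thread could be wrapped like this:
//   timed("textFile count") {
//     sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
//   }
//   timed("hql count(*)") { hql("select count(*) from m").collect() }
//   timed("hql select *") { hql("select * from m").count }
```

Running all three in one session keeps the cluster and environment the same
across measurements; a single run each still says nothing about variance.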


On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam <chilinglam@gmail.com> wrote:

> Hi Spark users and developers,
>
> I'm running some simple benchmarks with my team, and we found a potential
> performance issue when using Hive via SparkSQL. It is very bothersome, so your
> help in understanding why it is so slow is very important.
>
> First, we have some text files in HDFS which are also managed by Hive as a
> table called "m". There is nothing special about the table name "m".
>
> In pure Spark, I just do the following to get the total number of
> lines in the text files:
>
> scala>
> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
>
> This takes 2.7 minutes.
>
> If I use SparkSQL, I will do this:
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import hiveContext._
> hql("use test")
> hql("select count(*) from m").collect.foreach(println)
>
> This takes 11.9 minutes!
>
> This is about 4.4x slower than using pure Spark.
>
> I wonder if anyone knows what causes the performance issue?
>
> For the curious, the dataset is about 200-300 GB and we are using 10
> machines for this benchmark. Given that the environment is the same for both
> experiments, why is pure Spark faster than SparkSQL?
>
> Best Regards,
>
> Jerry
>
>
>
>
