spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick McCarthy <pmccar...@dstillery.com>
Subject Poor performance reading Hive table made of sequence files
Date Tue, 01 May 2018 20:36:00 GMT
I recently ran a query with the following form:

select a.*, b.*
from some_small_table a
inner join
(
  select things from someother table
  lateral view explode(s) ss as sss
  where a_key is in (x,y,z)
) b
on a.key = b.key
where someothercriterion

On hive, this query took about five minutes. In Spark, using either the
same syntax in a spark.sql call or using the dataframe API, it appeared as
if it was going to take on the order of 10 hours. I didn't let it finish.

The data underlying the hive table are sequence files, ~30mb each, ~1000 to
a partition, and my query ran over only five partitions. A single partition
is about 25gb.

How can Spark perform so badly? Do I need to handle sequence files in a
special way?

Mime
View raw message