spark-issues mailing list archives

From "wuchang (JIRA)" <>
Subject [jira] [Created] (SPARK-19647) Spark query on Hive is extremely slow even when the result data is small
Date Fri, 17 Feb 2017 12:04:41 GMT
wuchang created SPARK-19647:

             Summary: Spark query on Hive is extremely slow even when the result data is small
                 Key: SPARK-19647
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 2.0.2
            Reporter: wuchang
            Priority: Critical

I am using Spark 2.0.0 to query a Hive table.

My SQL is:

select * from app.abtestmsg_v limit 10
That is, I only want the first 10 records from the view app.abtestmsg_v.

When I run this SQL in spark-shell, it is very fast and takes about 2 seconds.

The problem appears when I run the same query from my Python code.

I wrote a very simple PySpark program:

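from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
# sc is assumed to be an existing SparkContext (for example, the one
# provided by the pyspark shell).
hc = HiveContext(sc)
# Set the ORC split strategy for this context.
hc.setConf("hive.exec.orc.split.strategy", "ETL")
# The same query that takes about 2 seconds in spark-shell.
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)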
From the INFO log, I can see that although I use "limit 10" to tell Spark that I only want
the first 10 records, Spark still scans and reads all files of the view (in my case, the
source data of this view consists of 100 files, each about 1 GB in size). So there are
nearly 100 tasks, each task reads one file, and all the tasks are executed serially. It takes
nearly 15 minutes to finish these 100 tasks, but all I want is the first 10 records.
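
To put the numbers above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it treats the approximate figures in this report (100 tasks, about 15 minutes, about 1 GB per file, about 2 seconds in spark-shell) as exact and assumes every task takes the same time. The variable names are made up for illustration.

```python
# Figures taken from the report, treated as exact for this estimate.
num_tasks = 100            # one task per source file
total_minutes = 15         # wall-clock time of the PySpark run
file_size_gb = 1           # approximate size of each source file
spark_shell_seconds = 2    # time of the same query in spark-shell

total_seconds = total_minutes * 60

# Tasks ran one after another, so the average task time is the total
# time divided by the number of tasks.
seconds_per_task = total_seconds / num_tasks

# Total size of the source files behind the view.
total_data_gb = num_tasks * file_size_gb

# How much slower the PySpark run was than the spark-shell run.
slowdown = total_seconds / spark_shell_seconds

print(seconds_per_task, total_data_gb, slowdown)
```

Under these assumptions each task takes about 9 seconds, the source files total about 100 GB, and the PySpark run is about 450 times slower than the spark-shell run.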

So I don't know what to do or what is wrong.

Could anybody give me some suggestions?
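
One way to gather more detail for a question like this is to print the query plan before fetching any rows. The following is a minimal sketch, not a fix. `limit_query` and `show_plan_and_take` are hypothetical helper names, and `session` is assumed to be either the HiveContext from the program above or a SparkSession with Hive support. It relies only on `sql`, `DataFrame.explain` and `DataFrame.take`.

```python
def limit_query(table, n):
    """Build the same kind of LIMIT query used in this report."""
    return "select * from {} limit {}".format(table, int(n))


def show_plan_and_take(session, table, n=10):
    """Print the query plans, then fetch at most n rows to the driver.

    `session` is assumed to expose .sql(), as HiveContext and
    SparkSession both do.
    """
    df = session.sql(limit_query(table, n))
    # Extended explain prints the logical plans as well as the physical plan.
    df.explain(True)
    # take() is an action; it returns a list of at most n Row objects.
    return df.take(n)


# Usage, assuming `hc` is the HiveContext created in the program above:
#   rows = show_plan_and_take(hc, "app.abtestmsg_v", 10)
```

Comparing the printed plan with the plan of the same query in spark-shell may show whether the two runs are planned differently.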
