spark-issues mailing list archives

From "wuchang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-19647) Spark query hive is extremely slow even the result data is small
Date Sun, 19 Feb 2017 05:11:44 GMT

    [ https://issues.apache.org/jira/browse/SPARK-19647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873490#comment-15873490 ]

wuchang commented on SPARK-19647:
---------------------------------

Hi, I don't think this is just a question; it may also be a quite serious bug. But I
am not sure.

> Spark query hive is extremely slow even the result data is small
> ----------------------------------------------------------------
>
>                 Key: SPARK-19647
>                 URL: https://issues.apache.org/jira/browse/SPARK-19647
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: wuchang
>            Priority: Critical
>
> I am using Spark 2.0.0 to query a Hive table.
> My SQL is:
> select * from app.abtestmsg_v limit 10
> Yes, I want to get the first 10 records from the view app.abtestmsg_v.
> When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
> But the problem comes when I try to run the same query from my Python code.
> I am using Spark 2.0.0 and wrote a very simple PySpark program; the code is:
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import *
> import json
> hc = HiveContext(sc)
> hc.setConf("hive.exec.orc.split.strategy", "ETL")
> hc.setConf("hive.security.authorization.enabled",false)
> zj_sql = 'select * from app.abtestmsg_v limit 10'
> zj_df = hc.sql(zj_sql)
> zj_df.collect()
> From the info log, I find that although I use "limit 10" to tell Spark that I just want
> the first 10 records, Spark still scans and reads all files of the view (in my case, the
> source data of this view consists of 100 files, each about 1 GB). So there are nearly
> 100 tasks, each task reads one file, and all the tasks are executed serially. It takes
> nearly 15 minutes to finish these 100 tasks, but all I want is the first 10 records.
> So I don't know what to do or what is wrong.
> Could anybody give me some suggestions?
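
For anyone trying to narrow this down, below is a minimal diagnostic sketch, not a confirmed
fix. It assumes Spark 2.x built with Hive support and reuses the view name from the report.
It switches to the SparkSession entry point, prints the physical plan so you can check
whether a CollectLimit/GlobalLimit sits above the table scan, and tries the "BI" value of
hive.exec.orc.split.strategy for comparison, since the quoted snippet forces "ETL", which
reads every ORC file footer during split generation; whether that is the cause here is an
assumption.

    # Minimal diagnostic sketch; assumes Spark 2.x with Hive support
    # and the view app.abtestmsg_v from the report above.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("limit10-diagnostic")
             .enableHiveSupport()
             .getOrCreate())

    # The quoted snippet forces the "ETL" split strategy, which reads every ORC
    # file footer while planning splits; "BI" skips the footers. Trying it here
    # is an experiment, not a confirmed fix.
    spark.conf.set("hive.exec.orc.split.strategy", "BI")

    df = spark.sql("select * from app.abtestmsg_v limit 10")

    # Print the parsed/analyzed/physical plans: look for CollectLimit or
    # GlobalLimit above the table scan to confirm the limit is part of the plan.
    df.explain(True)

    # show() only fetches the rows it needs to display, whereas collect() pulls
    # every result partition back to the driver.
    df.show(10, truncate=False)

If the plan does show the limit but the job still launches ~100 serial tasks, attaching the
output of df.explain(True) to the ticket would help narrow down whether the time is spent in
split generation or in the scan itself.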



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
