spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: SparkSQL read Hive transactional table
Date Tue, 16 Oct 2018 10:35:12 GMT
Hi,

can I please ask which version of Hive and Spark are you using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily <asos.xu@foxmail.com> wrote:

> Hi,
>
> I use HCatalog Streaming Mutation API to write data to hive transactional
> table, and then, I use SparkSQL to read data from the hive transactional
> table. I get the right result.
> However, SparkSQL uses more time to read hive orc bucket transactional
> table, beacause SparkSQL read all columns(not The columns involved in SQL)
> so it uses more time.
> My question is why that SparkSQL read all columns of hive orc bucket
> transactional table, but not the columns involved in SQL? Is it possible to
> control the SparkSQL read the columns involved in SQL?
>
>
>
> For example:
> Hive Table:
> create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL's stage Input size:
>
> dbtest.t_a1=113.9 GB,
>
> dbtest.t_a2=96.5 MB
>
>
>
> Best regards.
>
>
>
>

Mime
View raw message