spark-user mailing list archives

From "daily" <asos...@foxmail.com>
Subject SparkSQL read Hive transactional table
Date Tue, 16 Oct 2018 01:42:01 GMT
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from that table. I get the correct result. However, SparkSQL takes more time to read a Hive ORC bucketed transactional table, because it reads all columns rather than only the columns involved in the SQL query.

My question is: why does SparkSQL read all columns of a Hive ORC bucketed transactional table instead of only the columns involved in the query? Is it possible to make SparkSQL read only those columns?

For example:

Hive tables:

create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string)
clustered by (t0) into 10 buckets
stored as orc
TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string)
clustered by (t0) into 10 buckets
stored as orc
TBLPROPERTIES ('transactional'='false');
   
SparkSQL:

select sum(t1), sum(t2) from dbtest.t_a1 group by t0;
select sum(t1), sum(t2) from dbtest.t_a2 group by t0;
   
SparkSQL's stage input size:

dbtest.t_a1 = 113.9 GB
dbtest.t_a2 = 96.5 MB
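For reference, these are the stock Spark SQL settings that control the ORC read path on the Spark side; whether they can take effect on a transactional table is exactly what I am unsure about (a sketch, not a confirmed fix):

-- Hedged sketch: standard Spark SQL confs for the ORC read path.
-- spark.sql.hive.convertMetastoreOrc lets Spark replace the Hive ORC
-- SerDe with its own ORC data source, which does prune columns;
-- spark.sql.orc.filterPushdown pushes filters down into the ORC reader.
set spark.sql.hive.convertMetastoreOrc=true;
set spark.sql.orc.filterPushdown=true;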
   
 
   
Best regards.