hive-issues mailing list archives

From Sergio Peña (JIRA) <j...@apache.org>
Subject [jira] [Commented] (HIVE-14270) Write temporary data to HDFS when doing inserts on tables located on S3
Date Thu, 21 Jul 2016 23:26:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388608#comment-15388608
] 

Sergio Peña commented on HIVE-14270:
------------------------------------

I ran several tests using just 1 or 2 rows to validate the running time.
Results can be 1.5x - 2.5x faster.

{noformat}
NO PATCH

** NON-PARTITIONED TABLE
- create table dummy (id int);                                                        3.651s
- insert into table s3dummy values (1);                                              39.231s
- insert overwrite table s3dummy values (1);                                         42.569s
- insert overwrite directory 's3a://spena-bucket/dirs/s3dummy' select * from dummy;  30.136s

** EXTERNAL TABLE
- create table s3dummy_ext like s3dummy location 's3a://spena-bucket/user/hive/warehouse/s3dummy';  9.297s
- insert into table s3dummy_ext values (1);                                          45.855s
{noformat}

{noformat}
PATCH HIVE-14270

** NON-PARTITIONED TABLE
- create table s3dummy (id int) location 's3a://spena-bucket/user/hive/warehouse/s3dummy';  3.945s
- insert into table s3dummy values (1);                                              15.025s
- insert overwrite table s3dummy values (1);                                         25.149s
- insert overwrite directory 's3a://spena-bucket/dirs/s3dummy' select * from dummy;  19.158s
- from dummy insert overwrite table s3dummy select *;                                25.469s
- from dummy insert into table s3dummy select *;                                     14.501s

** EXTERNAL TABLE
- create table s3dummy_ext like s3dummy location 's3a://spena-bucket/user/hive/warehouse/s3dummy';  4.827s
- insert into table s3dummy_ext values (1);                                          16.070s

** PARTITIONED TABLE
- create table s3dummypart (id int) partitioned by (part int)
  location 's3a://spena-bucket/user/hive/warehouse/s3dummypart';                      3.176s
- alter table s3dummypart add partition (part=1);                                     3.229s
- alter table s3dummypart add partition (part=2);                                     3.124s
- insert into table s3dummypart partition (part=1) values (1);                       14.876s
- insert overwrite table s3dummypart partition (part=1) values (1);                  27.594s
- insert overwrite directory 's3a://spena-bucket/dirs/s3dummypart' select * from dummypart;  22.298s
- from dummypart insert overwrite table s3dummypart partition (part=1) select id;    29.001s
- from dummypart insert into table s3dummypart partition (part=1) select id;         14.869s

** DYNAMIC PARTITIONS
- insert into table s3dummypart partition (part) select id, 1 from dummypart;        15.185s
- insert into table s3dummypart partition (part) select id, 1 from dummypart;        18.820s
{noformat}
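For the statements that appear in both runs, the speed-ups implied by the tables above can be computed directly. A small sketch (the pairing of rows is my own reading of the tables):

```python
# Paired timings (seconds) from the runs above: (no patch, with patch)
timings = {
    "insert into s3dummy":        (39.231, 15.025),
    "insert overwrite s3dummy":   (42.569, 25.149),
    "insert overwrite directory": (30.136, 19.158),
    "insert into s3dummy_ext":    (45.855, 16.070),
}

for name, (before, after) in timings.items():
    print(f"{name}: {before / after:.2f}x faster")
```

The ratios come out between roughly 1.6x and 2.9x, consistent with the 1.5x - 2.5x range quoted above.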

Also, I did a quick run with a 430MB Avro table. This does not show a 2x speed-up, but
there is still some gain. In addition, this patch fixes a bug where INSERT INTO did not
append.

{noformat}
NO PATCH
- create table s3student like student location 's3a://spena-bucket/user/hive/warehouse/s3student';  4.316s
- insert into table s3student select * from student;       68.425s
- insert overwrite table s3student select * from student;  70.032s
- insert into table s3student select * from student;       65.429s  # insert into does not append (bug)

PATCH HIVE-14270
- create table s3student like student location 's3a://spena-bucket/user/hive/warehouse/s3student';  3.600s
- insert into table s3student select * from student;       51.964s
- insert overwrite table s3student select * from student;  42.72s
- insert into table s3student select * from student;       73.496s
{noformat}

The above tests are not a rigorous performance benchmark, but they were used to validate
that this patch works as expected.
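If the behavior is exposed through configuration, usage would look something like the following. The property names here are an assumption on my part; check the attached patch for the actual names.

```sql
-- Hypothetical settings illustrating the behavior this patch controls:
-- which URI schemes count as a blobstore, and whether temporary data
-- may still be staged on the blobstore instead of HDFS.
SET hive.blobstore.supported.schemes=s3,s3a,s3n;
SET hive.blobstore.use.blobstore.as.scratchdir=false;
```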

I tried to write unit tests and integration tests, but it is difficult to run S3 tests with
the Hive test code. If anyone has good ideas about how to do this, they are welcome.
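For context, the idea being validated can be sketched as a simple scheme check on the INSERT target path. This is an illustration only; the function and scratch-dir names below are hypothetical, and the real logic lives in the patch's Java code:

```python
from urllib.parse import urlparse

# Assumed set of blobstore schemes for this sketch.
BLOBSTORE_SCHEMES = {"s3", "s3a", "s3n"}

def choose_scratch_dir(target_path: str, hdfs_scratch: str = "hdfs:///tmp/hive") -> str:
    """Return where temporary INSERT data should be written.

    If the target lives on a blobstore, stage intermediate files on the
    default (HDFS) filesystem instead of writing them next to the target.
    """
    if urlparse(target_path).scheme in BLOBSTORE_SCHEMES:
        return hdfs_scratch  # avoid slow S3 writes/reads for temp data
    return target_path + "/.hive-staging"  # non-blobstore: stage alongside target

print(choose_scratch_dir("s3a://spena-bucket/user/hive/warehouse/s3dummy"))
# → hdfs:///tmp/hive
```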

> Write temporary data to HDFS when doing inserts on tables located on S3
> -----------------------------------------------------------------------
>
>                 Key: HIVE-14270
>                 URL: https://issues.apache.org/jira/browse/HIVE-14270
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-14270.1.patch
>
>
> Currently, when doing INSERT statements on tables located on S3, Hive writes and reads
> temporary (or intermediate) files to S3 as well.
> If HDFS is still the default filesystem on Hive, then we can keep such temporary files
> on HDFS to keep things running faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
