hive-issues mailing list archives

From "Steve Loughran (JIRA)" <>
Subject [jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter
Date Wed, 06 Jun 2018 12:27:00 GMT


Steve Loughran commented on HIVE-16295:

* PathOutputCommitterFactory: you can ask for it to be marked limited-private + unstable,
with Hive added to the audience; file that as a MAPREDUCE patch.
* For the other, again ask for limited-private + unstable on the internal commit constant,
so we know to leave it alone; file that one under HADOOP.

bq. For the _SUCCESS file, is it something that is common to all PathOutputCommitter implementations

It's done in the S3A one, not done for FileOutputCommitter. The IBM Stocator committer also
writes a JSON manifest, just a different one (I don't know the details). We explicitly
put a version marker in the one the S3A committer currently uses so as to allow for change;
that is, the deserialization code will fail if the marker is missing or is the wrong version.

FWIW, I do parse the file in my Spark tests. Originally I had my own copy & paste of the
file format; now I just import the S3A one.
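
The version check described above can be sketched roughly as follows. This is a minimal illustration, not the S3A committer's actual deserialization code: the marker field name ("name"), the expected marker value, and the function name are all hypothetical, since the thread does not spell out the manifest's schema.

```python
import json

# Hypothetical marker field and value: the real manifest schema is not
# given in this thread, so both are assumptions for illustration only.
MARKER_FIELD = "name"
EXPECTED_MARKER = "example.SuccessManifest/1"


def load_success_manifest(text, marker_field=MARKER_FIELD, expected=EXPECTED_MARKER):
    """Parse the text of a _SUCCESS file; fail unless the version marker matches."""
    if not text.strip():
        # A zero-byte marker file carries no manifest at all.
        raise ValueError("empty _SUCCESS file: no manifest to parse")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("manifest must be a JSON object")
    marker = data.get(marker_field)
    if marker is None:
        raise ValueError(f"incompatible format: no '{marker_field}' field")
    if marker != expected:
        raise ValueError(f"incompatible format: {marker!r}")
    return data
```

A test can then treat any ValueError as "this _SUCCESS file was not written by the committer version I expect", which is the same fail-fast behaviour the marker is there to provide.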

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>                 Key: HIVE-16295
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, HIVE-16295.3.WIP.patch, HIVE-16295.4.patch, HIVE-16295.5.patch, HIVE-16295.6.patch, HIVE-16295.7.patch
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a {{NullOutputCommitter}} and its own commit logic, spread across {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with S3Guard and does a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not step on each other
