hive-issues mailing list archives

From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14792) AvroSerde reads the remote schema-file at least once per mapper, per table reference.
Date Tue, 19 Dec 2017 04:12:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296170#comment-16296170 ]

Hive QA commented on HIVE-14792:
--------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 43s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  4s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 18s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 16s{color} | {color:red} common: The patch generated 1 new + 930 unchanged - 1 fixed = 931 total (was 931) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 29s{color} | {color:red} ql: The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 11s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 16m  1s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 9efed65 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-8310/yetus/diff-checkstyle-common.txt |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-8310/yetus/diff-checkstyle-ql.txt |
| modules | C: common ql U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8310/yetus.txt |
| Powered by | Apache Yetus    http://yetus.apache.org |


This message was automatically generated.



> AvroSerde reads the remote schema-file at least once per mapper, per table reference.
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-14792
>                 URL: https://issues.apache.org/jira/browse/HIVE-14792
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.1, 2.1.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>              Labels: TODOC2.2, TODOC2.4
>             Fix For: 3.0.0, 2.4.0, 2.2.1
>
>         Attachments: HIVE-14792.1.patch, HIVE-14792.3.patch
>
>
> Avro tables that use "external" schema files stored on HDFS can cause excessive calls to {{FileSystem::open()}}, especially for queries that spawn large numbers of mappers.
> This is because of the following code in {{AvroSerDe::initialize()}}:
> {code:title=AvroSerDe.java|borderStyle=solid}
> public void initialize(Configuration configuration, Properties properties) throws SerDeException {
> // ...
>     if (hasExternalSchema(properties)
>         || columnNameProperty == null || columnNameProperty.isEmpty()
>         || columnTypeProperty == null || columnTypeProperty.isEmpty()) {
>       schema = determineSchemaOrReturnErrorSchema(configuration, properties);
>     } else {
>       // Get column names and sort order
>       columnNames = Arrays.asList(columnNameProperty.split(","));
>       columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>       schema = getSchemaFromCols(properties, columnNames, columnTypes, columnCommentProperty);
>       properties.setProperty(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(), schema.toString());
>     }
> // ...
> }
> {code}
> For tables using {{avro.schema.url}}, every time the SerDe is initialized (i.e. at least once per mapper), the schema file is read remotely. For queries with thousands of mappers, this leads to a stampede to the handful (3?) of datanodes that host the schema-file. In the best case, this causes slowdowns.
> It would be preferable to distribute the Avro-schema to all mappers as part of the job-conf. The alternatives aren't exactly appealing:
> # One can't rely solely on the {{column.list.types}} stored in the Hive metastore. (HIVE-14789).
> # {{avro.schema.literal}} might not always be usable, because of the size-limit on table-parameters. The typical size of the Avro-schema file is between 0.5 and 3 MB, in my limited experience. Bumping the max table-parameter size isn't a great solution.
> If the schema file referenced by {{avro.schema.url}} were read during query-planning, and made available as part of table-properties (but not serialized into the metastore), the downstream logic would remain largely intact. I have a patch that does this.
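
The proposal in the last paragraph of the description can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in the attached patches: the class {{SchemaAtPlanningSketch}}, the methods {{resolveAtPlanning}} and {{schemaForTask}}, the {{reader}} callback, and the example path are all hypothetical, and a counter stands in for {{FileSystem::open()}}. Only the property names {{avro.schema.url}} and {{avro.schema.literal}} come from the issue.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: resolve the external schema once at planning time and
// ship it to tasks in an in-memory copy of the table properties.
class SchemaAtPlanningSketch {
    static final String SCHEMA_URL = "avro.schema.url";
    static final String SCHEMA_LITERAL = "avro.schema.literal";

    // Planning side: read the schema file once and embed its text in a COPY
    // of the table properties. The original properties are left untouched,
    // so nothing extra would be written back to the metastore.
    static Properties resolveAtPlanning(Properties tableProps,
                                        Function<String, String> reader) {
        Properties jobProps = new Properties();
        jobProps.putAll(tableProps);
        String url = tableProps.getProperty(SCHEMA_URL);
        if (url != null && jobProps.getProperty(SCHEMA_LITERAL) == null) {
            jobProps.setProperty(SCHEMA_LITERAL, reader.apply(url));
        }
        return jobProps;
    }

    // Task side: prefer the embedded literal; fall back to a remote read only
    // when no literal was distributed with the job.
    static String schemaForTask(Properties props,
                                Function<String, String> reader) {
        String literal = props.getProperty(SCHEMA_LITERAL);
        return literal != null ? literal : reader.apply(props.getProperty(SCHEMA_URL));
    }

    public static void main(String[] args) {
        AtomicInteger opens = new AtomicInteger();
        // Stand-in for a remote read; the schema text is a placeholder.
        Function<String, String> reader = url -> {
            opens.incrementAndGet();
            return "{\"type\":\"string\"}";
        };

        Properties tableProps = new Properties();
        tableProps.setProperty(SCHEMA_URL, "hdfs:///schemas/example.avsc"); // assumed path

        Properties jobProps = resolveAtPlanning(tableProps, reader);
        for (int mapper = 0; mapper < 1000; mapper++) {
            schemaForTask(jobProps, reader); // one SerDe initialization per mapper
        }
        System.out.println("opens = " + opens.get()); // prints "opens = 1"
    }
}
```

If the tasks were handed {{tableProps}} directly instead of {{jobProps}}, the same loop would invoke {{reader}} 1000 times, which is the per-mapper read pattern the issue describes.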



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
