sqoop-dev mailing list archives

From "Boglarka Egyed (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3151) Sqoop export HDFS file type auto detection can pick wrong type
Date Fri, 10 Mar 2017 14:15:04 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905132#comment-15905132
] 

Boglarka Egyed commented on SQOOP-3151:
---------------------------------------

Relevant code that could be investigated:

{code}
ExportJobBase.java:
 190   private static FileType fromMagicNumber(Path file, Configuration conf) {
 191     // Test target's header to see if it contains magic numbers indicating its
 192     // file type
 193     byte [] header = new byte[3];
 194     FSDataInputStream is = null;
 195     try {
 196       FileSystem fs = file.getFileSystem(conf);
 197       is = fs.open(file);
 198       is.readFully(header);
 199     } catch (IOException ioe) {
 200       // Error reading header or EOF; assume unknown
 201       LOG.warn("IOException checking input file header: " + ioe);
 202       return FileType.UNKNOWN;
 203     } finally {
 204       try {
 205         if (null != is) {
 206           is.close();
 207         }
 208       } catch (IOException ioe) {
 209         // ignore; closing.
 210         LOG.warn("IOException closing input stream: " + ioe + "; ignoring.");
 211       }
 212     }
 213 
 214     if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
 215       return FileType.SEQUENCE_FILE;
 216     }
 217     if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j') {
 218       return FileType.AVRO_DATA_FILE;
 219     }
 220     if (header[0] == 'P' && header[1] == 'A' && header[2] == 'R') {
 221       return FileType.PARQUET_FILE;
 222     }
 223     return FileType.UNKNOWN;
 224   }
{code}

https://git-wip-us.apache.org/repos/asf?p=sqoop.git;a=blob;f=src/java/org/apache/sqoop/mapreduce/ExportJobBase.java;hb=98c5ccb80f8039dd5e1f9451c43443bb01dfd973#l190
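
To make the ambiguity concrete, here is a minimal, self-contained sketch (the class name {{FileTypeSniffDemo}} is hypothetical, not Sqoop code) showing that a plain-text record beginning with "PAR" carries exactly the 3-byte header the check above treats as Parquet:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a text record starting with "PAR" is
// indistinguishable from a Parquet file by a 3-byte prefix check.
public class FileTypeSniffDemo {

    // Mirrors the 3-byte comparison in fromMagicNumber()
    static boolean looksLikeParquet(byte[] header) {
        return header.length >= 3
            && header[0] == 'P' && header[1] == 'A' && header[2] == 'R';
    }

    public static void main(String[] args) {
        // A delimited text export file whose first field value is "PART"
        byte[] textFile = "PART,42\n".getBytes(StandardCharsets.US_ASCII);
        // An actual Parquet file starts with the 4-byte magic "PAR1"
        byte[] parquetFile = "PAR1".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeParquet(textFile));    // true: misdetected
        System.out.println(looksLikeParquet(parquetFile)); // true: correct
    }
}
```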

It should be investigated whether
* the detection code could be changed to avoid these false positives, or
* a new command line option could be introduced to force Sqoop to use a specific file format during export (similar options currently exist only for import)
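
As an illustration of the first option: a Parquet file starts *and* ends with the 4-byte magic "PAR1", so requiring the full magic at both ends of the file would already rule out text records that merely begin with "PAR". A minimal sketch of such a stricter check (the class and method names are hypothetical, not Sqoop API; it uses plain {{java.io}} rather than Hadoop's {{FileSystem}}):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: stricter Parquet detection than a 3-byte prefix.
// Parquet files carry the 4-byte magic "PAR1" at both the head and the
// tail, so checking both ends avoids the "PAR..." text false positive.
public class StrictParquetCheck {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    static boolean isParquet(RandomAccessFile f) throws IOException {
        if (f.length() < 8) {       // too small to hold two magics
            return false;
        }
        byte[] head = new byte[4];
        byte[] tail = new byte[4];
        f.seek(0);
        f.readFully(head);
        f.seek(f.length() - 4);
        f.readFully(tail);
        return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // A text export file whose first field value is "PART"
        File text = File.createTempFile("demo", ".csv");
        try (FileOutputStream out = new FileOutputStream(text)) {
            out.write("PART,42\n".getBytes(StandardCharsets.US_ASCII));
        }
        try (RandomAccessFile raf = new RandomAccessFile(text, "r")) {
            System.out.println(isParquet(raf)); // false: prefix alone no longer matches
        }
        text.delete();
    }
}
```

An equivalent tail check is not possible for SequenceFiles and Avro container files, which only have a head magic, but reading the full 4-byte "Obj"+version and "SEQ"+version headers would likewise narrow the false-positive window.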

> Sqoop export HDFS file type auto detection can pick wrong type
> --------------------------------------------------------------
>
>                 Key: SQOOP-3151
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3151
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Boglarka Egyed
>
> It appears that Sqoop export tries to detect the file format by reading the first 3 bytes of a file. Based on that header, the appropriate file reader is used. However, if the first record happens to begin with one of the header sequences, the wrong reader is chosen, resulting in a misleading error.
> For example, if someone is exporting a table in which one of the field values is "PART", Sqoop sees the letters "PAR" and invokes the Kite SDK, as it assumes the file is in Parquet format. This leads to a misleading error:
> ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://<path>.metadata
> This can be reproduced easily, using Hive as a real world example:
> > create table test (val string);
> > insert into test values ('PAR');
> Then run a sqoop export against the table data:
> $ sqoop export --connect $MYCONN --username $MYUSER --password $MYPWD -m 1 --export-dir /user/hive/warehouse/test --table $MYTABLE
> Sqoop will fail with the following:
> ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://<path>.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://<path>.metadata
> Changing the value from "PAR" to 'Obj' (Avro) or 'SEQ' (SequenceFile) results in similar errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
