sqoop-dev mailing list archives

From "Markus Kemper (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SQOOP-2874) Highlight Sqoop import with --as-parquetfile use cases (Dataset name <NAME> is not alphanumeric (plus '_'))
Date Mon, 07 Mar 2016 13:51:40 GMT
Markus Kemper created SQOOP-2874:
------------------------------------

             Summary: Highlight Sqoop import with --as-parquetfile use cases (Dataset name <NAME> is not alphanumeric (plus '_'))
                 Key: SQOOP-2874
                 URL: https://issues.apache.org/jira/browse/SQOOP-2874
             Project: Sqoop
          Issue Type: Improvement
          Components: docs
            Reporter: Markus Kemper


Hello Sqoop Community,

Would it be possible to request some documentation enhancements?

The ask here is to proactively raise awareness and improve the user experience for a
few specific use cases [1] in which some Sqoop options accept only a restricted set of
characters when import is used with --as-parquetfile.

My understanding is Sqoop1 currently relies on Kite Datasets to write Parquet files.  From
the Kite documentation [3] we see that to ensure compatibility (with Hive, etc.), Kite imposes
some restrictions on Names and Namespaces which bubble up in Sqoop.

The following Sqoop use cases, when using import with --as-parquetfile, result in the error
[2] below.  Full test cases for each scenario are attached.  If enhancing the Sqoop
documentation for these use cases is an option, I am happy to provide proposed changes.
Let me know.

[1] Use Cases:
1. sqoop import --as-parquetfile + --target-dir /<path>/<rdbms_database>.<table>
1.1. The '.' is not allowed
2. sqoop import --as-parquetfile + --table <rdbms_database>.<table> + (no --target-dir)
2.1. The '.' is not allowed; this is essentially the same as (1)
3. sqoop import --as-parquetfile + --hive-import --table <hive_database>.<table>
3.1. The proper usage is --hive-database together with --hive-table; however, with
--as-textfile, --hive-table also works with <hive_database>.<table>
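
To illustrate use case 3, here is a minimal Python sketch of how a wrapper script might split a qualified name into the two Hive options before calling Sqoop. The helper name hive_table_args is hypothetical and is not part of Sqoop or of the attached test cases; only the --hive-database and --hive-table options come from the text above.

```python
def hive_table_args(qualified_name):
    """Turn '<hive_database>.<table>' into separate Sqoop Hive options.

    Hypothetical helper for illustration only; it just builds an argument
    list and does not invoke Sqoop.
    """
    database, sep, table = qualified_name.rpartition(".")
    if not sep:
        # No '.' present, so only a table name was supplied.
        return ["--hive-table", qualified_name]
    if not database or not table:
        raise ValueError("expected <hive_database>.<table>, got %r" % qualified_name)
    return ["--hive-database", database, "--hive-table", table]


if __name__ == "__main__":
    print(hive_table_args("DATABASE.TABLE"))
```

The resulting list could be appended to the rest of a sqoop import --as-parquetfile --hive-import command line, so that no '.' reaches the Kite dataset name.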

[2] Kite Error:
16/03/06 08:45:56 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
	at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
	at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:105)
	at org.kitesdk.data.spi.Compatibility.check(Compatibility.java:68)
	at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:209)
	at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
	at org.kitesdk.data.Datasets.create(Datasets.java:239)
	at org.kitesdk.data.Datasets.create(Datasets.java:307)
	at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:141)
	at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:119)
	at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:130)
	at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
	at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
	at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:444)
	at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
	at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
	at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
	at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
	at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

[3] Kite Documentation:
http://kitesdk.org/docs/1.0.0/introduction-to-datasets.html
Names and Namespaces
URIs also define a name and namespace for your dataset. Kite uses these values when the underlying
system has the same concept (for example, Hive). The name and namespace are typically the
last two values in a URI. For example, if you create a dataset using the URI dataset:hive:fact_tables/ratings,
Kite stores a Hive table ratings in the fact_tables Hive database. If you create a dataset
using the URI dataset:hdfs:/user/cloudera/fact_tables/ratings, Kite stores an HDFS dataset
named ratings in the fact_tables namespace.  To ensure compatibility with Hive and other underlying
systems, names and namespaces in URIs must be made of alphanumeric or underscore (_) characters
and cannot start with a number.
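
The naming rule quoted above can be checked before running an import. The following Python sketch is based only on the wording of the Kite documentation, not on Kite's actual implementation. The function names are hypothetical. Two things are assumptions: that "alphanumeric" means ASCII letters and digits, and that the last component of --target-dir becomes the dataset name (as use case 1 suggests).

```python
import re

# Assumption: ASCII letters, digits and '_' only; must not start with a digit.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_kite_compatible_name(name):
    """Return True if name satisfies the rule quoted from the Kite docs."""
    return _NAME_PATTERN.fullmatch(name) is not None


def dataset_name_from_target_dir(target_dir):
    """Return the last path component of a --target-dir value.

    Assumption: this component is what ends up as the dataset name.
    """
    return target_dir.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    for path in ("/data/DATABASE.TABLE", "/data/DATABASE_TABLE"):
        name = dataset_name_from_target_dir(path)
        print(path, is_kite_compatible_name(name))
```

A pre-flight check like this would flag DATABASE.TABLE from the error in [2], while accepting an underscore-separated name such as DATABASE_TABLE.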

Thanks, Markus



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
