sqoop-user mailing list archives

From Krishnan K <kkrishna...@gmail.com>
Subject Re: Incremental import fail to hive from PostgreSQL
Date Mon, 16 Apr 2012 03:49:36 GMT
Hi Roshan,

Yes, because Sqoop first pulls the data into HDFS before pushing it
into the Hive table.
The "FileAlreadyExistsException" occurs at the HDFS level, not at the
Hive end.

You may want to write a shell script that --
1) first deletes the existing directory in HDFS and then
2) executes the Sqoop command (see the sketch below).

You can now run the shell script to perform your import operation from
PostgreSQL to Hive.
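For example, here is a minimal wrapper script (a sketch only -- the
directory name "users" and the Sqoop options are taken from your earlier
mail, so adjust them to your setup):

#!/bin/sh
# Remove the intermediate HDFS directory left over from the previous run
# (the error can be ignored if the directory does not exist yet).
hadoop dfs -rmr users

# Re-run the incremental import into Hive.
/app/sqoop/bin/sqoop import --hive-table users \
  --connect jdbc:postgresql://<server_url>/<database> \
  --table users --username XXXXXXX --password YYYYYY \
  --hive-home /app/hive --hive-import \
  --incremental lastmodified --check-column lastmodifiedtime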

Regards,
Krishnan


On Mon, Apr 16, 2012 at 6:39 AM, Roshan Pradeep <codevally@gmail.com> wrote:

> Hi
>
> Thanks for the quick reply.
>
> Regarding my incremental import, do I need to remove the directory every
> time I want to re-import the updated data from PostgreSQL?
>
> I ask because I am working with data several GB in size and want to
> confirm this.
>
> Thanks.
>
>
> On Mon, Apr 16, 2012 at 9:48 AM, Krishnan K <kkrishnan88@gmail.com> wrote:
>
>> Hi Roshan,
>>
>> Even if the Sqoop command fails after you run it once, it still creates
>> an output directory in HDFS.
>> You can delete this folder (users) and then try running the Sqoop
>> command once again --
>>
>> run *hadoop dfs -rmr users* and then execute the import command again.
>>
>> Sqoop first imports the data from PostgreSQL into HDFS and
>> then moves it into the Hive default warehouse directory
>> (/user/hive/warehouse/<tablename>).
>>
>> For Sqoop to stage the data in HDFS first, you must ensure that the
>> target directory does not already exist.
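>>
>> For example (a small check-then-remove sequence; this assumes the table
>> directory was created under your HDFS home directory):
>>
>> *hadoop dfs -ls users*    (check whether the directory exists)
>> *hadoop dfs -rmr users*   (remove it before re-running the import)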
>>
>> -Krishnan
>>
>>
>> On Mon, Apr 16, 2012 at 4:49 AM, Roshan Pradeep <codevally@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> I want to import the updated data from my source (PostgreSQL) to Hive
>>> based on a column (lastmodifiedtime) in PostgreSQL.
>>>
>>> *The command I am using*
>>>
>>> /app/sqoop/bin/sqoop import --hive-table users --connect
>>> jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX
>>> --password YYYYYY --hive-home /app/hive --hive-import --incremental
>>> lastmodified --check-column lastmodifiedtime
>>>
>>> With the above command, I am getting the below error
>>>
>>> 12/04/13 16:31:21 INFO orm.CompilationManager: Writing jar file:
>>> /tmp/sqoop-root/compile/11ce8600a5656ed49e631a260c387692/users.jar
>>> 12/04/13 16:31:21 INFO tool.ImportTool: Incremental import based on
>>> column "lastmodifiedtime"
>>> 12/04/13 16:31:21 INFO tool.ImportTool: Upper bound value: '2012-04-13
>>> 16:31:21.865429'
>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: It looks like you are
>>> importing from postgresql.
>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: This transfer can be
>>> faster! Use the --direct
>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: option to exercise a
>>> postgresql-specific fast path.
>>> 12/04/13 16:31:21 INFO mapreduce.ImportJobBase: Beginning import of users
>>> 12/04/13 16:31:23 ERROR tool.ImportTool: Encountered IOException running
>>> import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output
>>> directory users already exists
>>>         at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
>>>         at
>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>         at
>>> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>         at
>>> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
>>>         at
>>> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:201)
>>>         at
>>> org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:413)
>>>         at
>>> org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:102)
>>>         at
>>> org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
>>>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
>>>         at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
>>>         at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
>>>         at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
>>>
>>> According to the above stack trace, Sqoop identifies the updated data
>>> from PostgreSQL, but it says the output directory already exists. Could
>>> someone please help me correct this issue?
>>>
>>> *I am using *
>>>
>>> Hadoop - 0.20.2
>>> Hive - 0.8.1
>>> Sqoop - 1.4.1-incubating
>>>
>>>
>>> Thanks.
>>>
>>
>>
>
