sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Venkat Ranganathan <vranganat...@hortonworks.com>
Subject Re: [jira] [Commented] (SQOOP-1293) --hive-import causes --target-dir and --warehouse-dir to not be respected, nor --delete-target-dir
Date Fri, 28 Mar 2014 16:49:11 GMT
+ User mailing list - moving dev to BCC

When you use hcatalog import, there is no intermediary step.  The data
is directly written to the target storage format.   So, there are no
two steps.   And the default hcatalog-table format is RCFile, but can
be changed.

It looks like your use case requires the data to be available even
after Hive tables are created.   My only suggestion in that case will
be to copy the files after sqoop import and do the load table hive
command separately

Venkat


On Fri, Mar 28, 2014 at 5:06 AM, Hari Sekhon (JIRA) <jira@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/SQOOP-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950615#comment-13950615
]
>
> Hari Sekhon commented on SQOOP-1293:
> ------------------------------------
>
> Yes the staging and then moving under warehouse is as I thought, thanks for confirming
that.
>
> I think it would be an improvement to make those switches respected in terms of retaining
the data in the desired location, basically just skipping the move under warehouse action,
or make the behaviour clearer with warnings that those switches won't be respected.
>
> Can you please clarify that last statement, I did an import via HCatalog using the switches
> {noformat}--hcatalog-create-table --hcatalog-table counterparty{noformat}
> but the result is pretty much the same, fully managed hive table inside warehouse dir,
except for some reason it generated the files as RCfiles, which was not intended.
>
> How about adding an --hive-external-table option to skip the table move under warehouse
after import and mark the table external?
>
> That switch could enable a behaviour to allow the switches for directory and delete to
be respected.
>
> Right now I've gone back to a sqoop import for data and then a separate sqoop create-hive-table
for the metadata and then an alter table on the location metadata location property which
is what we used to do at my last company.
>
> Thanks
>
> Hari
>
>> --hive-import causes --target-dir and --warehouse-dir to not be respected, nor --delete-target-dir
>> --------------------------------------------------------------------------------------------------
>>
>>                 Key: SQOOP-1293
>>                 URL: https://issues.apache.org/jira/browse/SQOOP-1293
>>             Project: Sqoop
>>          Issue Type: Bug
>>          Components: connectors/sqlserver, hive-integration
>>    Affects Versions: 1.4.4
>>         Environment: Hortonworks Sandbox 2.0
>>            Reporter: Hari Sekhon
>>
>> Hi,
>> I'm importing a table from SQL Server 2012 and am using --hive-import to create the
metadata automatically, but am finding that it causes --target-dir and --warehouse-dir to
not be respected, nor --delete-target-dir.
>> sqoop import --connect "jdbc:sqlserver://x.x.x.x:1533;database=MyDatabase" --username
omitted --password omitted --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --table "cube.DimCounterParty"
--split-by CounterpartyKey --hive-import --target-dir /MyDatabase/CounterParty --delete-target-dir
>> (fyi I'm using --driver to work around bug SQOOP-1292)
>> So I tried --warehouse-dir in case it needed that instead of --target-dir
>> sqoop import --connect "jdbc:sqlserver://x.x.x.x:1533;database=MyDatabase" --username
omitted --password omitted --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --table "cube.DimCounterParty"
--split-by CounterpartyKey --hive-import --warehouse-dir /MyDatabase/CounterParty --delete-target-dir
>> but in both cases it ingested the data to /apps/hive/warehouse/cube.db/dimcounterparty.
>> What's also strange is that it created the directory specified for --warehouse-dir
but then didn't appear to place the data in it.
>> I wanted to use --delete-target-dir to replace the whole table each time for this
test since the source table is only ~650,000 rows and 185MB.
>> What I've found is that on top of ingesting in to /apps/hive/warehouse/cube.db/dimcounterparty
by disregarding --delete-target-dir it is causing the table volume to grow cumulatively for
each run, such that after a few runs the {noformat}select count(*){noformat} on the table
now shows 5,546,661 rows instead of 650,000.
>> Here is the the hive warehouse directory on HDFS where you can see the accumulation
of the data:
>> {noformat}
>>  hadoop fs -ls /apps/hive/warehouse/cube.db/dimcounterparty/
>> Found 40 items
>> -rw-r--r--   3 root hdfs          0 2014-03-07 08:44 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS
>> -rw-r--r--   3 root hdfs          0 2014-03-07 09:10 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_1
>> -rw-r--r--   3 root hdfs          0 2014-03-07 09:33 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_2
>> -rw-r--r--   3 root hdfs          0 2014-03-07 09:37 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_3
>> -rw-r--r--   3 root hdfs          0 2014-03-07 09:42 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_4
>> -rw-r--r--   3 root hdfs          0 2014-03-07 10:04 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_5
>> -rw-r--r--   3 root hdfs          0 2014-03-07 10:14 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_6
>> -rw-r--r--   3 root hdfs          0 2014-03-07 10:16 /apps/hive/warehouse/cube.db/dimcounterparty/_SUCCESS_copy_7
>> -rw-r--r--   3 root hdfs   49044407 2014-03-07 08:44 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000
>> -rw-r--r--   3 root hdfs   49045389 2014-03-07 09:10 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_1
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 09:33 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_2
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 09:37 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_3
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 09:41 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_4
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 10:04 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_5
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 10:14 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_6
>> -rw-r--r--   3 root hdfs   49045944 2014-03-07 10:15 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00000_copy_7
>> -rw-r--r--   3 root hdfs   52363518 2014-03-07 08:44 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001
>> -rw-r--r--   3 root hdfs   52363912 2014-03-07 09:10 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_1
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 09:33 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_2
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 09:37 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_3
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 09:41 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_4
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 10:03 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_5
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 10:14 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_6
>> -rw-r--r--   3 root hdfs   52364256 2014-03-07 10:15 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00001_copy_7
>> -rw-r--r--   3 root hdfs   51796051 2014-03-07 08:44 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002
>> -rw-r--r--   3 root hdfs   51796027 2014-03-07 09:10 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_1
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 09:33 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_2
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 09:37 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_3
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 09:41 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_4
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 10:03 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_5
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 10:14 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_6
>> -rw-r--r--   3 root hdfs   51796623 2014-03-07 10:15 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00002_copy_7
>> -rw-r--r--   3 root hdfs   45445570 2014-03-07 08:44 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003
>> -rw-r--r--   3 root hdfs   45445544 2014-03-07 09:10 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_1
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 09:33 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_2
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 09:37 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_3
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 09:42 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_4
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 10:04 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_5
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 10:14 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_6
>> -rw-r--r--   3 root hdfs   45445719 2014-03-07 10:16 /apps/hive/warehouse/cube.db/dimcounterparty/part-m-00003_copy_7
>> {noformat}
>> Is this a bug that it doesn't respect --target-dir or at least --warehouse-dir?
>> This highlights another issue that this should be more intuitive and/or {noformat}sqoop
import --help{noformat} should make it easier to see what options are (not) compatible, or
it should specify in the output at job initiation time where switches will be disregarded,
such as it does when using {noformat}--hive-<option>{noformat} without {noformat}--hive-import{noformat}
>> In my last place I recall using sqoop create-hive-table to generate the metadata
after import and then editing the table location metadata. It would be a lot better if we
could fix the behaviour of --hive-import to not require such a multi-step workaround.
>> Thanks
>> Hari Sekhon
>> http://www.linkedin.com/in/harisekhon
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Mime
View raw message