sqoop-dev mailing list archives

From "Veena Basavaraj (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SQOOP-1168) Sqoop2: Incremental Import
Date Wed, 19 Nov 2014 15:44:34 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218035#comment-14218035
] 

Veena Basavaraj edited comment on SQOOP-1168 at 11/19/14 3:44 PM:
------------------------------------------------------------------

[~vinothchandar] 

Some thoughts; I will have a design wiki up sometime by the end of this week.

last_primary_key (referred to as append mode in Sqoop1) and last_modified will
both be supported. The latter is more work when writing to an "HDFS"-like data source,
since we have to scan all the records that have been written before and then modify them. The
former is as simple as it gets and more performant.

The latter is probably more useful than the former since I am assuming most of the use cases
will have mutable FROM data sources and it is wise to update any modified record incrementally.

Second, on how we provide incremental reading from the FROM source:

We can specify the incremental attributes and the type (since_primary_key, since_last_modified)
in the {code}FromJobConfiguration{code}.
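
To make the configuration idea concrete, here is a minimal sketch of what such attributes could look like. Every name in it (FromIncrementalConfig, IncrementalType, checkColumn, predicate) is hypothetical and not existing Sqoop2 API; the check column is an assumption loosely modelled on the --check-column option in Sqoop1.

```java
// Hypothetical sketch only; none of these names are Sqoop2 API.
enum IncrementalType { SINCE_PRIMARY_KEY, SINCE_LAST_MODIFIED }

final class FromIncrementalConfig {
    final boolean incremental;
    final IncrementalType type;   // only meaningful when incremental is true
    final String checkColumn;     // assumed: column holding the key or the timestamp

    FromIncrementalConfig(boolean incremental, IncrementalType type, String checkColumn) {
        if (incremental && (type == null || checkColumn == null || checkColumn.isEmpty())) {
            throw new IllegalArgumentException("incremental jobs need a type and a check column");
        }
        this.incremental = incremental;
        this.type = type;
        this.checkColumn = checkColumn;
    }

    // Predicate template the FROM side could add to its query.
    // The marker value is left as a bind parameter instead of being concatenated.
    String predicate() {
        return incremental ? checkColumn + " > ?" : "1 = 1";
    }
}
```

In this sketch both types reduce to the same "greater than the stored marker" predicate; they would differ only in which column is checked and in how the TO side has to write the result.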

As far as storing state:

We already store state across runs, to some extent, in the submissions table in the
Sqoop Repository. So it should be fairly easy to extend that to store this "last" or "since"
marker. We could also support more complex markers in the future, so a marker could even be
a query that scans for only certain records in that run.
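
A rough sketch of the marker bookkeeping, assuming a per-job marker kept next to the submission data. IncrementalStateStore and its methods are hypothetical names, and the in-memory map only stands in for a repository table. The marker is an opaque string here so that it could hold a key, a timestamp, or later a query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a repository table keyed by job id; not Sqoop2 API.
final class IncrementalStateStore {
    private final Map<Long, String> lastMarkerByJob = new HashMap<>();

    // Empty on the first run of a job, which would then mean a full import.
    Optional<String> lastMarker(long jobId) {
        return Optional.ofNullable(lastMarkerByJob.get(jobId));
    }

    // Intended to be called only after a run succeeds, so that a failed run
    // is retried from the previous marker.
    void recordSuccess(long jobId, String newMarker) {
        lastMarkerByJob.put(jobId, newMarker);
    }
}
```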

I do think the FromStateObject/ToState is pretty neat to have in the repo as well, so that
we have more visibility into what went on in each run. Submission today represents the end
result of the sqoop job and is geared more towards the execution engine stats. But we churn
out more details of the From/To state objects.

Third, the writing part (appending, or random scans for each modified entry) should probably be
pluggable, even though Sqoop should handle most of the work itself and probably provide
an API/callback for things that can be custom.
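
A minimal sketch of what a pluggable write step could look like, with one implementation per mode. IncrementalWriter, AppendWriter, MergeWriter and Row are hypothetical names, and a plain in-memory list stands in for the TO data source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names are Sqoop2 API.
final class Row {
    final long id;
    final String value;
    Row(long id, String value) { this.id = id; this.value = value; }
}

// The callback a connector could implement to customise the write step.
interface IncrementalWriter {
    void write(List<Row> target, List<Row> delta);
}

// Append mode: new rows are added without looking at what was written before.
final class AppendWriter implements IncrementalWriter {
    public void write(List<Row> target, List<Row> delta) {
        target.addAll(delta);
    }
}

// last_modified mode: previously written rows are scanned and replaced in place.
final class MergeWriter implements IncrementalWriter {
    public void write(List<Row> target, List<Row> delta) {
        for (Row changed : delta) {
            boolean replaced = false;
            for (int i = 0; i < target.size(); i++) {
                if (target.get(i).id == changed.id) {
                    target.set(i, changed);
                    replaced = true;
                    break;
                }
            }
            if (!replaced) {
                target.add(changed);
            }
        }
    }
}
```

The inner scan in MergeWriter is the extra cost that the last_modified mode carries on an append-only store, while AppendWriter never reads the target at all.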


> Sqoop2: Incremental Import
> --------------------------
>
>                 Key: SQOOP-1168
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1168
>             Project: Sqoop
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> Initial plan is to follow roughly the same design as Sqoop 1, except provide pluggability
to start this through a REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
