sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Veena Basavaraj (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-1804) Repository Structure + API: Storing/Retrieving the From/To state of the incremental read/ write
Date Tue, 20 Jan 2015 16:46:35 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283997#comment-14283997

Veena Basavaraj commented on SQOOP-1804:

Summary from the above discussions, broadly we had two proposals

*Proposal 1*
Connector has Config object and Config object has inputs. All these inputs are editable by
the user or connector. But in some cases some of these variable may be just created and edited
by connector ( temp state variables). In this proposal there is a possibility that the user
's initial input is lost and the connector overrides it with another value. 

Say we have 3 variables the connector exposes in the config
1. PrimaryKey
2. lastValue
3. TempDirectory

Initially user will create a job with the config object

1. PrimaryKey -"id" - STRING
2. lastValue- "1" - LONG
3. TempDirectory -   STRING ( Editable by the connector only)
and then the connector will run the job and at then end of the run we have the config object
looking like this

1. PrimaryKey -"id"
2. lastValue- "45"
3. TempDirectory - "/tmp/foo/bar 

We then took some use cases, where there may not be a one-one mapping between the initial
inputs user enters and connectors will create. Here is an example related to Kafka
Say we have one variable in kafka connector 

1. Offset - earliest/latest - ENUM
2. InternalOffset -  - MAP/LIST/ STRING

The user then chooses the value for offset not worrying about partitions etc and the connector
stores additional information it can use for the next run in  internaloffset as LIST/MAP with
partition/offset info. So in this case, connector never modified  the original input, but
it created new input that will be more useful for it in the next delta run. We may also want
to expose this to be edited by the user next time. 

We cannot be sure that all connectors will have one-one correspondence to a inputs users gives
and the state variables the connectors create to prepare itself optimal next run. In some
cases the types will be very different, the user may input a ENUM or a LONG, but the next
job run might need more info like a LIST or MAP to optimally run the job for delta records.
This was pretty much the proposal 2

*Proposal 2*
Connectors can have a config objects have params some are  Inputs and some are States, they
are all declared in the same config object so that the connector declares upfront what it
is working with, Inputs can be annotated, States as well can be annotated to provide more
information to the user and the connector. There are two main parties in this design, USER
and CONNECTOR DEVELOPER/CONNECTOR and they both together control the job inputs.
One way to express this was given in the earlier discussion in the same JIRA ( Names are subject
to change)
public class DeltaFetchConfig1 {
  @Input(size = 255) public String column;
  // validate supported operators, can provide a default value etc...
  @Input public String operator;
// if we edit this, override last_value
  @Input(overrides="last_value) public String value;
// relatedTo signifies to the connector that last_value relates to value
  @State(relatedTo="value") public String last_value; 

Starting from scratch always seems to help I guess, so restating the 2 proposals with examples
it became clear that what we all wanted was a flexible design to state what inputs are editable
by whom ( whether they are editable by user only, connector only, and or by both ), we wanted
to support all 3, but leave it to the connector to annotate each input properly ( Now it is
an implementation detail, if we want to use attributes on the @Input annotation or use another
annotation like @State). It was also pretty clear that all this data now will be now stored
in SQ_INPUT. and its values will be in SQ_JOB_INPUT. We may need to add more columns if we
add more attributes, but that is again an implementation detail. We may have to add new TYPES
to the ENUM - INPUT_TYPE or we may have to add a new class STATE and a new table STATE if
we add a new annotation like @State. But all this is up for discussion from an implementation

But when a user says show job, he/she will see all this information in one place. When the
user says edit job, he/she can edit the editable inputs in one place and look at the non-editable
inputs in one place.

Here are the following top things we will tackle .

1. Clearly define the semantics of the annotations/ attributes to support USER/ CONNECTOR/
BOTH editable.
2. We have define a way to define relationship between the inputs, say Input1 depends on/relatesTo/
overrides Input2, this gives connectors more control and ease to choose the correct value
when running subsequent job runs
3. We should be careful about inputs that both user and connector can edit, since the ordering
of edits might lead to some race conditions since we do not maintain history yet, so add proper
validation on the relationships between the attribute values across inputs.
4. Ability to see all the configs for jobs and then all the inputs and its attributes they
have - a command line tool commands
5. Finally if would be nice to start storing history when we make config edits, i.e not just
have last snapshot of the configs, but somehow we can map the configs used for every job run.
( This is covered in SQOOP-2025)

For the #1-#4, here are the details of the implementation we propose


We will keep the @Input annotation for everything,  and add "editable" attribute on the Input
annotation. The options for the editable will be a ENUM( USER_ONLY, CONNECTOR_ONLY, ANY, os
that in future we have more it can include them too, not just BOTH)

@Input ( editable =USER_ONLY) String primaryKey;
@Input ( editable =CONNECTOR_ONLY) String tempDirector;
@Input ( editable =ANY) String lastValue;

#2. For relationships, we will add a new attribute overrides, ( that can be 1-many at this
point), i,e a input can override values of one or more inputs. 
Implementation: We will add a new table to store this 1-M relationship. We will at the minimum
add some checks to ensure there is cycles, and ensure such complex relationships between config
atttributes wont be allowed. We also will allow this for intra config only and for attributes
across config objects. If we need to define relationships between attributes, it is best to
group them one config object.

#3. Since we want to control the relationships between inputs, and check for cycles and make
sure it does not manifest itself as confusing to the users we will do some validations.
There might some edge cases here, that I will think through more in implementation, for now
these are validation rules

USER_ONLY                    ( not itself!! ) other  CONNECTOR_ONLY and ANY inputs, dont allow
overriding other USER_ONLY attributes
CONNECTOR_ONLY      can override anything, it can override other CONNECTOR_ONLY inputs possible,
so lets allow everything, 
ANY                                  can override anything

> Repository Structure + API: Storing/Retrieving the From/To state of the incremental read/
> -----------------------------------------------------------------------------------------------
>                 Key: SQOOP-1804
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1804
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
> Details of this proposal are in the wiki.
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+And+Merge+Design#DeltaFetchAndMergeDesign-Wheretostoretheoutputinsqoop?
> Update: The above highlights the pros and cons of each approach. 
> #4 is chosen, since it is less intrusive, more clean and allows U/Edit per value in the
output easily.
> Will use this ticket for more detailed discussion on storage options for the output from
> 1. 
> {code}
> // will have FK to submission
BY 1), "
>      + COLUMN_SQ_JOB_OUT_KEY + " VARCHAR(32), "
>      + COLUMN_SQD_ID + " VARCHAR(32)," // FK to the direction table, since this allows
to distinguish output from FROM/ TO part of the job
> {code}
> 2.
> At the code level, we will define  MOutputType, one of the types can be BLOB as well,
if a connector decides to store the value as a BLOB
> {code}
> class JobOutput {
> String key;
> Object value;
> MOutputType type;
> }
> {code}
> 3. 
> At the repository API, add a new API to get job output for a particular submission Id
and allow updates on values. 

This message was sent by Atlassian JIRA

View raw message