sqoop-dev mailing list archives

From "Jarek Jarcec Cecho (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-1803) JobManager and Execution Engine changes: Support for a injecting and pulling out configs and job output in connectors
Date Mon, 16 Mar 2015 14:37:38 GMT

[ https://issues.apache.org/jira/browse/SQOOP-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363273#comment-14363273 ]

Jarek Jarcec Cecho commented on SQOOP-1803:
-------------------------------------------

Thank you for putting it together, [~vybs].

Indeed, the current {{MutableContext}} serializes all the data as Strings, but that is just an internal detail modeled on what Hadoop's {{Configuration}} has been doing. We still expose {{setBoolean}}, {{setInt}}, ... methods and their {{getType}} alternatives, so a connector developer can store any type in the {{Context}}. It is, however, their responsibility to remember what type has been stored there (e.g. we do not persist the information that property "X" has been saved as a long). The {{MutableContext}} is not persisted in our repository and is meant more as a transient store specific to a given submission; I believe that the context is fully lost after the submission ends.
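
To make the type-recall point concrete, here is a minimal sketch - my own stand-in, not the actual Sqoop 2 interface; the class and method names are illustrative only:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a string-backed context: the typed setters/getters
// are only a convenience layer over String storage, so nothing records
// which type a property was saved as -- the caller has to remember.
public class StringBackedContext {
  private final Map<String, String> store = new HashMap<String, String>();

  public void setLong(String key, long value) {
    store.put(key, Long.toString(value)); // stored internally as a String
  }

  public long getLong(String key, long defaultValue) {
    String raw = store.get(key);
    // If "key" was saved with a different typed setter, this fails at
    // runtime (NumberFormatException) -- there is no compile-time check.
    return raw == null ? defaultValue : Long.parseLong(raw);
  }
}
{code}
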
Hence I think we should have a contract somewhere in the connector API through which the connector, given the context object, can update the appropriate configuration. A couple of ideas:

1) We currently call {{[Initializer.initialize()|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Initializer.java#L47]}} on every job initialization (in both the From and To context). We could allow the connector to change the given configuration objects, and if and only if the job is successful, we would persist the updated configuration objects in the repository via the normal update path (the same one used by the user). As the job submission is asynchronous, we might need to come up with a mechanism to persist the updated configuration objects with the Hadoop job itself and get them back later.

*Pros:* Seems relatively simple to implement, as we are already preserving a lot of information with the Hadoop job itself.
*Cons:* We would introduce a kind of "implied" or "secret" API, as the connector developer has to know that they are allowed to change the configuration objects.
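
To illustrate how "secret" that contract would be, here is a rough sketch from the connector side - the configuration classes, the field name, and the initializer below are hypothetical stand-ins of mine, not the real SDK types:

{code:java}
// Hypothetical stand-ins for the SDK configuration/context types, just so
// the sketch is self-contained.
class LinkConfiguration { }
class FromJobConfiguration {
  // Hypothetical value a connector might want to advance after each run.
  long lastImportedId;
}
class InitializerContext { }

// Under idea 1 the connector silently mutates the passed-in configuration
// object inside initialize(); Sqoop would persist the change only if the
// job succeeds. Nothing in the signature advertises that the mutation
// matters -- that is exactly the "implied" API concern.
class ExampleInitializer {
  public void initialize(InitializerContext context,
                         LinkConfiguration linkConfig,
                         FromJobConfiguration jobConfig) {
    jobConfig.lastImportedId = queryMaxId(); // the "secret" contract in action
  }

  private long queryMaxId() {
    return 42L; // placeholder for a real source-side query
  }
}
{code}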

2) Alternatively, we could expose an explicit API, {{updateConfigurationObjects(Context, LinkConfiguration, JobConfiguration)}} (proper name pending), that a connector developer could explicitly implement if they care about updating the configuration objects (a rough sketch of this contract follows after the sub-options below).

*Pros:* We have an explicit API with nicely defined semantics. We don't need to persist any additional information in the Hadoop job object.

As this API would make sense only after the job has successfully finished, we could:

2.1) Introduce it as part of [Destroyer|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Destroyer.java]

*Pros:* Updating the configuration objects is part of the clean-up phase, so it makes sense to have it as part of {{Destroyer}}.
*Cons:* Currently the {{Destroyer}} runs outside of the Sqoop 2 server, somewhere on the cluster. We would either have to move the {{Destroyer}} to be executed in the server, or call this particular method on a different instance of the {{Destroyer}} - and that might be a bit confusing.

2.2) Introduce a new part of the workflow that will be executed after the {{Destroyer}}. Something like an {{Updater}}.

*Pros:* We can easily run it on the Sqoop 2 server itself without moving the {{Destroyer}} or caring about where it runs.
*Cons:* It seems weird to have a part of the workflow that is executed after the finalization step, especially when it has the same semantics as the {{Destroyer}} (we will call it exactly once, on one node).
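
For completeness, here is a rough sketch of what the explicit contract from 2) could look like, whether it ends up in {{Destroyer}} or a new {{Updater}} - the interface name, the {{Context}} stand-in, and the method name are all tentative, per the "proper name pending" note above:

{code:java}
// Stand-in for the context handed back from the finished submission; the
// real type would come from the Sqoop SDK.
interface Context {
  String getString(String key);
}

// Tentative explicit contract: Sqoop would call this exactly once, on the
// server, after a successful submission, and then persist whatever the
// connector changed in the two configuration objects via the normal
// repository update path.
interface ConfigUpdater<LinkConfiguration, JobConfiguration> {
  void updateConfigurationObjects(Context context,
                                  LinkConfiguration linkConfiguration,
                                  JobConfiguration jobConfiguration);
}
{code}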

I'm sure there are other ways to expose this contract in the connector interface, so don't hesitate to jump in with other ideas!

> JobManager and Execution Engine changes: Support for a injecting and pulling out configs
and job output in connectors 
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-1803
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1803
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.6
>
>
> The details are in the design wiki; as the implementation happens, more discussion can happen here.
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+And+Merge+Design#DeltaFetchAndMergeDesign-Howtogetoutputfromconnectortosqoop?
> The goal is to dynamically inject an IncrementalConfig instance into the FromJobConfiguration. The current MFromConfig and MToConfig can already hold a list of configs, and a strong sentiment was expressed to keep it as a list - so why not, for the first time, actually make use of it and group the incremental-related configs in one config object?
> This task will prepare the FromJobConfiguration from the job config data, and the ExtractorContext with the relevant values from the previous job run.
> This task will prepare the ToJobConfiguration from the job config data, and the LoaderContext with the relevant values from the previous job run, if any.
> We will use DistributedCache to get state information out of the Extractor and Loader, and finally persist it into the Sqoop repository (depending on SQOOP-1804) once the OutputCommitter commit is called.



