sqoop-dev mailing list archives

From "Jarek Jarcec Cecho (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-2465) Initializer and Destroyer should know how many executors will run
Date Fri, 07 Aug 2015 22:12:47 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662539#comment-14662539 ]

Jarek Jarcec Cecho commented on SQOOP-2465:

I'm generally supportive of adding APIs that help connector developers build their connectors
:) I have a few high-level thoughts, though:

* The current design keeps the "From" and "To" sides independent from the connector developer's
perspective, and we should keep it that way. Hence the "From" side should not know how many loaders are
configured and, vice versa, the "To" side should not care how many extractors are running.
* Propagating the number of extractors to the "From" initializer seems like a completely
valid request. Similarly for the number of loaders to the "To" initializer.
* I kind of like the idea of creating an optional "To" "partitioner", even though I would not
call it a partitioner per se, as that's kind of confusing: we're not partitioning data in any
way, it's more about pre-creating temporary objects for each loader. I think that this one
is big enough on its own, so perhaps we should track it in a separate JIRA. I would love to see a more
detailed proposal :)
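
A minimal sketch of the first two points, under stated assumptions: the types `FromInitContext`, `ToInitContext` and `JobExecutorSettings` are hypothetical names made up for illustration and are not part of the Sqoop connector API. The idea is only that each side's initializer is handed its own executor count and nothing about the other side.

```java
// Hypothetical sketch: these types are illustrative and are not part of the Sqoop connector API.

/** What a "From" initializer would see: only the extractor count. */
final class FromInitContext {
    private final int extractorCount;

    FromInitContext(int extractorCount) {
        this.extractorCount = extractorCount;
    }

    int getExtractorCount() {
        return extractorCount;
    }
}

/** What a "To" initializer would see: only the loader count. */
final class ToInitContext {
    private final int loaderCount;

    ToInitContext(int loaderCount) {
        this.loaderCount = loaderCount;
    }

    int getLoaderCount() {
        return loaderCount;
    }
}

/** Job-level settings known to the framework; a connector never receives this object whole. */
final class JobExecutorSettings {
    private final int extractors;
    private final int loaders;

    JobExecutorSettings(int extractors, int loaders) {
        if (extractors < 1 || loaders < 1) {
            throw new IllegalArgumentException("executor counts must be positive");
        }
        this.extractors = extractors;
        this.loaders = loaders;
    }

    /** View for the "From" side: exposes extractors, hides loaders. */
    FromInitContext fromContext() {
        return new FromInitContext(extractors);
    }

    /** View for the "To" side: exposes loaders, hides extractors. */
    ToInitContext toContext() {
        return new ToInitContext(loaders);
    }
}

final class ExecutorCountDemo {
    public static void main(String[] args) {
        JobExecutorSettings job = new JobExecutorSettings(4, 2);
        System.out.println("extractors visible to From: " + job.fromContext().getExtractorCount());
        System.out.println("loaders visible to To: " + job.toContext().getLoaderCount());
    }
}
```

Splitting the settings into two narrow views is one way to keep the sides independent; a real change would more likely extend whatever context objects the framework already passes to initializers and destroyers.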

> Initializer and Destroyer should know how many executors will run
> -----------------------------------------------------------------
>                 Key: SQOOP-2465
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2465
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.99.6
>            Reporter: David Robson
> Looking at a job to load data into Oracle as an example - depending on the way the user
wants to load data, we may be loading data into temporary tables. For maximum performance
we need to create a separate temporary table for each loader - so when the initializer is
running we need to know how many loaders will run so we can create these temporary tables.
Again when the destroyer is run we will need to drop these temporary tables - so it will need
to know as well.
> Another example where we need to know this in the initializer - Oracle databases may
be Real Application Clusters (RAC), where there are multiple instances across multiple machines. For
both FROM and TO jobs we spread the load across these instances during the initialization
phase - so we need to know how many loaders / extractors will run.
> In the case of a FROM job we could do this in the partition phase - but there is no way
to achieve this for a TO job. It seems we could either add the information into the initialize
phase - or add a new partition phase on the TO side that is called after the partition phase
on the FROM side. It could take the details of the partitioned output and match it up to the
other side.
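
The two Oracle examples in the description can be sketched roughly as follows. Everything here is an assumption for illustration: the class `LoaderPlan`, the `_TMP_<n>` naming scheme and the round-robin spread over instances are hypothetical and are not taken from any existing Sqoop connector. The sketch only shows why the executor count has to be known before any loader or extractor starts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class name, naming scheme and round-robin policy are illustrative only.
final class LoaderPlan {
    private LoaderPlan() {
    }

    /** One temporary table name per loader: BASE_TMP_0 .. BASE_TMP_(loaderCount - 1). */
    static List<String> tempTableNames(String baseTable, int loaderCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < loaderCount; i++) {
            names.add(baseTable + "_TMP_" + i);
        }
        return names;
    }

    /** Round-robin spread: executor i is assigned instance number (i mod instances.size()). */
    static List<String> assignInstances(List<String> instances, int executorCount) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        List<String> assigned = new ArrayList<>();
        for (int i = 0; i < executorCount; i++) {
            assigned.add(instances.get(i % instances.size()));
        }
        return assigned;
    }
}

final class LoaderPlanDemo {
    public static void main(String[] args) {
        // The initializer would create these tables; the destroyer would drop the same list.
        System.out.println(LoaderPlan.tempTableNames("SALES", 3));
        // Three executors spread over a two-instance cluster.
        System.out.println(LoaderPlan.assignInstances(List.of("inst1", "inst2"), 3));
    }
}
```

Because the table names are derived only from the base name and the loader count, a destroyer that receives the same count can recompute the list and drop exactly the tables the initializer created.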

This message was sent by Atlassian JIRA
