sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gwen Shapira" <gshap...@cloudera.com>
Subject Re: Review Request 21898: Patch for SQOOP-1255 - https://issues.apache.org/jira/browse/SQOOP-1255
Date Mon, 21 Jul 2014 19:39:55 GMT


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > Good work Gwen! Couple of high level notes:
> > 
> > 1) Please always put the patch on JIRA as well, we do have pre-commit build that
will test your changes. Also we can't commit your changes unless they are attached to JIRA.
> > 2) The loader tool will load entire repository dump into memory which seems fine
for now, but we might need to think about file format that would enable us process the dump
in streaming fashion in the future.

2) I seriously seriously doubt we'll ever run into an issue here. Even large Sqoop users have
around 50 jobs, we can easily have a year of daily submissions in less than 0.5G RAM. 
But in the unlikely case we'll need it - Jackson has streaming JSON Parser. The only part
that may not fit into memory is the "submissions" records, and we can load them one-by-one
easily. 


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/JSONConstants.java, line 21
> > <https://reviews.apache.org/r/21898/diff/6/?file=635434#file635434line21>
> >
> >     I don't think that interface is the correct type here, what about final class?

Makes sense. 

Since most of those JSON constants are used all over the code, I'm wondering if it makes sense
to collect all of them into a class in org.apache.sqoop.json.util?

(As a separate Jira)


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryDumpTool.java, lines 143-147
> > <https://reviews.apache.org/r/21898/diff/6/?file=635435#file635435line143>
> >
> >     Wouldn't be cleaner to just use ConnectorManager.getConnector API to get the
connector name?
> >     
> >     https://github.com/apache/sqoop/blob/sqoop2/core/src/main/java/org/apache/sqoop/connector/ConnectorManager.java#L135

It looks like there are two interfaces to get connector metadata - one in ConnectorManager
and one in RepositoryManager. I was using the RepositoryManager everywhere here and didn't
notice that ConnectorManager actually provides better API.

I'm wondering if there's a good reason for the redundancy or whether we should clean this
up.


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryLoadTool.java, lines 137-144
> > <https://reviews.apache.org/r/21898/diff/6/?file=635436#file635436line137>
> >
> >     What about creating another parent tool that will initialize entire Sqoop infrastructure?
Something like the ConfiguredTool - we should be able to use the same tool for dump and load.

I'm not sure its a good idea - all the initialize methods take parameters, which may be different
for different tools.

In this case I'm initializing RepositoryManager as immutable in both dump and load tools,
but it should probably be mutable in load.


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryLoadTool.java, line 231
> > <https://reviews.apache.org/r/21898/diff/6/?file=635436#file635436line231>
> >
> >     I'm thinking if this check is indeed necessary. User might want to load older
backup to a different Sqoop version right?

Users may indeed want to (although it shouldn't be common - backup best practices include
a fresh backup immediately following a successful upgrade), it should work (we don't plan
on making backward incompatible changes) and its trivial to work around this check for any
sufficiently determined user (or support).

The reason its there is that otherwise we are implicitly promising that dump and load will
work between any two versions. I doubt we want to promise (and test!) that.

If you think its important, I can check that the repository is newer than the json. Importing
new dump into an old repo is much more likely to break (new fields in forms).


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryLoadTool.java, line 243
> > <https://reviews.apache.org/r/21898/diff/6/?file=635436#file635436line243>
> >
> >     Similar update code is already in the code base, so I'm wondering if it would
make sense to abstract and reuse it rather then have similar code on two places?
> >     
> >     https://github.com/apache/sqoop/blob/sqoop2/core/src/main/java/org/apache/sqoop/repository/Repository.java#L386

Lets do it in a separate Jira for refactoring? 
I don't want to add changes to the Repository into the scope of this patch.


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/JSONConstants.java, line 18
> > <https://reviews.apache.org/r/21898/diff/6/?file=635434#file635434line18>
> >
> >     I can't help by my OCD is complaining about the location of the file - it's
in generic package, but only two tools are using it. What about creating sub-package "repository"
and put repository related tools there?

See my comment below. I agree this isn't a good place, but I think the right place is outside
Tools completely, so same constants can be used by clients, and other parts of the framework
that need to work with JSON.


> On July 19, 2014, 9:06 p.m., Jarek Cecho wrote:
> > tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryLoadTool.java, lines 361-366
> > <https://reviews.apache.org/r/21898/diff/6/?file=635436#file635436line361>
> >
> >     Seems like good use case to use ConnectorManager.getConnector() API?
> >     
> >     https://github.com/apache/sqoop/blob/sqoop2/core/src/main/java/org/apache/sqoop/connector/ConnectorManager.java#L140

It would be, if it was possible. 

ConnectorManager.getConnector returns a connector. I need the connector PersistenceID, which
is part of the connector metadata.
ConnectorManager.getConnectorMetadata takes an ID as a parameter, but not a name. The repository
API has the same issue.

I'll open a separate JIRA for the cleanup. 


- Gwen


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21898/#review48179
-----------------------------------------------------------


On July 18, 2014, 4:11 p.m., Gwen Shapira wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/21898/
> -----------------------------------------------------------
> 
> (Updated July 18, 2014, 4:11 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> Added tool for dumping user-generated data - connections, jobs and submissions. There's
an option to dump sensitive data (i.e. passwords) as well. 
> 
> 
> Diffs
> -----
> 
>   docs/src/site/sphinx/Tools.rst ad72cd1 
>   pom.xml 1e2f005 
>   tools/pom.xml 31eda1c 
>   tools/src/main/java/org/apache/sqoop/tools/tool/BuiltinTools.java b24cb35 
>   tools/src/main/java/org/apache/sqoop/tools/tool/JSONConstants.java PRE-CREATION 
>   tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryDumpTool.java PRE-CREATION

>   tools/src/main/java/org/apache/sqoop/tools/tool/RepositoryLoadTool.java PRE-CREATION

> 
> Diff: https://reviews.apache.org/r/21898/diff/
> 
> 
> Testing
> -------
> 
> Manual testing. Dumping repository with and without sensitive data. Validating resulting
JSON.
> 
> 
> Thanks,
> 
> Gwen Shapira
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message