sqoop-dev mailing list archives

From "Arvind Prabhakar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-331) Support boundary query on the command line
Date Tue, 06 Sep 2011 18:53:09 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098262#comment-13098262 ]

Arvind Prabhakar commented on SQOOP-331:
----------------------------------------

Thanks for the excellent patch, Jarek. It is almost ready for commit except for a few things:

* The free-form query case also needs to be handled. This belongs in the {{DataDrivenImportJob}}
class, within the {{else}} block below your current modifications. Here is a sample diff that
I cooked up quickly to convey this point:
{noformat}
--- src/java/com/cloudera/sqoop/mapreduce/DataDrivenImportJob.java	(revision 1165478)
+++ src/java/com/cloudera/sqoop/mapreduce/DataDrivenImportJob.java	(working copy)
@@ -153,15 +153,26 @@
         DataDrivenDBInputFormat.setInput(job, DBWritable.class,
             mgr.escapeTableName(tableName), whereClause,
             mgr.escapeColName(splitByCol), sqlColNames);
+
+        // If user specified boundary query on the command line propagate it to
+        // the job
+        if(options.getBoundaryQuery() != null) {
+          DataDrivenDBInputFormat.setBoundingQuery(job.getConfiguration(),
+                  options.getBoundaryQuery());
+        }
       } else {
         // Import a free-form query.
         String inputQuery = options.getSqlQuery();
         String sanitizedQuery = inputQuery.replace(
             DataDrivenDBInputFormat.SUBSTITUTE_TOKEN, " (1 = 1) ");
 
-        String inputBoundingQuery =
-            mgr.getInputBoundsQuery(splitByCol, sanitizedQuery);
+        String inputBoundingQuery = options.getBoundaryQuery();
+
         if (inputBoundingQuery == null) {
+          inputBoundingQuery =
+              mgr.getInputBoundsQuery(splitByCol, sanitizedQuery);
+        }
+        if (inputBoundingQuery == null) {
             inputBoundingQuery = "SELECT MIN(" + splitByCol + "), MAX("
                     + splitByCol + ") FROM (" + sanitizedQuery + ") AS t1";
         }

{noformat}

* Second: since you have introduced a new command line option, the user guide and man pages
need to be updated. These are located under the {{src/docs}} directory and can be built
using the {{ant docs}} target. In order to build them, though, you need to have {{asciidoc}}
installed on your machine.
* Bonus nit: there is a checkstyle violation in ImportTool:530 where the line is longer than
80 characters.
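
For reference, the fallback order that the free-form branch in the sample diff is aiming for can
be sketched as a small standalone Java class. This is only an illustration under stated
assumptions: {{BoundaryQueryResolver}} and {{resolve}} are hypothetical names that do not exist
in Sqoop, and the manager-supplied query is passed in as a plain string instead of being
obtained from {{mgr.getInputBoundsQuery}}.

```java
/**
 * Hypothetical helper, not part of Sqoop: picks the boundary query
 * for a free-form query import.
 */
public final class BoundaryQueryResolver {

  private BoundaryQueryResolver() { }

  /**
   * Returns the first available candidate in priority order:
   * the user-supplied query, then the manager-specific query,
   * then a generated MIN/MAX query over the sanitized input query.
   */
  public static String resolve(String userQuery, String managerQuery,
      String splitByCol, String sanitizedQuery) {
    if (userQuery != null) {
      // A boundary query given on the command line always wins.
      return userQuery;
    }
    if (managerQuery != null) {
      // Otherwise use whatever the connection manager suggested.
      return managerQuery;
    }
    // Last resort: the default query, built as in the diff above.
    return "SELECT MIN(" + splitByCol + "), MAX(" + splitByCol
        + ") FROM (" + sanitizedQuery + ") AS t1";
  }
}
```

Calling {{BoundaryQueryResolver.resolve(null, null, "id", query)}} falls through to the generated
query, while any non-null first argument is returned unchanged.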

Apart from that everything looks great. Some suggestions going forward:
* Usually, when introducing new functionality, at least one test that exercises that
functionality is required. The test you have added is good, but it does not really exercise
the functionality.
* We use Apache Review Board (https://reviews.apache.org/) to post reviews for patches that
are longer than a few lines. This helps the reviewers give contextual feedback where necessary.


Please let me know if you have any questions for me on these suggestions.


> Support boundary query on the command line
> ------------------------------------------
>
>                 Key: SQOOP-331
>                 URL: https://issues.apache.org/jira/browse/SQOOP-331
>             Project: Sqoop
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 1.4.0
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Jarek Jarcec Cecho
>         Attachments: SQOOP-331.patch
>
>
> It would be nice if Sqoop allowed the user to specify, on the command line, the query that
> fetches the minimal and maximal values used to create splits in DataDrivenDBInputFormat.
> Normally Sqoop generates the query for the minimal and maximal values in the following
> form: SELECT min($split_by_column), max($split_by_column) FROM $table WHERE $cmd_where.
> In my use case, I needed to import only a portion of the data, with ranges based on the
> split_by_column that I had already preselected and that are available in a special table
> holding data ranges and the appropriate primary key values. So my auto-generated query
> looked like this: SELECT min(id), max(id) FROM table WHERE id >= min_id and id <= max_id.
> That query is obviously useless and just creates unnecessary load on the database server.
> It would be nice to supply my own boundary query that uses the extra table with data ranges.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
