sqoop-dev mailing list archives

From "Jarek Jarcec Cecho (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SQOOP-331) Support boundary query on the command line
Date Sun, 04 Sep 2011 18:37:09 GMT

     [ https://issues.apache.org/jira/browse/SQOOP-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarek Jarcec Cecho updated SQOOP-331:
-------------------------------------

    Status: Patch Available  (was: Open)

I've added a new parameter for the boundary query and propagated it to all places where it
looked needed. However, I'm not sure that I've propagated it everywhere, so it would be nice
to get some feedback here.

Also, since the boundary query is only an optional configuration property and might not be
used at all, I'm not sure how to construct tests for it. So far I've included only a test for
parsing the parameters.

Any sort of feedback would be greatly appreciated.

> Support boundary query on the command line
> ------------------------------------------
>
>                 Key: SQOOP-331
>                 URL: https://issues.apache.org/jira/browse/SQOOP-331
>             Project: Sqoop
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 1.4.0
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Jarek Jarcec Cecho
>         Attachments: SQOOP-331.patch
>
>
> It would be nice if Sqoop could accept, from the command line, a query that fetches the
> minimal and maximal values used for creating splits in DataDrivenDBInputFormat.
> Normally Sqoop generates the query for the minimal and maximal split values in the
> following form: SELECT min($split_by_column), max($split_by_column) FROM $table WHERE $cmd_where.
> In my use case, I needed to import only a portion of the data, with ranges based on the
> split_by_column that I had already preselected and that are available in a special table
> holding the data ranges and the corresponding primary key values. So my auto-generated
> query looked like this: SELECT min(id), max(id) FROM table WHERE id >= min_id and id <= max_id.
> That query is obviously useless and just creates unnecessary load on the database server.
> It would be nice to supply my own boundary query that uses the extra table with the data ranges.
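
To make the difference concrete, here is a minimal sketch of the two query shapes discussed
in the description. It only builds SQL strings and is not Sqoop's actual code. The table name
`id_ranges`, its columns `min_id` and `max_id`, and the literal bounds 1000 and 2000 are
hypothetical and stand in for the reporter's range table and preselected values.

```python
def default_boundary_query(split_by, table, where=None):
    """Build the min/max query in the form quoted in the issue description."""
    query = f"SELECT min({split_by}), max({split_by}) FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query


def custom_boundary_query(range_table, min_col="min_id", max_col="max_id"):
    """Hypothetical replacement: read precomputed bounds from a ranges table."""
    return f"SELECT min({min_col}), max({max_col}) FROM {range_table}"


# The auto-generated form scans the data table for bounds that are already known.
auto = default_boundary_query("id", "table", "id >= 1000 AND id <= 2000")

# A user-supplied form could read the same bounds from the small ranges table instead.
custom = custom_boundary_query("id_ranges")

print(auto)
print(custom)
```

Both queries are expected to return a single row holding the lower and upper split bounds;
the second one avoids touching the large data table at all.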

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
