sqoop-user mailing list archives

From "lizhanqiang@inspur.com" <lizhanqi...@inspur.com>
Subject Re: Re: the confusion of --split-by parameter
Date Wed, 10 Sep 2014 00:31:05 GMT


Hey, brother.
  Glad to hear from you! I think we can use limit/offset (if the database supports it), or
we can use a subquery (if the database does not support limit/offset).
For example:
For MySQL: select * from table limit 0,5; select * from table limit 5,5; ...
For Oracle we can use rownum.
I just cannot understand why Sqoop overrides the approach above. This override can lead to
data skew.
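
The limit/offset idea above can be sketched roughly as follows. This is only an illustration of count-based splitting, not Sqoop code: the function names, the table name `t` and the ordering column `id` are hypothetical, and the total row count is assumed to come from a prior `select count(*)`. Note that MySQL's `LIMIT offset, count` takes a row count as its second argument, and that an ORDER BY is assumed so every map sees the rows in the same order.

```python
def limit_offset_splits(total_rows, num_maps):
    """Return (offset, row_count) pairs that cover total_rows rows exactly once."""
    base, extra = divmod(total_rows, num_maps)
    splits = []
    offset = 0
    for i in range(num_maps):
        # Spread the remainder over the first `extra` splits.
        count = base + (1 if i < extra else 0)
        splits.append((offset, count))
        offset += count
    return splits


def mysql_queries(table, order_col, total_rows, num_maps):
    """Build one MySQL-style paging query per map (hypothetical helper)."""
    # Without a stable ORDER BY, pages from separate queries may overlap or miss rows.
    return [
        f"SELECT * FROM {table} ORDER BY {order_col} LIMIT {offset}, {count}"
        for offset, count in limit_offset_splits(total_rows, num_maps)
    ]


if __name__ == "__main__":
    for query in mysql_queries("t", "id", 10, 3):
        print(query)
```

Every map gets nearly the same number of rows regardless of how the column values are distributed, but each query still has to sort and skip past the rows before its offset.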
 
From: Abraham Elmahrek
Date: 2014-09-10 00:38
To: user@sqoop.apache.org
Subject: Re: the confusion of --split-by parameter
Hey there,

For databases, there needs to be a way to actually infer boundaries for a particular column.
Simply performing a "select *" would not be enough because we would not know how to query
the database.

-Abe
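
As a rough illustration of what inferring boundaries for a column means, here is a minimal sketch, assuming an integer split column whose minimum and maximum are already known (for instance from a query like `select min(id), max(id)`). It is not Sqoop's actual splitter; the function names and the column name `id` are made up for the example.

```python
def range_splits(lo, hi, num_maps):
    """Cut the integer interval [lo, hi] into num_maps contiguous half-open ranges."""
    width = hi - lo + 1
    # Integer arithmetic keeps the bounds exact; the last bound is hi + 1.
    bounds = [lo + width * i // num_maps for i in range(num_maps + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_maps)]


def where_clauses(col, lo, hi, num_maps):
    """Turn each range into a WHERE condition that a map task could run on its own."""
    return [
        f"{col} >= {start} AND {col} < {end}"
        for start, end in range_splits(lo, hi, num_maps)
    ]


if __name__ == "__main__":
    for clause in where_clauses("id", 1, 100, 4):
        print(clause)
```

Each map can then query its own range independently, but the ranges are equal in width, not in row count.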

On Mon, Sep 8, 2014 at 8:33 PM, lizhanqiang@inspur.com <lizhanqiang@inspur.com> wrote:
Hi, all.
   In Sqoop we can specify the --split-by parameter, which determines which field is used
to split the records among the map tasks.
But if the data in the split field is skewed, the workload across the maps will be imbalanced.
I want to know why Sqoop does not use
select count(*) from table / num-maps to determine each map's workload. As far as I know, a
base class of DataDrivenDBInputFormat
implements the select count(*) from table / num-maps approach. So why does Sqoop override it?
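
The skew concern in the question can be made concrete with a small sketch. It assumes the split column is an integer and that the range between its minimum and maximum is cut into equal-width pieces, one per map; the function name and the sample data are invented for illustration.

```python
def rows_per_split(values, num_maps):
    """Count how many values fall into each of num_maps equal-width ranges."""
    lo, hi = min(values), max(values)
    width = hi - lo + 1
    counts = [0] * num_maps
    for v in values:
        # Index of the equal-width range containing v; always < num_maps.
        counts[(v - lo) * num_maps // width] += 1
    return counts


if __name__ == "__main__":
    uniform = list(range(1, 101))            # ids 1..100, evenly spread
    skewed = list(range(1, 91)) + [1000]     # 90 ids clustered low, one outlier
    print(rows_per_split(uniform, 4))        # [25, 25, 25, 25]
    print(rows_per_split(skewed, 4))         # [90, 0, 0, 1]
```

With the skewed data, one map receives 90 of the 91 rows while two maps receive none, which is the imbalance described above.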


