sqoop-dev mailing list archives

From "Nilesh Maheshwari (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SQOOP-1312) One of mappers does not load data from mySql if double column is used as split key
Date Tue, 23 Aug 2016 21:09:22 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433605#comment-15433605 ]

Nilesh Maheshwari edited comment on SQOOP-1312 at 8/23/16 9:08 PM:
-------------------------------------------------------------------

[~jarcec] - I am running into a similar issue when extracting data from an SAP Sybase database.
In my case, Sqoop splits on the following conditions on the `invoiceid` column, which has a
float datatype:

16/08/23 20:48:55 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound 'invoiceid >= 1054.0' and upper bound 'invoiceid < 469505.75'
16/08/23 20:48:55 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound 'invoiceid >= 469505.75' and upper bound 'invoiceid < 937957.5'
16/08/23 20:48:55 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound 'invoiceid >= 937957.5' and upper bound 'invoiceid < 1406409.25'
16/08/23 20:48:55 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound 'invoiceid >= 1874861.0' and upper bound 'invoiceid <= 1874861.0'

Note the last split's filter condition: its lower bound (1874861.0) is not the same as the
previous split's upper bound (1406409.25). No mapper reads the rows between those two values,
so they are missing from the Sqoop import.

In this case the min and max values of invoiceid are 1054 and 1874861 respectively.

This is a major issue when importing tables with float-type primary key columns: rows are
dropped without any warnings or errors.
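
The logged bounds can be reproduced with a small sketch. This is a simplified reconstruction based on the quoted issue description (a `while (curUpper < maxVal)` loop whose final split builds its lower bound from `curUpper`), not the Sqoop 1.4.6 source; the class name `SplitGapDemo` and the method `splitBounds` are hypothetical, and a mapper count of 4 is assumed because the log shows four splits.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGapDemo {
    // Hypothetical reconstruction of the splitting logic described in the issue.
    // Returns one WHERE-style condition per split.
    static List<String> splitBounds(String col, double minVal, double maxVal, int numSplits) {
        List<String> bounds = new ArrayList<>();
        double splitSize = (maxVal - minVal) / numSplits;
        double curLower = minVal;
        double curUpper = curLower + splitSize;
        while (curUpper < maxVal) {
            bounds.add(col + " >= " + curLower + " AND " + col + " < " + curUpper);
            curLower = curUpper;
            curUpper += splitSize;
        }
        // The final closed split starts at curUpper instead of curLower,
        // so rows in [curLower, curUpper) are never selected.
        bounds.add(col + " >= " + curUpper + " AND " + col + " <= " + maxVal);
        return bounds;
    }

    public static void main(String[] args) {
        // min, max and column name taken from the comment above.
        for (String b : splitBounds("invoiceid", 1054, 1874861, 4)) {
            System.out.println(b);
        }
    }
}
```

With these inputs the loop exits with `curLower` at 1406409.25 and `curUpper` at 1874861.0, so the last condition only matches rows where invoiceid equals the maximum.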

Sqoop version: 1.4.6






> One of mappers does not load data from mySql if double column is used as split key
> ----------------------------------------------------------------------------------
>
>                 Key: SQOOP-1312
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1312
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.4
>            Reporter: Jong Ho Lee
>            Assignee: Jong Ho Lee
>         Attachments: splitter.patch, splitter.patch
>
>
> When we used Sqoop at Samsung SDS to load data from MySQL with a double column as the split key,
>   the last mapper did not load any data from MySQL.
>   The number of mappers was also sometimes increased by 1.
>   I think both problems are caused by bugs in FloatSplitter.java.
>   For the last split, lowClausePrefix + Double.toString(curUpper) should be
>   lowClausePrefix + Double.toString(curLower).
>   In the while (curUpper < maxVal) loop, round-off error can make
>   minVal + splitSize * numSplits smaller than maxVal, which adds an extra split.
>   Therefore, using a for loop would be better.
>   Attached is a proposed new FloatSplitter.java:
> {code}
> /**
>  * Licensed to the Apache Software Foundation (ASF) under one
>  * or more contributor license agreements.  See the NOTICE file
>  * distributed with this work for additional information
>  * regarding copyright ownership.  The ASF licenses this file
>  * to you under the Apache License, Version 2.0 (the
>  * "License"); you may not use this file except in compliance
>  * with the License.  You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> // modified by Jongho Lee at Samsung SDS.
> package org.apache.sqoop.mapreduce.db;
> import java.sql.ResultSet;
> import java.sql.SQLException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.InputSplit;
> import com.cloudera.sqoop.config.ConfigurationHelper;
> import com.cloudera.sqoop.mapreduce.db.DBSplitter;
> import com.cloudera.sqoop.mapreduce.db.DataDrivenDBInputFormat;
> /**
>  * Implement DBSplitter over floating-point values.
>  */
> public class FloatSplitter implements DBSplitter  {
>   private static final Log LOG = LogFactory.getLog(FloatSplitter.class);
>   private static final double MIN_INCREMENT = 10000 * Double.MIN_VALUE;
>   public List<InputSplit> split(Configuration conf, ResultSet results,
>       String colName) throws SQLException {
>     LOG.warn("Generating splits for a floating-point index column. Due to the");
>     LOG.warn("imprecise representation of floating-point values in Java, this");
>     LOG.warn("may result in an incomplete import.");
>     LOG.warn("You are strongly encouraged to choose an integral split column.");
>     List<InputSplit> splits = new ArrayList<InputSplit>();
>     if (results.getString(1) == null && results.getString(2) == null) {
>       // Range is null to null. Return a null split accordingly.
>       splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit(
>           colName + " IS NULL", colName + " IS NULL"));
>       return splits;
>     }
>     double minVal = results.getDouble(1);
>     double maxVal = results.getDouble(2);
>     // Use this as a hint. May need an extra task if the size doesn't
>     // divide cleanly.
>     int numSplits = ConfigurationHelper.getConfNumMaps(conf);
>     double splitSize = (maxVal - minVal) / (double) numSplits;
>     if (splitSize < MIN_INCREMENT) {
>       splitSize = MIN_INCREMENT;
>     }
>     String lowClausePrefix = colName + " >= ";
>     String highClausePrefix = colName + " < ";
>     double curLower = minVal;
>     double curUpper = curLower + splitSize;
>     for (int i = 0; i < numSplits - 1; i++) {
>       // while (curUpper < maxVal) {  // changed to for loop
>       splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit(
>           lowClausePrefix + Double.toString(curLower),
>           highClausePrefix + Double.toString(curUpper)));
>       curLower = curUpper;
>       curUpper += splitSize;
>     }
>     // Catch any overage and create the closed interval for the last split.
>     if (curLower <= maxVal || splits.size() == 1) {
>       splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit(
>           lowClausePrefix + Double.toString(curLower),
>           colName + " <= " + Double.toString(maxVal)));
>       // curUpper -> curLower // changed
>     }
>     if (results.getString(1) == null || results.getString(2) == null) {
>       // At least one extrema is null; add a null split.
>       splits.add(new DataDrivenDBInputFormat.DataDrivenDBInputSplit(
>           colName + " IS NULL", colName + " IS NULL"));
>     }
>     return splits;
>   }
> }
> {code}
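
The round-off effect described in the issue can be seen with a minimal sketch that compares the two loop shapes. It assumes a range of 0.0 to 1.0 and 10 requested splits, values chosen only for illustration; the class `SplitCountDemo` and both method names are hypothetical and are not part of Sqoop.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountDemo {
    // Counts splits produced by a while (curUpper < maxVal) loop
    // plus the final closed split.
    static int whileLoopSplitCount(double minVal, double maxVal, int numSplits) {
        double splitSize = (maxVal - minVal) / numSplits;
        double curUpper = minVal + splitSize;
        int count = 0;
        while (curUpper < maxVal) {
            count++;
            curUpper += splitSize;
        }
        return count + 1;
    }

    // Builds {lower, upper} pairs with a fixed-count for loop, as the
    // proposed patch does; the last pair is the closed interval up to maxVal.
    static List<double[]> forLoopSplits(double minVal, double maxVal, int numSplits) {
        List<double[]> splits = new ArrayList<>();
        double splitSize = (maxVal - minVal) / numSplits;
        double curLower = minVal;
        double curUpper = curLower + splitSize;
        for (int i = 0; i < numSplits - 1; i++) {
            splits.add(new double[] {curLower, curUpper});
            curLower = curUpper;
            curUpper += splitSize;
        }
        // The final split starts at curLower, so it is contiguous with the previous one.
        splits.add(new double[] {curLower, maxVal});
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(whileLoopSplitCount(0.0, 1.0, 10)); // prints 11
        System.out.println(forLoopSplits(0.0, 1.0, 10).size()); // prints 10
    }
}
```

Adding 0.1 ten times in double arithmetic yields a value just below 1.0, so the while loop runs one extra time and produces an eleventh split. The for loop fixes the count at the requested number, and starting the last split at `curLower` keeps every split's lower bound equal to the previous split's upper bound.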



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
