hadoop-common-issues mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [hadoop] kai33 commented on a change in pull request #919: HADOOP-16158. DistCp to support checksum validation when copy blocks in parallel
Date Fri, 16 Aug 2019 19:27:10 GMT
kai33 commented on a change in pull request #919: HADOOP-16158. DistCp to support checksum
validation when copy blocks in parallel
URL: https://github.com/apache/hadoop/pull/919#discussion_r314858134
 
 

 ##########
 File path: hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
 ##########
 @@ -583,6 +583,66 @@ public static boolean checksumsAreEqual(FileSystem sourceFS, Path source,
             sourceChecksum.equals(targetChecksum));
   }
 
+  /**
+   * Utility to compare file lengths and checksums for source and target.
+   *
+   * @param sourceFS FileSystem for the source path.
+   * @param source The source path.
+   * @param sourceChecksum The checksum of the source file. If it is null we
+   * still need to retrieve it through sourceFS.
+   * @param targetFS FileSystem for the target path.
+   * @param target The target path.
+   * @param skipCrc The flag to indicate whether to skip checksums.
+   * @throws IOException if there's a mismatch in file lengths or checksums.
+   */
+  public static void compareFileLengthsAndChecksums(
+      FileSystem sourceFS, Path source, FileChecksum sourceChecksum,
+      FileSystem targetFS, Path target, boolean skipCrc) throws IOException {
+    long srcLen = sourceFS.getFileStatus(source).getLen();
+    long tgtLen = targetFS.getFileStatus(target).getLen();
+    if (srcLen != tgtLen) {
+      throw new IOException(
+          "Mismatch in length of source:" + source + " (" + srcLen
+              + ") and target:" + target + " (" + tgtLen + ")");
+    }
+
+    //At this point, src & dest lengths are same. if length==0, we skip checksum
+    if ((srcLen != 0) && (!skipCrc)) {
+      if (!checksumsAreEqual(sourceFS, source, sourceChecksum,
+          targetFS, target)) {
+        StringBuilder errorMessage =
+            new StringBuilder("Checksum mismatch between ")
+                .append(source).append(" and ").append(target).append(".");
+        boolean addSkipHint = false;
+        String srcScheme = sourceFS.getScheme();
+        String targetScheme = targetFS.getScheme();
+        if (!srcScheme.equals(targetScheme)) {
+          // the filesystems are different and they aren't both hdfs connectors
+          errorMessage.append("Source and destination filesystems are of"
+              + " different types\n")
+              .append("Their checksum algorithms may be incompatible");
+          addSkipHint = true;
+        } else if (sourceFS.getFileStatus(source).getBlockSize() !=
+            targetFS.getFileStatus(target).getBlockSize()) {
+          errorMessage.append(" Source and target differ in block-size.\n")
+              .append(" Use -pb to preserve block-sizes during copy.");
+          addSkipHint = true;
+        }
+        if (addSkipHint) {
+          errorMessage
+              .append(" You can choose file-level checksum validation via "
+                  + "-Ddfs.checksum.combine.mode=COMPOSITE_CRC when block-sizes"
+                  + " or filesystems are different.")
+              .append(" Or you can skip checksum-checks altogether "
+                  + " with -skipcrccheck.\n")
+              .append(" (NOTE: By skipping checksums, one runs the risk of " +
+                  "masking data-corruption during file-transfer.)\n");
+        }
+        throw new IOException(errorMessage.toString());
 
 Review comment:
   Correct me if I'm wrong, but when dfs.checksum.combine.mode=COMPOSITE_CRC is set,
different filesystems and block sizes are already tolerated, so no exception is thrown.
I can add a DistCp test to confirm this.
   
   Here is the call chain showing how it's handled:
   DistCpUtils.compareFileLengthsAndChecksums (new in this PR)
   -> DistCpUtils.checksumsAreEqual (existing, [link](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L567))
   -> FileSystem.getFileChecksum
   -> (for DistributedFileSystem) DFSClient.getFileChecksumWithCombineMode (where the
checksum combine mode is handled)
   
   So when the checksum combine mode is set, the combining happens at the FileSystem layer
and should be transparent to DistCp.
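
   To illustrate why the combine mode matters for differing block sizes, here is a
minimal, stdlib-only sketch. It is an illustration under stated assumptions, not Hadoop's
implementation: the class and method names are hypothetical, blockDependentChecksum only
mimics the shape of a checksum built from per-block CRCs, and blockIndependentChecksum
stands in for COMPOSITE_CRC by streaming one CRC32 across the blocks rather than
mathematically combining per-block CRCs as HDFS does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical demo class; none of these names exist in Hadoop.
final class ChecksumBlockSizeDemo {

  /** MD5 over the sequence of per-block CRC32 values: the result depends on blockSize. */
  public static byte[] blockDependentChecksum(byte[] data, int blockSize) {
    MessageDigest md5;
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available on this JVM", e);
    }
    for (int off = 0; off < data.length; off += blockSize) {
      int len = Math.min(blockSize, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      // Feed each block's CRC (as 8 bytes) into the outer digest.
      md5.update(ByteBuffer.allocate(Long.BYTES).putLong(crc.getValue()).array());
    }
    return md5.digest();
  }

  /** One CRC32 fed block by block: the result is the same for any blockSize. */
  public static long blockIndependentChecksum(byte[] data, int blockSize) {
    CRC32 crc = new CRC32();
    for (int off = 0; off < data.length; off += blockSize) {
      crc.update(data, off, Math.min(blockSize, data.length - off));
    }
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] data = "the same file contents on source and target"
        .getBytes(StandardCharsets.UTF_8);

    // Identical bytes, but "source" uses 8-byte blocks and "target" 16-byte blocks.
    boolean perBlockMatch = Arrays.equals(
        blockDependentChecksum(data, 8), blockDependentChecksum(data, 16));
    boolean wholeFileMatch =
        blockIndependentChecksum(data, 8) == blockIndependentChecksum(data, 16);

    System.out.println("block-dependent checksums match:   " + perBlockMatch);
    System.out.println("block-independent checksums match: " + wholeFileMatch);
  }
}
```

   In this sketch the block-dependent variant reports a mismatch for identical data,
which is the situation the hint in the error message addresses, while the block-independent
variant agrees regardless of how the data was split.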

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

