hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13761) S3Guard: implement retries for DDB failures and throttling; translate exceptions
Date Fri, 02 Mar 2018 04:52:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383164#comment-16383164 ]

Aaron Fabbri commented on HADOOP-13761:
---------------------------------------

v12 patch attached; it eliminates nested retries. Changes from v11:

- reopen(): RetryTranslated -> OnceTranslated. Uses once() instead of retry(). All callers (lazySeek() and the callers of onReadFailure()) have their own higher-level retries; see the sketch after this list.

- onReadFailure(): RetryTranslated -> OnceTranslated, because reopen() no longer has internal retries.

- Move lazySeek() in read() above the retried lambda; this eliminates a nested retry and matches what read(b, off, len) does.

- Remove reopen() from read(b, off, len) after the wrapped stream throws EOF. *This was existing code I changed*: I don't understand why you'd reopen() on EOF, [~stevel@apache.org]?

Tested in us-west-2. I was able to reproduce the test timeout for ITestS3AFailureHandling, which now passes for me.

Running on little sleep, but I wanted to get a patch posted in time for European Friday code review.

Here is the delta from v11:
{noformat}
diff -ur ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
--- ./src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java        2018-03-01 20:34:33.000000000 -0800
+++ /Users/fabbri/Code/hadoop/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java        2018-03-01 20:09:05.000000000 -0800
@@ -154,7 +154,7 @@
    * @param length length requested
    * @throws IOException on any failure to open the object
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private synchronized void reopen(String reason, long targetPos, long length)
       throws IOException {

@@ -178,7 +178,7 @@
     }
     String text = String.format("Failed to %s %s at %d",
         (opencount == 0 ? "open" : "re-open"), uri, targetPos);
-    S3Object object = context.invoker.retry(text, uri, true,
+    S3Object object = context.getReadInvoker().once(text, uri,
         () -> client.getObject(request));
     wrappedStream = object.getObjectContent();
     contentRangeStart = targetPos;
@@ -349,6 +349,12 @@
       return -1;
     }

+    try {
+      lazySeek(nextReadPos, 1);
+    } catch (EOFException e) {
+      return -1;
+    }
+
     // With S3Guard, the metadatastore gave us metadata for the file in
     // open(), so we use a slightly different retry policy.
     // read() may not be likely to fail, but reopen() does a GET which
@@ -358,7 +364,6 @@
         () -> {
           int b;
           try {
-            lazySeek(nextReadPos, 1);
             b = wrappedStream.read();
           } catch (EOFException e) {
             return -1;
@@ -387,7 +392,7 @@
    * @param length length of data being attempted to read
    * @throws IOException any exception thrown on the re-open attempt.
    */
-  @Retries.RetryTranslated
+  @Retries.OnceTranslated
   private void onReadFailure(IOException ioe, int length) throws IOException {

     LOG.info("Got exception while trying to read from stream {}" +
@@ -439,7 +444,6 @@
           try {
             bytes = wrappedStream.read(buf, off, len);
           } catch (EOFException e) {
-            onReadFailure(e, len);
             // the base implementation swallows EOFs.
             return -1;
           } catch (IOException e) {

{noformat}

> S3Guard: implement retries for DDB failures and throttling; translate exceptions
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-13761
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13761
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0-beta1
>            Reporter: Aaron Fabbri
>            Assignee: Aaron Fabbri
>            Priority: Blocker
>         Attachments: HADOOP-13761-004-to-005.patch, HADOOP-13761-005-to-006-approx.diff.txt, HADOOP-13761-005.patch, HADOOP-13761-006.patch, HADOOP-13761-007.patch, HADOOP-13761-008.patch, HADOOP-13761-009.patch, HADOOP-13761-010.patch, HADOOP-13761-010.patch, HADOOP-13761-011.patch, HADOOP-13761-012.patch, HADOOP-13761.001.patch, HADOOP-13761.002.patch, HADOOP-13761.003.patch, HADOOP-13761.004.patch
>
>
> Following the S3AFileSystem integration patch in HADOOP-13651, we need to add retry logic.
> In HADOOP-13651, I added TODO comments in most of the places where retry loops are needed, including:
> - open(path). If the MetadataStore reflects a recent create/move of the file path, but we fail to read it from S3, retry.
> - delete(path). If deleteObject() on S3 fails, but the MetadataStore shows the file exists, retry.
> - rename(src,dest). If the source path is not visible in S3 yet, retry.
> - listFiles(). Skip for now; not currently implemented in S3Guard. I will create a separate JIRA for this, as it will likely require interface changes (i.e. a prefix or subtree scan).
> We may miss some cases initially, so we should do failure-injection testing to make sure we're covered. Failure-injection tests can be a separate JIRA to make this easier to review.
> We also need basic configuration parameters around the retry policy. There should be a way to specify a maximum retry duration, as some applications would prefer to receive an error eventually rather than wait indefinitely. We should also keep statistics when inconsistency is detected and we enter a retry loop.
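
The "maximum retry duration" requirement in the description above can be sketched independently of the patch. Everything in this snippet is hypothetical (the helper names and the fixed-backoff policy are illustrative, not taken from the patch or the S3A code); it only shows the shape of a duration-bounded retry loop:

{code}
// Hypothetical sketch of a duration-bounded retry loop, as described in the
// issue text above. Not from the actual patch; all names are illustrative.
import java.io.IOException;

public class BoundedRetry {

  @FunctionalInterface
  interface Op<T> {
    T execute() throws IOException;
  }

  /**
   * Retry op until it succeeds or maxDurationMs elapses, so callers get
   * an error eventually instead of waiting indefinitely.
   */
  static <T> T retryFor(long maxDurationMs, long sleepMs, Op<T> op)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxDurationMs;
    while (true) {
      try {
        return op.execute();
      } catch (IOException e) {
        if (System.currentTimeMillis() + sleepMs > deadline) {
          throw e;  // out of time: surface the last failure to the caller
        }
        Thread.sleep(sleepMs);  // fixed backoff; real policies are often exponential
      }
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    try {
      retryFor(2_000, 200, () -> {
        throw new IOException("simulated S3 inconsistency");
      });
    } catch (IOException e) {
      System.out.printf("gave up after %d ms: %s%n",
          System.currentTimeMillis() - start, e.getMessage());
    }
  }
}
{code}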



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
