flink-issues mailing list archives

From "Vishnu Viswanath (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4660) HadoopFileSystem (with S3A) may leak connections, which cause job to stuck in a restarting loop
Date Tue, 26 Sep 2017 09:09:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180517#comment-16180517 ]

Vishnu Viswanath commented on FLINK-4660:
-----------------------------------------

In which version is this fixed? I am using 1.3.1 and getting a similar exception when reading an input split from S3.
{code}
2017-09-26 08:47:27,220 INFO  org.apache.flink.api.common.io.LocatableInputSplitAssigner  - Assigning remote split to host ip-10-150-98-185
2017-09-26 08:47:27,344 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph      - CHAIN DataSource (at .......Job$$anonfun$main$4$$anonfun$apply$3.apply(Job.scala:138) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at ......sources.SourceSelector$.selectSource(SourceSelector.scala:17)) -> Map (from: ....) (6/8) (df8e44219270f80170e6d027b77b246f) switched from RUNNING to FAILED.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
	at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
	at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
	at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
	at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
	at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
	... 19 more
2017-09-26 08:47:27,345 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph      - Job Job_at_09/26/2017_08:44:08 (74a0b9f0eab746705ad88817849e5c4b) switched from state RUNNING to FAILING.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:972)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:676)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:650)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:633)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:601)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:583)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:447)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4137)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1346)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:72)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.openIfNeeded(S3AInputStream.java:43)
	at io.grhodes.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:137)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:72)
	at org.apache.flink.api.common.io.DelimitedInputFormat.fillBuffer(DelimitedInputFormat.java:669)
	at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:490)
	at org.apache.flink.api.common.io.DelimitedInputFormat.open(DelimitedInputFormat.java:48)
	at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:145)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
	at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
	at com.amazonaws.http.conn.$Proxy16.get(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1115)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:964)
	... 19 more
{code}
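For context on what "Timeout waiting for connection from pool" means mechanically: the AWS SDK draws HTTP connections from a bounded pool, and a connection whose stream is never closed is never returned, so later requests time out waiting for a free slot. The following is a self-contained toy model of that failure mode; the pool class and timeout values are hypothetical stand-ins, not the SDK's actual internals.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy stand-in for a bounded HTTP connection pool: each lease holds a
// permit until it is closed; an unclosed lease permanently consumes a slot.
public class ConnectionPoolLeakDemo {
    private final Semaphore permits;

    ConnectionPoolLeakDemo(int poolSize) {
        this.permits = new Semaphore(poolSize);
    }

    // Lease a "connection", timing out when the pool is exhausted,
    // analogous to Apache HttpClient's ConnectionPoolTimeoutException.
    AutoCloseable lease(long timeoutMillis) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("Timeout waiting for connection from pool");
        }
        return permits::release; // close() returns the permit to the pool
    }

    public static void main(String[] args) throws Exception {
        ConnectionPoolLeakDemo pool = new ConnectionPoolLeakDemo(2);

        // Leak both connections: lease them and never call close().
        pool.lease(50);
        pool.lease(50);

        // The next lease finds no free slot and times out, which is the
        // shape of the failure in the log above.
        try {
            pool.lease(50);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real job the role of the unclosed lease is played by the S3A input stream that is never closed on the failure path, so every restart burns more pool slots until none remain.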

> HadoopFileSystem (with S3A) may leak connections, which cause job to stuck in a restarting loop
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-4660
>                 URL: https://issues.apache.org/jira/browse/FLINK-4660
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>            Reporter: Zhenzhong Xu
>            Priority: Critical
>         Attachments: Screen Shot 2016-09-20 at 2.49.14 PM.png, Screen Shot 2016-09-20 at 2.49.32 PM.png
>
>
> Flink jobs with checkpointing enabled and configured to use the S3A file system backend sometimes experience checkpointing failures due to S3 consistency issues. This behavior has also been reported by others and is documented in https://issues.apache.org/jira/browse/FLINK-4218.
> This problem is magnified by the current HadoopFileSystem implementation, which can leak S3 client connections and eventually put the job into a restarting loop with a “Timeout waiting for connection from pool” exception thrown from the AWS client.
> I looked at the code; it seems HadoopFileSystem.java never invokes close() on the fs object upon failure, but the FileSystem may be re-initialized every time the job is restarted.
> A few pieces of evidence I observed:
> 1. When I set the connection pool limit to 128, the command output below shows 128 connections stuck in the CLOSE_WAIT state.
> !Screen Shot 2016-09-20 at 2.49.14 PM.png|align=left, vspace=5!
> 2. Task manager logs indicate that the state backend file system is consistently re-initialized upon job restart.
> !Screen Shot 2016-09-20 at 2.49.32 PM.png!
> 3. Logs indicate an NPE during cleanup of the stream task, caused by the “Timeout waiting for connection from pool” exception when trying to create a directory in the S3 bucket.
> 2016-09-02 08:17:50,886 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during cleanup of stream task
> java.lang.NullPointerException
> at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.cleanup(OneInputStreamTask.java:73)
> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:323)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 4. It appears that, from invoking the checkpointing operation to handling the failure, StreamTask has no logic for closing the Hadoop FileSystem object (which internally holds the S3 AWS client object) that resides in HadoopFileSystem.java.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
