flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tzu-Li (Gordon) Tai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-11046) ElasticSearch6Connector cause thread blocked when index failed with retry
Date Fri, 14 Dec 2018 11:54:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721314#comment-16721314
] 

Tzu-Li (Gordon) Tai commented on FLINK-11046:
---------------------------------------------

Ah ...., I think I understand the deadlock now.

It's because the we're calling the user provided {{ActionRequestFailureHandler.onFailure(...)}} 
inside the {{afterBulk}} callback. The lock is only released when the {{afterBulk}} method
returns.

So, the deadlock is:
1. The {{BulkProcessor}} flushes, and one of the document indexing failed, which invokes the user's {{ActionRequestFailureHandler.onFailure(...)}}.
At this point, the lock on {{BulkProcessor}} isn't released yet, because the {{onFailure}}
call is part of the bulk processor's flush callback.
2. Within {{ActionRequestFailureHandler.onFailure(...)}}, in your case you added some new
documents to be indexed. Upon adding, the {{BulkProcessor}} would try to flush again, but
the lock wasn't released yet and therefore deadlock.

So, the re-indexing thread (i.e. the async callback) should have been blocked on:
[https://github.com/elastic/elasticsearch/blob/v6.3.1/server/src/main/java/org/elasticsearch/action/bulk/BulkRequestHandler.java#L60]

While the main task thread should have been blocked on:
[https://github.com/elastic/elasticsearch/blob/v6.3.1/server/src/main/java/org/elasticsearch/action/bulk/BulkRequestHandler.java#L86]

 

Could you confirm this and see if the analysis makes sense to you?
[~luoguohao]

> ElasticSearch6Connector cause thread blocked when index failed with retry
> -------------------------------------------------------------------------
>
>                 Key: FLINK-11046
>                 URL: https://issues.apache.org/jira/browse/FLINK-11046
>             Project: Flink
>          Issue Type: Bug
>          Components: ElasticSearch Connector
>    Affects Versions: 1.6.2
>            Reporter: luoguohao
>            Priority: Major
>
> When i'm using es6 sink to index into es, bulk process with some exception catched,
and  i trying to reindex the document with the call `indexer.add(action)` in the `ActionRequestFailureHandler.onFailure()`
method, but things goes incorrect. The call thread stuck there, and with the thread dump,
i saw the `bulkprocessor` object was locked by other thread. 
> {code:java}
> public interface ActionRequestFailureHandler extends Serializable {
>  void onFailure(ActionRequest action, Throwable failure, int restStatusCode, RequestIndexer
indexer) throws Throwable;
> }
> {code}
> After i read the code implemented in the `indexer.add(action)`, i find that `synchronized`
is needed on each add operation.
> {code:java}
> private synchronized void internalAdd(DocWriteRequest request, @Nullable Object payload)
{
>   ensureOpen();
>   bulkRequest.add(request, payload);
>   executeIfNeeded();
> }
> {code}
> And, at i also noticed that `bulkprocessor` object would also locked in the bulk process
thread. 
> the bulk process operation is in the following code:
> {code:java}
> public void execute(BulkRequest bulkRequest, long executionId) {
>     Runnable toRelease = () -> {};
>     boolean bulkRequestSetupSuccessful = false;
>     try {
>         listener.beforeBulk(executionId, bulkRequest);
>         semaphore.acquire();
>         toRelease = semaphore::release;
>         CountDownLatch latch = new CountDownLatch(1);
>         retry.withBackoff(consumer, bulkRequest, new ActionListener<BulkResponse>()
{
>             @Override
>             public void onResponse(BulkResponse response) {
>                 try {
>                     listener.afterBulk(executionId, bulkRequest, response);
>                 } finally {
>                     semaphore.release();
>                     latch.countDown();
>                 }
>             }
>             @Override
>             public void onFailure(Exception e) {
>                 try {
>                     listener.afterBulk(executionId, bulkRequest, e);
>                 } finally {
>                     semaphore.release();
>                     latch.countDown();
>                 }
>             }
>         }, Settings.EMPTY);
>         bulkRequestSetupSuccessful = true;
>        if (concurrentRequests == 0) {
>            latch.await();
>         }
>     } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         logger.info(() -> new ParameterizedMessage("Bulk request {} has been cancelled.",
executionId), e);
>         listener.afterBulk(executionId, bulkRequest, e);
>     } catch (Exception e) {
>         logger.warn(() -> new ParameterizedMessage("Failed to execute bulk request
{}.", executionId), e);
>         listener.afterBulk(executionId, bulkRequest, e);
>     } finally {
>         if (bulkRequestSetupSuccessful == false) {  // if we fail on client.bulk() release
the semaphore
>             toRelease.run();
>         }
>     }
> }
> {code}
> As the read line i marked above, i think, that's the reason why the retry operation thread
was block, because the the bulk process thread never release the lock on `bulkprocessor`. 
and, i also trying to figure out why the field `concurrentRequests` was set to zero. And i
saw the the initialize for bulkprocessor in class `ElasticsearchSinkBase`:
> {code:java}
> protected BulkProcessor buildBulkProcessor(BulkProcessor.Listener listener) {
>  ...
>  BulkProcessor.Builder bulkProcessorBuilder =      callBridge.createBulkProcessorBuilder(client,
listener);
>  // This makes flush() blocking
>  bulkProcessorBuilder.setConcurrentRequests(0);
>  
>  ...
>  return bulkProcessorBuilder.build();
> }
> {code}
>  this field value was set to zero explicitly. So, all things seems to make sense, but
i still wonder why the retry operation is not in the same thread as the bulk process execution,
after i read the code, `bulkAsync` method might be the last puzzle.
> {code:java}
> @Override
> public BulkProcessor.Builder createBulkProcessorBuilder(RestHighLevelClient client, BulkProcessor.Listener
listener) {
>  return BulkProcessor.builder(client::bulkAsync, listener);
> }
> {code}
> So, I hope someone can help to fix this problem, or given some suggestions, and also i
can make a try to take it. 
>  Thanks a lot !



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message