flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maximilian Michels (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1636) Misleading exception during concurrent partition release and remote request
Date Tue, 21 Apr 2015 13:02:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504908#comment-14504908
] 

Maximilian Michels commented on FLINK-1636:
-------------------------------------------

The error above is based on an older code base (e.g. 9d7acf3657cbd3fb0b238b20ba864b6a74774e40).
Some work has been done on the RemoteInputChannel. The problem should still persists though.
On the master (e2a00183eb539889d7a5053c49b2a296de79add0) this currently looks similar:

{code}
if (expectedSequenceNumber == sequenceNumber) {
						receivedBuffers.add(buffer);
						expectedSequenceNumber++;

						notifyAvailableBuffer();

						success = true;
					}
					else {
						onError(new BufferReorderingException(expectedSequenceNumber, sequenceNumber));
					}
{code}

This error only gets thrown when the sequence number of the buffer doesn't match. However,
your description states a different problem, i.e. the buffer has already been released on
the remote side. That should be detected by other means than the sequence number (e.g. by
an identifier). So do we need an additional check in the code or is the error cause you described
not applicable here?

> Misleading exception during concurrent partition release and remote request
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-1636
>                 URL: https://issues.apache.org/jira/browse/FLINK-1636
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Priority: Minor
>
> When a result partition is released concurrently with a remote partition request, the
request might come in late and result in an exception at the receiving task saying:
> {code}
> 16:04:22,499 INFO  org.apache.flink.runtime.taskmanager.Task                     - CHAIN
Partition -> Map (Map at testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (1/4)
switched to FAILED : java.io.IOException: org.apache.flink.runtime.io.network.partition.queue.IllegalQueueIteratorRequestException
at remote input channel: Intermediate result partition has already been released.].
> 	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.checkIoError(RemoteInputChannel.java:223)
> 	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:103)
> 	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:310)
> 	at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
> 	at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
> 	at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
> 	at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
> 	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
> 	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> 	at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message