spark-dev mailing list archives

From "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Subject Re: Datasource API V2 and checkpointing
Date Fri, 27 Apr 2018 16:30:02 GMT
Thanks Joseph!

From: Joseph Torres <joseph.torres@databricks.com>
Date: Friday, April 27, 2018 at 11:23 AM
To: "Thakrar, Jayesh" <jthakrar@conversantmedia.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Datasource API V2 and checkpointing

The precise interactions with the DataSourceV2 API haven't yet been hammered out in design.
But much of this comes down to the core of Structured Streaming rather than the API details.

The execution engine handles checkpointing and recovery. It asks the streaming data source
for offsets, and then determines that batch N contains the data between offset A and offset
B. On recovery, if batch N needs to be re-run, the execution engine just asks the source for
the same offset range again. Sources also get a handle to their own subfolder of the checkpoint,
which they can use as scratch space if they need it. For example, Spark's FileStreamReader keeps
a log of all the files it has seen, so its offsets can simply be indices into that log rather
than huge strings containing all the paths.
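
To make the division of labor concrete, here is a rough, self-contained sketch of that protocol.
Every name in it (StringSource, offsetLog, getBatch, latestOffset) is made up for illustration;
these are not the actual DataSourceV2 interfaces, which were still being designed at the time.
The point is only that the engine owns the offset log and batch boundaries, while the source's
obligation is to return the same data whenever it is asked for the same offset range.

object CheckpointSketch {

  // A toy source that deterministically generates one string per offset.
  // Determinism over a given offset range is what makes batch replay possible.
  class StringSource {
    def latestOffset(currentMax: Long): Long = currentMax + 10          // pretend new data arrived
    def getBatch(start: Long, end: Long): Seq[String] =                 // same range => same data
      ((start + 1) to end).map(i => s"record-$i")
  }

  def main(args: Array[String]): Unit = {
    val source = new StringSource
    // The engine, not the source, owns the offset log and decides batch boundaries.
    var offsetLog = Map.empty[Long, (Long, Long)]                       // batchId -> (start, end]

    // Batch 0: the engine picks a range and records it before running the batch.
    val end0 = source.latestOffset(0L)
    offsetLog += 0L -> (0L, end0)
    println(s"batch 0: ${source.getBatch(0L, end0).size} rows")

    // Simulated restart: batch 0 was logged but never committed, so the engine
    // simply asks the source for the exact same offset range again.
    val (start, end) = offsetLog(0L)
    println(s"re-run of batch 0: ${source.getBatch(start, end).size} rows")
  }
}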

SPARK-23323 is orthogonal. That commit coordinator is responsible for ensuring that, within
a single Spark job, two different tasks can't commit the same partition.
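
To be clear about what that coordinator does and does not do, here is a hypothetical sketch of
the rule it enforces. This is only an illustration, not the actual code behind SPARK-23323.

// Sketch of "at most one task attempt may commit a partition".
class CommitCoordinatorSketch {
  // partitionId -> task attempt that won the right to commit
  private val authorized = scala.collection.mutable.Map.empty[Int, Long]

  // The first attempt to ask for a partition may commit; later attempts
  // (retries, speculative copies) for the same partition are denied.
  def canCommit(partitionId: Int, taskAttemptId: Long): Boolean = synchronized {
    authorized.get(partitionId) match {
      case Some(winner) => winner == taskAttemptId
      case None         => authorized(partitionId) = taskAttemptId; true
    }
  }
}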

On Fri, Apr 27, 2018 at 8:53 AM, Thakrar, Jayesh <jthakrar@conversantmedia.com> wrote:
Wondering if this issue is related to SPARK-23323?

Any pointers will be greatly appreciated….

Thanks,
Jayesh

From: "Thakrar, Jayesh" <jthakrar@conversantmedia.com<mailto:jthakrar@conversantmedia.com>>
Date: Monday, April 23, 2018 at 9:49 PM
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" <dev@spark.apache.org<mailto:dev@spark.apache.org>>
Subject: Datasource API V2 and checkpointing

I was wondering: when checkpointing is enabled, who does the actual work?
The streaming datasource, or the execution engine/driver?

I have written a small/trivial datasource that just generates strings.
After enabling checkpointing, I do see a folder being created under the checkpoint folder,
but there's nothing else in there.
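
For context, checkpointing was enabled roughly like this (the format name below is just a
placeholder for my test source, and the console sink is only for debugging):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-test").getOrCreate()

val query = spark.readStream
  .format("my.trivial.stringsource")                         // placeholder for the custom source
  .load()
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/string-source-chk")    // Spark-managed checkpoint directory
  .start()

query.awaitTermination()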

Same question for write-ahead logging and recovery.
And on a restart from a failed streaming session, who should set the offsets?
The driver/Spark or the datasource?

Any pointers to design docs would also be greatly appreciated.

Thanks,
Jayesh

