spark-user mailing list archives

From Rachana Srivastava <>
Subject How to manage offsets in Spark Structured Streaming?
Date Wed, 17 Jun 2020 16:00:29 GMT
Background: I have written a simple Spark Structured Streaming app to move data from Kafka
to S3. I found that, in order to support the exactly-once guarantee, Spark creates a _spark_metadata
folder. When the streaming app runs for a long time, this folder grows so large that we
start getting OOM errors. I want to get rid of the metadata and
checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming: I used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
to get the offsets in Spark Streaming. I want to know how to get the offsets and
other metadata so that we can manage checkpointing ourselves in Spark Structured Streaming. Do you
have a sample program that implements checkpointing?
How can we manage offsets in Spark Structured Streaming?
Looking at this JIRA, it looks like offsets
are not provided. How should we go about this?
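
To make the question concrete, here is a minimal sketch of the kind of bookkeeping I have in mind. It is plain Scala with no Spark dependency, and every name in it (KafkaRecordMeta, OffsetBookkeeping, nextOffsets, toStartingOffsetsJson) is hypothetical. It assumes each record's topic, partition and offset can be read per micro-batch (the Kafka source exposes columns with those names), and that the result would be stored somewhere outside the checkpoint folder.

```scala
// Hypothetical helper types and names; none of this is part of Spark's API.
final case class KafkaRecordMeta(topic: String, partition: Int, offset: Long)

object OffsetBookkeeping {

  // For every topic-partition, return the next offset to read:
  // the highest offset seen in the batch, plus one.
  def nextOffsets(records: Seq[KafkaRecordMeta]): Map[String, Map[Int, Long]] =
    records.groupBy(_.topic).map { case (topic, perTopic) =>
      topic -> perTopic.groupBy(_.partition).map { case (partition, perPartition) =>
        partition -> (perPartition.map(_.offset).max + 1L)
      }
    }

  private def quoted(s: String): String = "\"" + s + "\""

  // Render the offsets as JSON of the form {"topicA":{"0":24,"1":8}},
  // sorted by topic and partition so the output is deterministic.
  def toStartingOffsetsJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val body = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => quoted(partition.toString) + ":" + offset }
        .mkString(",")
      quoted(topic) + ":{" + body + "}"
    }.mkString("{", ",", "}")
}
```

The JSON shape is meant to match what the Kafka source's startingOffsets option accepts, so a stored value could in principle be passed back when a new query is started. As far as I understand, that option is only consulted when a query starts without an existing checkpoint, so this is a sketch of the idea rather than a replacement for Spark's own checkpointing.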
