spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun Mahadevan <ar...@apache.org>
Subject Re: use rocksdb for spark structured streaming (SSS)
Date Sun, 10 Mar 2019 19:51:16 GMT
Read the link carefully,

This solution is available (*only*) in Databricks Runtime.

You can enable RockDB-based state management by setting the following
configuration in the SparkSession before starting the streaming query.

spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")


On Sun, 10 Mar 2019 at 11:54, Lian Jiang <jiangok2006@gmail.com> wrote:

> Hi,
>
> I have a very simple SSS pipeline which does:
>
> val query = df
>   .dropDuplicates(Array("Id", "receivedAt"))
>   .withColumn(timePartitionCol, timestamp_udfnc(col("receivedAt")))
>   .writeStream
>   .format("parquet")
>   .partitionBy("availabilityDomain", timePartitionCol)
>   .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
>   .option("path", "/data")
>   .option("checkpointLocation", "/data_checkpoint")
>   .start()
>
> After ingesting 2T records, the state under checkpoint folder on HDFS (replicator factor
2) grows to 2T bytes.
> My cluster has only 2T bytes which means the cluster can barely handle further data growth.
>
> Online spark documents (https://docs.databricks.com/spark/latest/structured-streaming/production.html)
> says using rocksdb help SSS job reduce JVM memory overhead. But I cannot find any document
how
>
> to setup rocksdb for SSS. Spark class CheckpointReader seems to only handle HDFS.
>
> Any suggestions? Thanks!
>
>
>
>

Mime
View raw message