spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Writing streaming data to cassandra creates duplicates
Date Tue, 28 Jul 2015 06:42:22 GMT
You have to partition the data in Spark Streaming by the primary key, and
then make sure the data is inserted into Cassandra atomically per key, or
per set of keys in the partition. You can use the combination of (batch
time, partition id) of the RDD inside foreachRDD as the unique id for the
data you are inserting. This guards against multiple attempts to run the
task that inserts into Cassandra.
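
A minimal sketch of that idea, in plain Python so it runs without a cluster. Everything named here is hypothetical: FakeStore is an in-memory stand-in for Cassandra (not a real driver API), and partition_by_key and write_partition are illustrative helpers. In a real job the batch time would come from the foreachRDD callback and the partition id from the running task.

```python
import zlib
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (primary key, value)


def partition_by_key(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Route every record with the same primary key to the same partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is deterministic across processes, unlike Python's built-in str hash.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


class FakeStore:
    """Hypothetical in-memory stand-in for a table plus a log of committed write ids."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.committed: Set[Tuple[int, int]] = set()
        self.writes_applied = 0

    def commit_once(self, write_id: Tuple[int, int], records: List[Record]) -> bool:
        """Apply the records only if write_id was not committed before."""
        if write_id in self.committed:
            return False  # a retried or speculative task: skip the duplicate write
        for key, value in records:
            self.rows[key] = value
        self.committed.add(write_id)
        self.writes_applied += 1
        return True


def write_partition(store: FakeStore, batch_time_ms: int, partition_id: int,
                    records: Iterable[Record]) -> bool:
    """Write one partition of one batch, identified by (batch time, partition id)."""
    write_id = (batch_time_ms, partition_id)
    return store.commit_once(write_id, list(records))
```

Calling write_partition twice with the same batch time and partition id applies the data once; the second call is a no-op. Against a real store, the check and the write would have to happen atomically, which this single-threaded sketch does not demonstrate.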

See
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

TD

On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitturi@gmail.com>
wrote:

> Hi All,
>
>  I have a problem when writing streaming data to Cassandra. Our existing
> product is on Oracle DB, in which locks are maintained while writing data
> so that duplicates in the DB are avoided.
>
> But as Spark has a parallel processing architecture, if more than one thread
> is trying to write the same data, i.e. with the same primary key, is there
> any scope for creating duplicates? If yes, how can this problem be addressed,
> either from the Spark side or from the Cassandra side?
>
> Thanks,
> Padma Ch
>