spark-user mailing list archives

From Priya Ch <learnings.chitt...@gmail.com>
Subject Re: Writing streaming data to cassandra creates duplicates
Date Thu, 30 Jul 2015 08:50:38 GMT
Hi All,

 Can someone share insights on this?

On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch <learnings.chitturi@gmail.com>
wrote:

>
>
> Hi TD,
>
>  Thanks for the info. My scenario is as follows.
>
>  I am reading data from a Kafka topic. Let's say Kafka has 3 partitions
> for the topic. In my streaming application, I would configure 3 receivers
> with 1 thread each, so that they receive 3 DStreams (from the 3 partitions
> of the Kafka topic), and I also implement a partitioner. Now there is a
> possibility of receiving messages with the same primary key twice or more:
> once when the message is created, and again whenever any field of the same
> message is updated.
>
> If two messages M1 and M2 with the same primary key are read by 2
> receivers, then even with the partitioner in Spark they would still end up
> being processed in parallel, as they are altogether in different DStreams.
> How do we address this situation?
>
> Thanks,
> Padma Ch
>
> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <tdas@databricks.com>
> wrote:
>
>> You have to partition the data in Spark Streaming by the primary key,
>> and then make sure you insert the data into Cassandra atomically per key,
>> or per set of keys in the partition. You can use the combination of the
>> (batch time, partition id) of the RDD inside foreachRDD as the unique id
>> for the data you are inserting. This will guard against multiple attempts
>> to run the task that inserts into Cassandra.
>>
>> See
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>
>> TD
>>
>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitturi@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>>  I have a problem when writing streaming data to Cassandra. Our
>>> existing product is on an Oracle DB, in which locks are maintained while
>>> writing data so that duplicates in the DB are avoided.
>>>
>>> But as Spark has a parallel processing architecture, if more than 1
>>> thread is trying to write the same data, i.e. with the same primary key,
>>> is there any scope to create duplicates? If yes, how do we address this
>>> problem, either from the Spark side or from the Cassandra side?
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>
>>
>
>
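TD's suggestion above can be sketched in miniature. This is a plain-Python illustration (no Spark or Cassandra required) of the idempotent-write idea: rows are keyed by primary key, as in a Cassandra upsert, and a deterministic id derived from (batch time, partition id) is stored alongside each row, so a retried task attempt replays the same write instead of creating a duplicate. The names `fake_table` and `write_partition` are hypothetical stand-ins, not part of any Spark or Cassandra API.

```python
# fake_table stands in for a Cassandra table keyed by primary key:
# primary_key -> (value, write_id)
fake_table = {}

def write_partition(batch_time, partition_id, records):
    """Write one RDD partition's records idempotently.

    The same (batch_time, partition_id) attempt can run any number of
    times without creating duplicates, because rows are keyed by primary
    key (an upsert, as Cassandra does on INSERT).
    """
    # Unique per task, but stable across retries of that task.
    write_id = (batch_time, partition_id)
    for pk, value in records:
        fake_table[pk] = (value, write_id)

# First attempt of the task for batch 1000, partition 0.
write_partition(1000, 0, [("M1", "created"), ("M2", "created")])
# Spark retries the same task (e.g. after a failure) and replays the
# same data: the upsert overwrites rather than duplicates.
write_partition(1000, 0, [("M1", "created"), ("M2", "created")])
# A later batch carries an update to M1; same key, so still one row.
write_partition(2000, 1, [("M1", "updated")])

assert len(fake_table) == 2  # no duplicates despite retry and update
```

Note this only makes the writes idempotent per key; if the application must also reconcile a "create" and a later "update" for the same key arriving in different DStreams, the data still has to be repartitioned by primary key first so both land in the same task.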
