spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com>
Subject Re: Writing streaming data to cassandra creates duplicates
Date Thu, 30 Jul 2015 09:03:33 GMT
Hi,

Just my two cents. I understand your problem is that your problem is that
you have messages with the same key in two different dstreams. What I would
do would be making a union of all the dstreams with StreamingContext.union
or several calls to DStream.union, and then I would create a pair dstream
with the primary key as key, and then I'd use groupByKey or reduceByKey (or
combineByKey etc) to combine the messages with the same primary key.

Hope that helps.

Greetings,

Juan


2015-07-30 10:50 GMT+02:00 Priya Ch <learnings.chitturi@gmail.com>:

> Hi All,
>
>  Can someone throw insights on this ?
>
> On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch <learnings.chitturi@gmail.com>
> wrote:
>
>>
>>
>> Hi TD,
>>
>>  Thanks for the info. I have the scenario like this.
>>
>>  I am reading the data from kafka topic. Let's say kafka has 3 partitions
>> for the topic. In my streaming application, I would configure 3 receivers
>> with 1 thread each such that they would receive 3 dstreams (from 3
>> partitions of kafka topic) and also I implement partitioner. Now there is a
>> possibility of receiving messages with same primary key twice or more, one
>> is at the time message is created and other times if there is an update to
>> any fields for same message.
>>
>> If two messages M1 and M2 with same primary key are read by 2 receivers
>> then even the partitioner in spark would still end up in parallel
>> processing as there are altogether in different dstreams. How do we address
>> in this situation ?
>>
>> Thanks,
>> Padma Ch
>>
>> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <tdas@databricks.com>
>> wrote:
>>
>>> You have to partition that data on the Spark Streaming by the primary
>>> key, and then make sure insert data into Cassandra atomically per key, or
>>> per set of keys in the partition. You can use the combination of the (batch
>>> time, and partition Id) of the RDD inside foreachRDD as the unique id for
>>> the data you are inserting. This will guard against multiple attempts to
>>> run the task that inserts into Cassandra.
>>>
>>> See
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>>
>>> TD
>>>
>>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitturi@gmail.com
>>> > wrote:
>>>
>>>> Hi All,
>>>>
>>>>  I have a problem when writing streaming data to cassandra. Or existing
>>>> product is on Oracle DB in which while wrtiting data, locks are maintained
>>>> such that duplicates in the DB are avoided.
>>>>
>>>> But as spark has parallel processing architecture, if more than 1
>>>> thread is trying to write same data i.e with same primary key, is there as
>>>> any scope to created duplicates? If yes, how to address this problem either
>>>> from spark or from cassandra side ?
>>>>
>>>> Thanks,
>>>> Padma Ch
>>>>
>>>
>>>
>>
>>
>

Mime
View raw message