I am wondering how the direct stream API ensures end-to-end exactly-once semantics. I think there are two things involved:
1. On the Spark Streaming side, the driver will replay the offset ranges when it goes down and is restarted, which means the new tasks may process some already-processed data.
2. On the user side, since tasks may reprocess data, the output logic should detect that some data has already been processed, e.g. by using a unique ID.
Not sure if I have understood correctly.
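For point 2, here is a minimal sketch of the idempotent-output idea, assuming a stream created with the 1.3 direct API; the upsert sink is hypothetical:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical idempotent sink: writing the same sliceId twice must
    // overwrite (or be a no-op) in the target store.
    def upsert(sliceId: String, records: Iterator[(String, String)]): Unit = ???

    stream.foreachRDD { rdd =>
      // The direct API exposes the exact Kafka offsets behind each batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        // (topic, partition, fromOffset) uniquely identifies this slice, so a
        // replay after a driver restart overwrites instead of duplicating.
        upsert(s"${o.topic}-${o.partition}-${o.fromOffset}", iter)
      }
    }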
>>not being able to read from Kafka using multiple nodes
> Kafka is plenty capable of doing this..
I faced the same issue before Spark 1.3 was released.
The issue was not with Kafka, but with Spark Streaming's Kafka connector. Before the Spark 1.3.0 release, one Spark worker would receive all the streamed messages, and we had to repartition to distribute the processing.
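That workaround looked roughly like this (a sketch assuming a StreamingContext ssc; the ZooKeeper quorum, group id, and topic name are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Pre-1.3 receiver-based stream: a single receiver lands every message on
    // one worker, so we repartition before doing any heavy processing.
    val messages = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
    val distributed = messages.repartition(ssc.sparkContext.defaultParallelism)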
From the Spark 1.3.0 release onward, the Spark Direct API for Kafka supports parallel reads from Kafka streamed to the Spark workers. See “Approach 2: Direct Approach” on this page: http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html. Note that it also mentions zero data loss and exactly-once semantics for the Kafka integration.
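For reference, creating such a direct stream per that page looks roughly like this (assuming a StreamingContext ssc; the broker list and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // No receiver: each Kafka partition becomes one Spark partition, so all
    // workers pull from Kafka in parallel, and offsets are tracked by Spark
    // itself rather than by ZooKeeper.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))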
From: Jordan Pilat [mailto:email@example.com]
Sent: 18 June 2015 03:57
To: Enno Shioji
Cc: Will Briggs; firstname.lastname@example.org; ayan guha; user; Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
Subject: Re: Spark or Storm
>not being able to read from Kafka using multiple nodes
Kafka is plenty capable of doing this, by clustering together multiple consumer instances into a consumer group.
If your topic is sufficiently partitioned, the consumer group can consume it in parallel.
If it isn't, you still get the fault tolerance that comes from clustering the consumers.
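For illustration, a minimal consumer-group sketch using the (post-0.9) Kafka consumer API; the broker address, group id, and topic are made up. Every process started with the same group.id joins the group, and Kafka assigns each topic partition to exactly one member:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "my-consumer-group") // same id on every node
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("my-topic"))
    while (true) {
      // Each member only sees records from the partitions assigned to it.
      val records = consumer.poll(100)
      for (record <- records.asScala)
        println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
    }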
On Jun 17, 2015 1:27 AM, "Enno Shioji" <email@example.com> wrote:
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (the receiver rate limit can alleviate this up to a point, but it's far from ideal)
There is also no exactly-once guarantee. (updateStateByKey can achieve these semantics, but it is not practical if you have any significant amount of state, because it dumps the entire state on every checkpoint; see the sketch after this list.)
There are also some minor drawbacks that I'm sure will be fixed quickly, like no task timeout, not being able to read from Kafka using multiple nodes, and a data-loss hazard with Kafka.
It's also not possible to attain very low latency in Spark, if that's what you need.
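To make the two partial mitigations above concrete, a sketch; the checkpoint path, socket source, and port are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("StateDemo")
      // Rate limit per receiver (records/sec): alleviates, but does not
      // solve, the missing back pressure.
      .set("spark.streaming.receiver.maxRate", "10000")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/checkpoints") // updateStateByKey requires checkpointing

    val words = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))

    // Effectively exactly-once counts, but Spark rewrites the entire state
    // on every checkpoint, which is why this doesn't scale to large state.
    val counts = words.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum)
    }
    counts.print()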
The plus for Spark is the concise and IMO more intuitive syntax, especially if you compare it with Storm's Java API.
I admit I might be a bit biased towards Storm, though, as I'm more familiar with it.
Also, you can do some processing with Kinesis. If all you need is a straightforward transformation and you are reading from Kinesis to begin with, it might be easier to just do the transformation in Kinesis.
On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan <firstname.lastname@example.org> wrote:
Whatever you write in bolts is the logic you want to apply to your events. In Spark, that logic would be coded in map() or similar transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic the way Storm does.
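To make that concrete, a tiny sketch (the Event fields are made up) where each transformation carries the logic a separate bolt would hold in Storm:

    import org.apache.spark.{SparkConf, SparkContext}

    case class Event(userId: String, amount: Double)

    val sc = new SparkContext(new SparkConf().setAppName("MapDemo").setMaster("local[*]"))
    val events = sc.parallelize(Seq(Event("a", 5.0), Event("b", -1.0), Event("a", 3.0)))

    val totals = events
      .map(e => (e.userId, e.amount))      // per-event projection (one bolt in Storm)
      .filter { case (_, amt) => amt > 0 } // validation (another bolt)
      .reduceByKey(_ + _)                  // aggregation (a third bolt)

    totals.collect().foreach(println)      // prints (a,8.0)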