spark-user mailing list archives

From Evgeny Shishkin <itparan...@gmail.com>
Subject Re: KafkaInputDStream mapping of partitions to tasks
Date Thu, 27 Mar 2014 22:55:45 GMT

On 28 Mar 2014, at 01:38, Evgeny Shishkin <itparanoia@gmail.com> wrote:

> 
> On 28 Mar 2014, at 01:32, Tathagata Das <tathagata.das1565@gmail.com> wrote:
> 
>> Yes, no one has reported this issue before. I just opened a JIRA on what I think is the main problem here
>> https://spark-project.atlassian.net/browse/SPARK-1340
>> Some of the receivers don't get restarted.
>> I have a bunch of refactorings in the NetworkReceiver ready to be posted as a PR that should fix this.
>> 

Regarding this JIRA:
by default Spark commits offsets to ZooKeeper every so many seconds.
Even if you fix the reconnect to Kafka, we do not know from which offsets it will begin to consume.
So it would not recompute the RDD as it should; it will receive arbitrary data, from the past
or from the future.
With the high-level consumer we just do not have control over this.

The high-level consumer should not be used in production with Spark. Period.
Spark should use the low-level consumer and control offsets and partition assignment deterministically.

Like Storm does.
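
To make that concrete, here is a minimal sketch against the Kafka 0.8 low-level (SimpleConsumer) API; the broker, topic, and offset below are made up. The point is that the consumer, not ZooKeeper, owns the offset, so a recomputed RDD can re-read exactly the same range:

import kafka.api.FetchRequestBuilder
import kafka.consumer.SimpleConsumer

// host, port, soTimeout (ms), bufferSize (bytes), clientId
val consumer = new SimpleConsumer("broker-1", 9092, 100000, 64 * 1024, "spark-fetch")

var offset = 0L // would come from the driver's own checkpoint, not from ZK auto-commit
val request = new FetchRequestBuilder()
  .clientId("spark-fetch")
  .addFetch("events", 0, offset, 1024 * 1024) // topic, partition, offset, maxBytes
  .build()

for (msgAndOffset <- consumer.fetch(request).messageSet("events", 0)) {
  // ... hand msgAndOffset.message.payload to processing ...
  offset = msgAndOffset.nextOffset // advance deterministically; store alongside the RDD
}
consumer.close()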

>> Regarding the second problem, I have been thinking of adding flow control (i.e. limiting the rate of receiving) for a while, just haven't gotten around to it.
>> I added another JIRA for tracking this issue.
>> https://spark-project.atlassian.net/browse/SPARK-1341
>> 
>> 

I think if we fix the Kafka input as above, we can control such a window automatically,
like a TCP window with slow start and so on.
But it would be great to have some fix available now anyway.
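
Even a dumb controller over the per-batch limit would do as a first cut. A hypothetical sketch; none of these names exist in Spark:

// Hypothetical "TCP-like" window over how many messages a receiver may
// pull per batch: double while batches finish on time (slow start),
// halve when a batch overruns its duration.
class IngestWindow(private var limit: Long, val maxLimit: Long) {
  def currentLimit: Long = limit
  def onBatchOnTime(): Unit = limit = math.min(limit * 2, maxLimit)
  def onBatchOverrun(): Unit = limit = math.max(limit / 2, 1L)
}

The receiver would consult currentLimit before each fetch, and the scheduler would feed back whether the previous batch completed within its duration.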


> 
> Thank you, I will participate and can provide testing of the new code.
> Sorry for the caps lock, I just debugged this whole day, literally.
> 
> 
>> TD
>> 
>> 
>> On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparanoia@gmail.com> wrote:
>> 
>> On 28 Mar 2014, at 01:11, Scott Clasen <scott.clasen@gmail.com> wrote:
>> 
>> > Evgeniy Shishkin wrote
>> >> So, at the bottom — kafka input stream just does not work.
>> >
>> >
>> > That was the conclusion I was coming to as well.  Are there open tickets
>> > around fixing this up?
>> >
>> 
>> I am not aware of such. Actually nobody complained about Spark+Kafka before.
>> So I thought it just works; then we tried to build something on it and almost failed.
>> 
>> I think it is possible to steal/replicate how Twitter Storm works with Kafka.
>> They do manual partition assignment; at least this would help to balance the load.
>> 
>> There is another issue.
>> ssc creates new RDDs every batch duration, always, even if the previous computation did not finish.
>> 
>> But with Kafka, we could consume more RDDs later, after we finish the previous RDDs.
>> That way it would be much, much simpler to avoid getting OOM'ed when starting from the beginning,
>> because we can consume a lot of data from Kafka during one batch duration and then get an OOM.
>> 
>> But we just cannot start slow, cannot limit how much to consume during a batch.
>> 
>> 
>> >
>> 
>> 
> 
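
For reference, the manual partition assignment mentioned in the quoted message could be as trivial as this (a hypothetical helper, not Spark or Storm code); each receiver gets a fixed, reproducible subset of Kafka partitions:

// Hypothetical helper: deterministically map Kafka partitions onto
// receivers, so a restarted receiver always picks up the same partitions.
def assignPartitions(numPartitions: Int, numReceivers: Int): Map[Int, Seq[Int]] =
  (0 until numPartitions).groupBy(_ % numReceivers) // receiverId -> partition ids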

