spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Bär <nicolas.b...@gmail.com>
Subject Re: KafkaInputDStream mapping of partitions to tasks
Date Sat, 29 Mar 2014 19:22:34 GMT
Hi

Is there any workaround to this problem?

I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of
Kafka and handle the partition assignment manually. The easiest setup in
this case would be to bind the number of parallel jobs to the number of
partitions in Kafka. This is basically what Samza [2] does. I have a few
questions regarding this implementation:

The getReceiver method looks like a good starting point to implement the
manual partition assignment. Unfortunately it lacks documentation. As far
as I understood from the API, the getReceiver method is called once and
passes the received object (implementing the NetworkReceiver contract) to
the worker nodes. Therefore the logic to assign partitions to each receiver
has to be implemented within the receiver itself. I'm planning to implement
the following and have some questions in this regard:
1. within getReceiver: setup a zookeeper queue with the assignment of
partitions to parallel jobs and store the number of consumers needed
- Is the number of parallel jobs accessible through ssc?
2. within onStart: poll the zookeeper queue and receive the partition
number(s) to receive data from
3. within onStart: start a new thread for keepalive messages to zookeeper.
In case a node goes down and a new receiver is started up again, the new
receiver can lookup zookeper to find the consumer without recent keepalives
and take it's place. Due to SPARK-1340 this might not even be possible yet.
Is there any insight on why the receiver is not started again?

Do you think the use of zookeeper in this scenario is a good approach or is
there an easier way using an integrated spark functionality? Also, by using
the SimpleConsumer API one has to manage the offset per consumer. The high
level consumer API solves this problem with zookeeper. I'd take the same
approach, if there's no easier way to handle this in spark.


Best
Nicolas

[1]
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
[2]
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html


On Fri, Mar 28, 2014 at 2:36 PM, Evgeniy Shishkin <itparanoia@gmail.com>wrote:

> One more question,
>
> we are using memory_and_disk_ser_2
> and i worried when those rdds on disk will be removed
> http://i.imgur.com/dbq5T6i.png
>
> unpersist is set to true, and rdds get purged from memory, but disk space
> just keep growing.
>
> On 28 Mar 2014, at 01:32, Tathagata Das <tathagata.das1565@gmail.com>
> wrote:
>
> > Yes, no one has reported this issue before. I just opened a JIRA on what
> I think is the main problem here
> > https://spark-project.atlassian.net/browse/SPARK-1340
> > Some of the receivers dont get restarted.
> > I have a bunch refactoring in the NetworkReceiver ready to be posted as
> a PR that should fix this.
> >
> > Regarding the second problem, I have been thinking of adding flow
> control (i.e. limiting the rate of receiving) for a while, just havent
> gotten around to it.
> > I added another JIRA for that for tracking this issue.
> > https://spark-project.atlassian.net/browse/SPARK-1341
> >
> >
> > TD
> >
> >
> > On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparanoia@gmail.com>
> wrote:
> >
> > On 28 Mar 2014, at 01:11, Scott Clasen <scott.clasen@gmail.com> wrote:
> >
> > > Evgeniy Shishkin wrote
> > >> So, at the bottom -- kafka input stream just does not work.
> > >
> > >
> > > That was the conclusion I was coming to as well.  Are there open
> tickets
> > > around fixing this up?
> > >
> >
> > I am not aware of such. Actually nobody complained on spark+kafka before.
> > So i thought it just works, and then we tried to build something on it
> and almost failed.
> >
> > I think that it is possible to steal/replicate how twitter storm works
> with kafka.
> > They do manual partition assignment, at least this would help to balance
> load.
> >
> > There is another issue.
> > ssc batch creates new rdds every batch duration, always, even it
> previous computation did not finish.
> >
> > But with kafka, we can consume more rdds later, after we finish previous
> rdds.
> > That way it would be much much simpler to not get OOM'ed when starting
> from beginning,
> > because we can consume many data from kafka during batch duration and
> then get oom.
> >
> > But we just can not start slow, can not limit how many to consume during
> batch.
> >
> >
> > >
> > > --
> > > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3379.html
> > > Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >
> >
>
>

Mime
View raw message