spark-user mailing list archives

From Aries <aries.ko...@gmail.com>
Subject Re: KafkaInputDStream mapping of partitions to tasks
Date Mon, 05 May 2014 02:03:09 GMT
Hi,
Has anyone worked on this?

Best
Aries

On 30 Mar 2014, at 3:22, Nicolas Bär <nicolas.baer@gmail.com> wrote:

> Hi
> 
> Is there any workaround to this problem?
> 
> I'm trying to implement a KafkaReceiver using the SimpleConsumer API [1] of Kafka and
> handle the partition assignment manually. The easiest setup in this case would be to
> bind the number of parallel jobs to the number of partitions in Kafka. This is basically
> what Samza [2] does. I have a few questions regarding this implementation:
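> 
> (As an aside, the "one parallel job per partition" setup with the existing high level
> consumer would look roughly like the following; the zookeeper quorum, group and topic
> names are placeholders. With the SimpleConsumer API this mapping has to be managed by hand.)
> 
>     import org.apache.spark.streaming.kafka.KafkaUtils
> 
>     val numPartitions = 4   // assume the topic has 4 kafka partitions
>     val kafkaStreams = (1 to numPartitions).map { _ =>
>       KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
>     }
>     val unified = ssc.union(kafkaStreams)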
> 
> The getReceiver method looks like a good starting point to implement the manual partition
> assignment. Unfortunately it lacks documentation. As far as I understood from the API, the
> getReceiver method is called once and passes the receiver object (implementing the
> NetworkReceiver contract) to the worker nodes. Therefore the logic to assign partitions to
> each receiver has to be implemented within the receiver itself. I'm planning to implement
> the following (sketched below the list) and have some questions in this regard:
> 1. within getReceiver: set up a zookeeper queue with the assignment of partitions to
> parallel jobs and store the number of consumers needed
> - Is the number of parallel jobs accessible through ssc?
> 2. within onStart: poll the zookeeper queue and receive the partition number(s) to
> receive data from
> 3. within onStart: start a new thread for keepalive messages to zookeeper. In case a
> node goes down and a new receiver is started up again, the new receiver can look up
> zookeeper to find the consumer without recent keepalives and take its place. Due to
> SPARK-1340 this might not even be possible yet. Is there any insight on why the receiver
> is not restarted?
> 
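> One way to do the partition claiming from steps 1 and 2 is an ephemeral znode per
> partition, which also covers the keepalive/takeover part of step 3. A rough sketch
> (paths and names are made up, and this is plain zookeeper client code, not the actual
> Spark receiver API):
> 
>     import org.apache.zookeeper.{CreateMode, ZooKeeper}
>     import org.apache.zookeeper.ZooDefs.Ids
>     import org.apache.zookeeper.KeeperException.NodeExistsException
> 
>     // Claim the first unowned partition by creating an ephemeral znode for it.
>     // The ephemeral node also acts as the keepalive: it disappears when the
>     // receiver's zookeeper session dies, so a restarted receiver can re-claim it.
>     // Assumes the parent path /consumers/<topic>/owners already exists.
>     def claimPartition(zk: ZooKeeper, topic: String, numPartitions: Int,
>                        consumerId: String): Option[Int] =
>       (0 until numPartitions).find { p =>
>         try {
>           zk.create(s"/consumers/$topic/owners/$p", consumerId.getBytes("UTF-8"),
>                     Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
>           true                                  // claimed partition p
>         } catch {
>           case _: NodeExistsException => false  // already owned by another receiver
>         }
>       }
> 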
> Do you think the use of zookeeper in this scenario is a good approach, or is there an
> easier way using built-in Spark functionality? Also, when using the SimpleConsumer API
> one has to manage the offset per consumer. The high level consumer API solves this problem
> with zookeeper. I'd take the same approach if there's no easier way to handle this in Spark.
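> 
> For the offset handling, a minimal SimpleConsumer fetch loop would be roughly the
> following (broker, topic and client names are placeholders; leader lookup, error
> handling and the actual zookeeper commit are omitted):
> 
>     import scala.collection.JavaConverters._
>     import kafka.api.FetchRequestBuilder
>     import kafka.javaapi.consumer.SimpleConsumer
> 
>     val partition = 0
>     val consumer = new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "my-receiver")
>     var offset = 0L                          // restored from zookeeper on (re)start
>     while (true) {
>       val req = new FetchRequestBuilder()
>         .clientId("my-receiver")
>         .addFetch("my-topic", partition, offset, 1024 * 1024)
>         .build()
>       val resp = consumer.fetch(req)
>       for (msgAndOffset <- resp.messageSet("my-topic", partition).asScala) {
>         // push msgAndOffset.message into the dstream here
>         offset = msgAndOffset.nextOffset()   // and periodically write `offset` back to zookeeper
>       }
>     }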
> 
> 
> Best
> Nicolas
> 
> [1] https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
> [2] http://samza.incubator.apache.org/learn/documentation/0.7.0/container/task-runner.html
> 
> 
> On Fri, Mar 28, 2014 at 2:36 PM, Evgeniy Shishkin <itparanoia@gmail.com> wrote:
> One more question,
> 
> we are using memory_and_disk_ser_2
> and I am worried about when those rdds on disk will be removed
> http://i.imgur.com/dbq5T6i.png
> 
> unpersist is set to true, and rdds get purged from memory, but disk space just keeps growing.
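> 
> (Roughly, the relevant pieces of our setup, with zookeeper quorum, group and topic
> names replaced by placeholders:)
> 
>     import org.apache.spark.storage.StorageLevel
>     import org.apache.spark.streaming.kafka.KafkaUtils
> 
>     sparkConf.set("spark.streaming.unpersist", "true")   // old generated RDDs are dropped from memory
>     val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1),
>                                          StorageLevel.MEMORY_AND_DISK_SER_2)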
> 
> On 28 Mar 2014, at 01:32, Tathagata Das <tathagata.das1565@gmail.com> wrote:
> 
> > Yes, no one has reported this issue before. I just opened a JIRA on what I think is
> > the main problem here:
> > https://spark-project.atlassian.net/browse/SPARK-1340
> > Some of the receivers don't get restarted.
> > I have a bunch of refactoring in the NetworkReceiver ready to be posted as a PR that
> > should fix this.
> >
> > Regarding the second problem, I have been thinking of adding flow control (i.e.
> > limiting the rate of receiving) for a while, just haven't gotten around to it.
> > I added another JIRA for tracking this issue:
> > https://spark-project.atlassian.net/browse/SPARK-1341
> >
> >
> > TD
> >
> >
> > On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparanoia@gmail.com> wrote:
> >
> > On 28 Mar 2014, at 01:11, Scott Clasen <scott.clasen@gmail.com> wrote:
> >
> > > Evgeniy Shishkin wrote
> > >> So, at the bottom — kafka input stream just does not work.
> > >
> > >
> > > That was the conclusion I was coming to as well.  Are there open tickets
> > > around fixing this up?
> > >
> >
> > I am not aware of any. Actually nobody complained about spark+kafka before,
> > so I thought it just worked, and then we tried to build something on it and
> > almost failed.
> >
> > I think it is possible to steal/replicate how Twitter Storm works with kafka.
> > They do manual partition assignment, which would at least help to balance the load.
> >
> > There is another issue.
> > ssc creates new rdds every batch duration, always, even if the previous computation
> > did not finish.
> >
> > But with kafka, we could consume the next rdds later, after we finish the previous rdds.
> > That way it would be much, much simpler to not get OOM'ed when starting from the beginning,
> > because as it is, we can pull a huge amount of data from kafka during one batch duration
> > and then get an OOM.
> >
> > But we just cannot start slow, cannot limit how much to consume during a batch.
> >
> >
> > >
> >
> >
> 
> 

