spark-user mailing list archives

From Jeoffrey Lim <jeoffr...@gmail.com>
Subject Re: Maelstrom: Kafka integration with Spark
Date Wed, 24 Aug 2016 20:21:59 GMT
To clarify my earlier statement: I will continue working on Maelstrom
as an alternative to the official Spark integration with Kafka, and keep
the KafkaRDDs + Consumers as they are, until I find the official Spark Kafka
integration more stable and resilient to Kafka broker issues/failures (which
is why I have an infinite retry strategy in numerous places around
Kafka-related routines).

Not that I'm complaining or competing; at the end of the day, having
a Spark app that continues to work overnight gives a developer a good
night's sleep :)
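The "infinite retry" idea mentioned above can be sketched roughly like this: a hypothetical wrapper with capped exponential backoff, not Maelstrom's actual code.

```python
import time

def call_with_retries(fn, initial_backoff=1.0, max_backoff=30.0):
    """Keep retrying fn until it succeeds, sleeping with capped
    exponential backoff between attempts. A sketch of the kind of
    infinite-retry wrapper described above; not Maelstrom's code."""
    backoff = initial_backoff
    while True:
        try:
            return fn()
        except Exception:  # real code would catch only broker/network errors
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

A caller would wrap each Kafka-facing routine (fetching offsets, polling a partition, committing) in this helper so transient broker failures never kill the job.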

On Thu, Aug 25, 2016 at 3:23 AM, Jeoffrey Lim <jeoffreyl@gmail.com> wrote:

> Hi Cody, thank you for pointing out sub-millisecond processing; it is
> an "exaggerated" term :D I simply got excited releasing this project. It
> should be: "millisecond stream processing at the Spark level".
>
> Highly appreciate the info about the latest Kafka consumer. I would need
> to get up to speed on the most recent improvements and new features
> of Kafka itself.
>
> I think with Spark's latest Kafka Integration 0.10 features, Maelstrom's
> upside would only be the simple APIs (developer friendly). I'll play
> around with Spark 2.0 kafka-010 KafkaRDD to see if this is feasible.
>
>
> On Wed, Aug 24, 2016 at 10:46 PM, Cody Koeninger <cody@koeninger.org>
> wrote:
>
>> Yes, spark-streaming-kafka-0-10 uses the new consumer.   Besides
>> pre-fetching messages, the big reason for that is that security
>> features are only available with the new consumer.
>>
>> The Kafka project is at release 0.10.0.1 now, they think most of the
>> issues with the new consumer have been ironed out.  You can track the
>> progress as to when they'll remove the "beta" label at
>> https://issues.apache.org/jira/browse/KAFKA-3283
>>
>> As far as I know, Kafka in general can't achieve sub-millisecond
>> end-to-end stream processing, so my guess is you need to be more
>> specific about your terms there.
>>
>> I promise I'm not trying to start a pissing contest :)  just wanted to
>> check if you were aware of the current state of the other consumers.
>> Collaboration is always welcome.
>>
>>
>> On Tue, Aug 23, 2016 at 10:18 PM, Jeoffrey Lim <jeoffreyl@gmail.com>
>> wrote:
>> > Apologies, I was not aware that Spark 2.0 has Kafka Consumer
>> > caching/pooling now.
>> > What I have checked is the latest Kafka Consumer, and I believe it is
>> > still in beta quality.
>> >
>> > https://kafka.apache.org/documentation.html#newconsumerconfigs
>> >
>> >> Since 0.9.0.0 we have been working on a replacement for our existing
>> >> simple and high-level consumers.
>> >> The code is considered beta quality.
>> >
>> > Not sure about this: does the Spark 2.0 Kafka 0.10 integration already
>> > use this one? Is it now stable?
>> > With this caching feature in Spark 2.0, could it achieve
>> > sub-millisecond stream processing now?
>> >
>> >
>> > Maelstrom still uses the old Kafka Simple Consumer. This library was
>> > made open source so that I could continue working on it for future
>> > updates & improvements, e.g. when the latest Kafka Consumer gets a
>> > stable release.
>> >
>> > We have been using Maelstrom's "caching concept" for a long time now, as
>> > Receiver-based Spark Kafka integration does not work for us. There were
>> > thoughts about using the Direct Kafka APIs; however, Maelstrom has
>> > very simple APIs and just "simply works" even under unstable scenarios
>> > (e.g. advertised hostname failures on EMR).
>> >
>> > I believe Maelstrom will work even with Spark 1.3 and Kafka 0.8.2.1 (and
>> > of course with the latest Kafka 0.10 as well)
>> >
>> >
>> > On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>> >>
>> >> Were you aware that the spark 2.0 / kafka 0.10 integration also reuses
>> >> kafka consumer instances on the executors?
>> >>
>> >> On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim <jeoffreyl@gmail.com>
>> wrote:
>> >> > Hi,
>> >> >
>> >> > I have released the first version of a new Kafka integration with
>> Spark
>> >> > that we use in the company I work for: open sourced and named
>> Maelstrom.
>> >> >
>> >> > It is unique compared to other solutions out there, as it reuses the
>> >> > Kafka Consumer connection to achieve sub-millisecond latency.
>> >> >
>> >> > This library has been running stable in production environment and
>> has
>> >> > been proven to be resilient to numerous production issues.
>> >> >
>> >> >
>> >> > Please check out the project's page in github:
>> >> >
>> >> > https://github.com/jeoffreylim/maelstrom
>> >> >
>> >> >
>> >> > Contributors welcome!
>> >> >
>> >> >
>> >> > Cheers!
>> >> >
>> >> > Jeoffrey Lim
>> >> >
>> >> >
>> >> > P.S. I am also looking for a job opportunity, please look me up on
>> >> > LinkedIn
>> >
>> >
>>
>
>
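The consumer-reuse idea discussed throughout this thread (both Maelstrom's "caching concept" and the Spark 2.0 / Kafka 0.10 executor-side consumer reuse) boils down to keeping one consumer per topic-partition alive across batches instead of reconnecting each time. A minimal, hypothetical sketch of that pattern; `factory` stands in for whatever creates a real Kafka consumer:

```python
class ConsumerCache:
    """Cache one consumer per (topic, partition) so repeated batches
    reuse the existing connection. A sketch of the reuse idea only;
    not Maelstrom's or Spark's actual implementation."""

    def __init__(self, factory):
        self._factory = factory  # callable: (topic, partition) -> consumer
        self._cache = {}

    def get(self, topic, partition):
        """Return the cached consumer for this partition, creating it once."""
        key = (topic, partition)
        if key not in self._cache:
            self._cache[key] = self._factory(topic, partition)
        return self._cache[key]

    def close_all(self):
        """Release every cached consumer (e.g. on executor shutdown)."""
        for consumer in self._cache.values():
            close = getattr(consumer, "close", None)
            if callable(close):
                close()
        self._cache.clear()
```

Avoiding the per-batch connect/metadata-fetch/rebalance cycle is what lets both libraries cut per-batch latency to milliseconds.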
