spark-user mailing list archives

From Jeetendra Gangele <>
Subject Re: Deploying spark-streaming application on production
Date Thu, 01 Oct 2015 14:49:43 GMT
Yes. Also, I think I need to enable checkpointing and, rather than rebuilding
the lineage DAG, store the RDD data in HDFS.
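For reference, a minimal sketch of what enabling checkpointing could look like in a Spark Streaming driver (the application name, HDFS path, and batch interval are illustrative placeholders, not taken from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MqttToHdfs {
  val checkpointDir = "hdfs:///checkpoints/mqtt-to-hdfs" // placeholder path

  // Build a fresh context; only called when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("mqtt-to-hdfs")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // enables metadata + RDD checkpointing
    // ... define the MQTT input DStream and the write-to-HDFS output here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint after a restart, or create a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

One caveat: checkpoint recovery only works across restarts of the same application code; after deploying a changed jar, the serialized DStream graph in the checkpoint is generally not reusable.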

On 23 September 2015 at 01:04, Adrian Tanase <> wrote:

> btw I re-read the docs and I want to clarify that a reliable receiver + WAL
> gives you at-least-once, not exactly-once, semantics.
> Sent from my iPhone
> On 21 Sep 2015, at 21:50, Adrian Tanase <> wrote:
> I'm wondering, isn't this the canonical use case for WAL + reliable
> receiver?
> As far as I know you can tune the MQTT server to wait for an ack on messages
> (QoS level 2?).
> With some support from the client library you could achieve exactly-once
> semantics on the read side, if you ack a message only after writing it to
> the WAL, correct?
> -adrian
> Sent from my iPhone
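For reference, turning on the receiver write-ahead log (available since Spark 1.2) is a configuration change plus a checkpoint directory; a sketch, with placeholder paths:

```
# spark-defaults.conf (or via --conf on spark-submit)
spark.streaming.receiver.writeAheadLog.enable  true
```

plus, in the driver, something like `ssc.checkpoint("hdfs:///checkpoints/app")` so the WAL has a fault-tolerant directory to write to. Since the WAL already persists the received data, the input stream's storage level can drop in-memory replication (e.g. `StorageLevel.MEMORY_AND_DISK_SER`).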
> On 21 Sep 2015, at 12:35, Petr Novak <> wrote:
> In short, there is no direct support for it in Spark, AFAIK. You will either
> manage it in MQTT or have to add another layer of indirection, either
> in-memory based (observable streams, an in-memory DB) or disk based (Kafka,
> HDFS files, a DB), which will buffer your unprocessed events.
> Now that I think of it, there is support for backpressure in v1.5.0, but I
> don't know if it could be exploited here, i.e. I don't know if it is
> possible in Spark to decouple reading events into memory from the actual
> processing code so that the latter could be swapped on the fly. Probably not
> without some custom-built facility for it.
> Petr
> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <> wrote:
>> I should read my posts at least once to avoid so many typos. Hopefully
>> you are brave enough to read through.
>> Petr
>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <>
>> wrote:
>>> I think you would have to persist events somehow if you don't want to
>>> miss them. I don't see any other option there. Either in MQTT, if it is
>>> supported there, or by routing them through Kafka.
>>> There is a WriteAheadLog in Spark, but you would have to decouple MQTT
>>> stream reading and processing into two separate jobs so that you could
>>> upgrade the processing one, assuming the reading one stays stable (without
>>> changes) across versions. But this is problematic because there is no easy
>>> way to share DStreams between jobs; you would have to develop your own
>>> facility for it.
>>> Alternatively, the reading job could save MQTT events in their rawest
>>> form into files, to limit the need to change its code, and the processing
>>> job would then work on top of them using file-based Spark Streaming. I
>>> think this is inefficient and can get quite complex if you want to make it
>>> reliable.
>>> Basically, either MQTT supports persistence (which I don't know) or there
>>> is Kafka for these use cases.
>>> Another option, I think, would be to place observable streams with
>>> backpressure between MQTT and Spark Streaming, so long as you can finish
>>> the upgrade before the buffers fill up.
>>> I'm sorry this is not well thought out on my side; it is just a
>>> brainstorm, but it might lead you somewhere.
>>> Regards,
>>> Petr
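Routing events through Kafka, as suggested above, could look roughly like the following standalone bridge (a hypothetical sketch using the Eclipse Paho MQTT client and the Kafka producer; the broker addresses, client ID, and topic names are all placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.eclipse.paho.client.mqttv3._

object MqttKafkaBridge {
  def main(args: Array[String]): Unit = {
    // Kafka producer: the durable buffer the Spark job will read from.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.ByteArraySerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.ByteArraySerializer")
    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

    val mqtt = new MqttClient("tcp://broker:1883", "mqtt-kafka-bridge")
    mqtt.setCallback(new MqttCallback {
      override def messageArrived(topic: String, msg: MqttMessage): Unit =
        // Forward the raw payload into Kafka; Kafka retains it for replay.
        producer.send(new ProducerRecord("events", msg.getPayload))
      override def connectionLost(cause: Throwable): Unit =
        () // reconnect logic would go here
      override def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
    })

    val opts = new MqttConnectOptions()
    opts.setCleanSession(false) // durable subscription survives bridge restarts
    mqtt.connect(opts)
    mqtt.subscribe("sensor/#", 2) // QoS 2: broker holds messages until acked
  }
}
```

The Spark Streaming job then consumes from Kafka instead of MQTT, so it can be redeployed while the bridge and broker keep buffering.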
>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <
>>>> wrote:
>>>> Hi All,
>>>> I have a Spark Streaming application with a small batch interval (10 ms)
>>>> that reads an MQTT channel and dumps the data from MQTT to HDFS.
>>>> So suppose I have to deploy a new application jar (with changes to the
>>>> Spark Streaming application); what is the best way to deploy it?
>>>> Currently I do the following:
>>>> 1. kill the running streaming app using yarn application -kill ID
>>>> 2. then start the application again
>>>> The problem with the above approach is that, since we are not persisting
>>>> the events in MQTT, we will miss the events for the duration of the
>>>> deploy.
>>>> How do I handle this case?
>>>> Regards,
>>>> Jeetendra
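One thing that can soften step 1 above: instead of relying on yarn application -kill alone, the driver can stop the StreamingContext gracefully so that already-received batches finish before the JVM exits. A sketch, assuming `ssc` is the running context:

```scala
import org.apache.spark.streaming.StreamingContext

// Either set spark.streaming.stopGracefullyOnShutdown=true (Spark 1.4+),
// or register a shutdown hook yourself:
def registerGracefulStop(ssc: StreamingContext): Unit =
  sys.addShutdownHook {
    // stopGracefully = true waits for received data to be processed first.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
```

This does not by itself cover events published to MQTT while the application is down; that still needs persistence on the broker side or an intermediate buffer such as Kafka, as discussed above.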
