spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sunita Arvind <>
Subject Re: Chaining Spark Streaming Jobs
Date Fri, 08 Sep 2017 10:56:43 GMT
Thanks for your response Praneeth. We did consider Kafka however cost was
the only hold back factor as we might need a larger cluster and existing
cluster is on premise and my app is on cloud. So the same cluster cannot be
But I agree it does sound like a good alternative.


On Thu, Sep 7, 2017 at 11:24 PM Praneeth Gayam <>

> With file stream you will have to deal with the following
>    1. The file(s) must not be changed once created. So if the files are
>    being continuously appended, the new data will not be read. Refer
>    <>
>    2. The files must be created in the dataDirectory by atomically
>    *moving* or *renaming* them into the data directory.
> Since the latency requirements for the second job in the chain is only a
> few mins, you may have to end up creating a new file every few mins
> You may want to consider Kafka as your intermediary store for building a
> chain/DAG of streaming jobs
> On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind <>
> wrote:
>> Thanks for your response Michael
>> Will try it out.
>> Regards
>> Sunita
>> On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <>
>> wrote:
>>> If you use structured streaming and the file sink, you can have a
>>> subsequent stream read using the file source.  This will maintain exactly
>>> once processing even if there are hiccups or failures.
>>> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <>
>>> wrote:
>>>> Hello Spark Experts,
>>>> I have a design question w.r.t Spark Streaming. I have a streaming job
>>>> that consumes protocol buffer encoded real time logs from a Kafka cluster
>>>> on premise. My spark application runs on EMR (aws) and persists data onto
>>>> s3. Before I persist, I need to strip header and convert protobuffer to
>>>> parquet (I use sparksql-scalapb to convert from Protobuff to
>>>> Spark.sql.Row). I need to persist Raw logs as is. I can continue the
>>>> enrichment on the same dataframe after persisting the raw data, however,
>>>> order to modularize I am planning to have a separate job which picks up the
>>>> raw data and performs enrichment on it. Also,  I am trying to avoid all in
>>>> 1 job as the enrichments could get project specific while raw data
>>>> persistence stays customer/project agnostic.The enriched data is allowed
>>>> have some latency (few minutes)
>>>> My challenge is, after persisting the raw data, how do I chain the next
>>>> streaming job. The only way I can think of is -  job 1 (raw data)
>>>> partitions on current date (YYYYMMDD) and within current date, the job 2
>>>> (enrichment job) filters for records within 60s of current time and
>>>> performs enrichment on it in 60s batches.
>>>> Is this a good option? It seems to be error prone. When either of the
>>>> jobs get delayed due to bursts or any error/exception this could lead to
>>>> huge data losses and non-deterministic behavior . What are other
>>>> alternatives to this?
>>>> Appreciate any guidance in this regard.
>>>> regards
>>>> Sunita Koppar

View raw message