spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Praneeth Gayam <praneeth.ga...@gmail.com>
Subject Re: Chaining Spark Streaming Jobs
Date Fri, 08 Sep 2017 06:24:15 GMT
With file stream you will have to deal with the following

   1. The file(s) must not be changed once created. So if the files are
   being continuously appended, the new data will not be read. Refer
   <https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources>
   2. The files must be created in the dataDirectory by atomically *moving*
    or *renaming* them into the data directory.

Since the latency requirements for the second job in the chain is only a
few mins, you may have to end up creating a new file every few mins

You may want to consider Kafka as your intermediary store for building a
chain/DAG of streaming jobs

On Fri, Sep 8, 2017 at 9:45 AM, Sunita Arvind <sunitarvind@gmail.com> wrote:

> Thanks for your response Michael
> Will try it out.
>
> Regards
> Sunita
>
> On Wed, Aug 23, 2017 at 2:30 PM Michael Armbrust <michael@databricks.com>
> wrote:
>
>> If you use structured streaming and the file sink, you can have a
>> subsequent stream read using the file source.  This will maintain exactly
>> once processing even if there are hiccups or failures.
>>
>> On Mon, Aug 21, 2017 at 2:02 PM, Sunita Arvind <sunitarvind@gmail.com>
>> wrote:
>>
>>> Hello Spark Experts,
>>>
>>> I have a design question w.r.t Spark Streaming. I have a streaming job
>>> that consumes protocol buffer encoded real time logs from a Kafka cluster
>>> on premise. My spark application runs on EMR (aws) and persists data onto
>>> s3. Before I persist, I need to strip header and convert protobuffer to
>>> parquet (I use sparksql-scalapb to convert from Protobuff to
>>> Spark.sql.Row). I need to persist Raw logs as is. I can continue the
>>> enrichment on the same dataframe after persisting the raw data, however, in
>>> order to modularize I am planning to have a separate job which picks up the
>>> raw data and performs enrichment on it. Also,  I am trying to avoid all in
>>> 1 job as the enrichments could get project specific while raw data
>>> persistence stays customer/project agnostic.The enriched data is allowed to
>>> have some latency (few minutes)
>>>
>>> My challenge is, after persisting the raw data, how do I chain the next
>>> streaming job. The only way I can think of is -  job 1 (raw data)
>>> partitions on current date (YYYYMMDD) and within current date, the job 2
>>> (enrichment job) filters for records within 60s of current time and
>>> performs enrichment on it in 60s batches.
>>> Is this a good option? It seems to be error prone. When either of the
>>> jobs get delayed due to bursts or any error/exception this could lead to
>>> huge data losses and non-deterministic behavior . What are other
>>> alternatives to this?
>>>
>>> Appreciate any guidance in this regard.
>>>
>>> regards
>>> Sunita Koppar
>>>
>>
>>

Mime
View raw message