nifi-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Psaltis <>
Subject Re: Spark Streaming with Nifi
Date Sun, 04 Jun 2017 23:36:47 GMT
Hi Shashi,
Thanks for the explanation.  I have a better understanding of what you are
trying to accomplish. Although Spark streaming is micro-batch, you would
not want to keep launching jobs for each batch.   Think of it as the Spark
scheduler having a while loop in which it executes your job then sleeps for
X amount of time based on the interval you configure.

Perhaps a better way would be to do the following:
1. Use the S2S ProvenanceReportingTask to send provenance information from
your NiFi instance to a second instance or cluster.
2. In the second NiFi instance/cluster ( the one receiving the provenance
data) you write the data into say HBase or Solr or system X.
3. In your Spark streaming job you right into the same data store a
"provenance" event -- obviously this will not have all the fields that a
true NiFi provenance record does, but you can come close.

With this then once you would then have all provenance data in an external
system that you can query to understand the whole system.


P.S. sorry if this is choppy or not well formed, on mobile.

On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <>

> Thanks Andrew.
> I agree that decoupling component is good solution from long term
> perspective. My current data pipeline in Nifi is designed for batch
> processing which I am trying to convert into streaming model.
> One of the processor in data pipeline invokes Spark job , once job
> finished control  is returned to Nifi processor in turn which generates
> provenance event for job. This provenance event is important for us.
> Keeping batch model architecture in mind, I want to designed spark
> streaming based model in which Nifi Spark streaming processor will process
> micro batch and job status will be returned to Nifi with provenance event.
> Then I can capture that provenance data for my reports.
> Essentially I will be using Nifi for capturing provenance event where
> actual processing will be done by Spark streaming job.
> Do you see this approach logical ?
> Thanks
> Shashi
> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <>
> wrote:
>> Hi Shashi,
>> I'm sure there is a way to make this work. However, my first question is
>> why you would want to? By design a Spark Streaming application should
>> always be running and consuming data from some source, hence the notion of
>> streaming. Tying Spark Streaming to NiFi would ultimately result in a more
>> coupled and fragile architecture. Perhaps a different way to think about it
>> would be to set things up like this:
>> NiFi --> Kafka <-- Spark Streaming
>> With this you can do what you are doing today -- using NiFi to ingest,
>> transform, make routing decisions, and feed data into Kafka. In essence you
>> would be using NiFi to do all the preparation of the data for Spark
>> Streaming. Kafka would serve the purpose of a buffer between NiFi and Spark
>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>> what it is designed for -- stream processing. Having a decoupled
>> architecture like this also allows you to manage each tier separately, thus
>> you can tune, scale, develop, and deploy all separately.
>> I know I did not directly answer your question on how to make it work.
>> But, hopefully this helps provide an approach that will be a better long
>> term solution. There may be something I am missing in your initial
>> questions.
>> Thanks,
>> Andrew
>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>> wrote:
>>> Hi
>>> I am looking for way where I can make use of spark streaming in Nifi. I
>>> see couple of post where SiteToSite tcp connection is used for spark
>>> streaming application but I thinking it will be good If I can launch Spark
>>> streaming from Nifi custom processor.
>>> PublishKafka will publish message into Kafka followed by Nifi Spark
>>> streaming processor will read from Kafka Topic.
>>> I can launch Spark streaming application from custom Nifi processor
>>> using Spark Streaming launcher API but biggest challenge is that it will
>>> create spark streaming context for each flow file which can be costly
>>> operation.
>>> Does any one suggest storing spark streaming context  in controller
>>> service ? or any better approach for running spark streaming application
>>> with Nifi ?
>>> Thanks and Regards,
>>> Shashi
>> --
>> Thanks,
>> Andrew
>> Subscribe to my book: Streaming Data <>
>> <>
>> twiiter: @itmdata <>
> --

Subscribe to my book: Streaming Data <>
twiiter: @itmdata <>

View raw message