nifi-users mailing list archives

From Shashi Vishwakarma <shashi.vish...@gmail.com>
Subject Re: Spark Streaming with Nifi
Date Wed, 07 Jun 2017 20:31:09 GMT
Thanks a lot, Andrew. This is exactly what I was looking for.

I have two questions at this point, keeping in mind that I have to generate
provenance events.

1. How will I manage upgrades? What if I generate custom provenance events
and the NiFi community makes significant changes to the NiFi provenance
structure?

2. How do I get lineage information?

Thanks
Shashi


On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <psaltis.andrew@gmail.com>
wrote:

> Hi Shashi,
> Your assumption is correct -- you would want to send a "provenance" event
> from your Spark job. You can see the structure of the provenance events in
> NiFi here [1].
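>
> As a rough illustration (a sketch only -- the field names below are
> approximated from a subset of the DTO in [1], and the values are
> placeholders), you could build such an event as a simple map before
> serializing it:
>
> import java.util.LinkedHashMap;
> import java.util.Map;
> import java.util.UUID;
>
> // Provenance-style event modeled loosely on ProvenanceEventDTO [1].
> Map<String, Object> buildProvenanceEvent() {
>     Map<String, Object> event = new LinkedHashMap<>();
>     event.put("eventId", UUID.randomUUID().toString());
>     event.put("eventTime", System.currentTimeMillis());
>     event.put("eventType", "RECEIVE");               // e.g. RECEIVE, SEND, ROUTE
>     event.put("componentId", "spark-streaming-job"); // which component emitted it
>     event.put("componentType", "SparkStreaming");
>     event.put("flowFileUuid", UUID.randomUUID().toString());
>     event.put("transitUri", "kafka://broker:9092/mytopic");
>     event.put("details", "processed by Spark Streaming micro-batch");
>     return event;
> }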
>
> Regarding the flow, if you are waiting on the Spark Streaming code to
> compute some value before you continue you can construct it perhaps this
> way:
>
>
> [image: inline flow diagram -- roughly: PublishKafka --> Kafka topic -->
> Spark Streaming --> results topic --> ConsumeKafka --> downstream NiFi
> processors]
>
> Hopefully that helps to clarify it a little. In essence, if you are waiting
> on results from the Spark Streaming computation before continuing, you would
> use Kafka for the output results from Spark Streaming and then consume that
> in NiFi and carry on with your processing.
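>
> A minimal sketch of that output side, assuming plain string results (the
> topic name and broker address are placeholders); on the NiFi side a
> ConsumeKafka processor subscribed to the same topic picks the results up:
>
> import java.util.Properties;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
> import org.apache.spark.streaming.api.java.JavaDStream;
>
> void publishResults(JavaDStream<String> results) {
>     results.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "broker:9092");
>         props.put("key.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>         props.put("value.serializer",
>             "org.apache.kafka.common.serialization.StringSerializer");
>         // One producer per partition keeps the sketch short; a pooled
>         // producer would be more efficient in a real job.
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             partition.forEachRemaining(value ->
>                 producer.send(new ProducerRecord<>("spark-results", value)));
>         }
>     }));
> }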
>
> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-client-dto/src/main/java/org/apache/nifi/web/api/dto/provenance/ProvenanceEventDTO.java
>
> Thanks,
> Andrew
>
> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
> shashi.vish123@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> I am trying to understand this in a bit more detail. Essentially, I will
>> have to write some custom code in my Spark Streaming job to construct a
>> provenance event and send it to some store, like HBase or a pub/sub system,
>> to be consumed by others.
>>
>> Is that correct ?
>>
>> If yes, how do I execute the other processors that are present in the
>> pipeline?
>>
>> For example:
>>
>> NiFi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>>
>> Thanks
>> Shashi
>>
>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <psaltis.andrew@gmail.com
>> > wrote:
>>
>>> Hi Shashi,
>>> Thanks for the explanation. I have a better understanding of what you
>>> are trying to accomplish. Although Spark Streaming is micro-batch, you
>>> would not want to keep launching jobs for each batch. Think of it as the
>>> Spark scheduler running a while loop in which it executes your job and then
>>> sleeps for X amount of time, based on the batch interval you configure.
>>>
>>> Perhaps a better way would be to do the following:
>>> 1. Use the Site-to-Site ProvenanceReportingTask to send provenance
>>> information from your NiFi instance to a second instance or cluster.
>>> 2. In the second NiFi instance/cluster (the one receiving the provenance
>>> data), write the data into, say, HBase, Solr, or system X.
>>> 3. In your Spark Streaming job, write a "provenance" event into the same
>>> data store (see the sketch below) -- obviously this will not have all the
>>> fields that a true NiFi provenance record has, but you can come close.
>>>
>>> With this, you would then have all of the provenance data in an external
>>> system that you can query to understand the whole pipeline.
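>>>
>>> A rough sketch of step 3, assuming the shared store is HBase (the table,
>>> column family, and qualifier names below are made-up placeholders):
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.TableName;
>>> import org.apache.hadoop.hbase.client.Connection;
>>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> import org.apache.hadoop.hbase.client.Put;
>>> import org.apache.hadoop.hbase.client.Table;
>>> import org.apache.hadoop.hbase.util.Bytes;
>>>
>>> void writeProvenanceEvent(String eventId, String eventType,
>>>         String flowFileUuid) throws java.io.IOException {
>>>     Configuration conf = HBaseConfiguration.create();
>>>     try (Connection conn = ConnectionFactory.createConnection(conf);
>>>          Table table = conn.getTable(TableName.valueOf("provenance"))) {
>>>         // Row key = event id; one column per provenance field.
>>>         Put put = new Put(Bytes.toBytes(eventId));
>>>         put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("eventType"),
>>>                 Bytes.toBytes(eventType));
>>>         put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("componentType"),
>>>                 Bytes.toBytes("SparkStreaming"));
>>>         put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("flowFileUuid"),
>>>                 Bytes.toBytes(flowFileUuid));
>>>         put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("eventTime"),
>>>                 Bytes.toBytes(System.currentTimeMillis()));
>>>         table.put(put);
>>>     }
>>> }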
>>>
>>> Thanks,
>>> Andrew
>>>
>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>
>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>> shashi.vish123@gmail.com> wrote:
>>>
>>>> Thanks Andrew.
>>>>
>>>> I agree that decoupling the components is a good solution from a
>>>> long-term perspective. My current data pipeline in NiFi is designed for
>>>> batch processing, which I am trying to convert into a streaming model.
>>>>
>>>> One of the processors in the data pipeline invokes a Spark job; once the
>>>> job finishes, control is returned to the NiFi processor, which in turn
>>>> generates a provenance event for the job. This provenance event is
>>>> important for us.
>>>>
>>>> Keeping the batch-model architecture in mind, I want to design a Spark
>>>> Streaming based model in which a NiFi Spark Streaming processor will
>>>> process each micro-batch and the job status will be returned to NiFi along
>>>> with a provenance event. Then I can capture that provenance data for my
>>>> reports.
>>>>
>>>> Essentially, I will be using NiFi for capturing provenance events, while
>>>> the actual processing will be done by the Spark Streaming job.
>>>>
>>>> Does this approach seem logical to you?
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>>
>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>> psaltis.andrew@gmail.com> wrote:
>>>>
>>>>> Hi Shashi,
>>>>> I'm sure there is a way to make this work. However, my first question
>>>>> is why you would want to? By design, a Spark Streaming application should
>>>>> always be running and consuming data from some source -- hence the notion
>>>>> of streaming. Tying Spark Streaming to NiFi would ultimately result in a
>>>>> more tightly coupled and fragile architecture. Perhaps a different way to
>>>>> think about it would be to set things up like this:
>>>>>
>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>
>>>>> With this you can do what you are doing today -- using NiFi to ingest,
>>>>> transform, make routing decisions, and feed data into Kafka. In essence,
>>>>> you would be using NiFi to do all the preparation of the data for Spark
>>>>> Streaming. Kafka would serve as a buffer between NiFi and Spark
>>>>> Streaming. Finally, Spark Streaming would ingest data from Kafka and do
>>>>> what it is designed for -- stream processing. Having a decoupled
>>>>> architecture like this also lets you manage each tier separately: you can
>>>>> tune, scale, develop, and deploy each one independently.
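>>>>>
>>>>> A minimal sketch of the Spark Streaming side of that picture, using the
>>>>> spark-streaming-kafka-0-10 direct stream (the broker, topic, and group
>>>>> id are placeholders):
>>>>>
>>>>> import java.util.Arrays;
>>>>> import java.util.HashMap;
>>>>> import java.util.Map;
>>>>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>>>>> import org.apache.kafka.common.serialization.StringDeserializer;
>>>>> import org.apache.spark.SparkConf;
>>>>> import org.apache.spark.streaming.Durations;
>>>>> import org.apache.spark.streaming.api.java.JavaInputDStream;
>>>>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>>>>> import org.apache.spark.streaming.kafka010.ConsumerStrategies;
>>>>> import org.apache.spark.streaming.kafka010.KafkaUtils;
>>>>> import org.apache.spark.streaming.kafka010.LocationStrategies;
>>>>>
>>>>> public class NiFiFedStream {
>>>>>     public static void main(String[] args) throws InterruptedException {
>>>>>         SparkConf conf = new SparkConf().setAppName("nifi-fed-stream");
>>>>>         JavaStreamingContext jssc =
>>>>>             new JavaStreamingContext(conf, Durations.seconds(10));
>>>>>
>>>>>         Map<String, Object> kafkaParams = new HashMap<>();
>>>>>         kafkaParams.put("bootstrap.servers", "broker:9092");
>>>>>         kafkaParams.put("key.deserializer", StringDeserializer.class);
>>>>>         kafkaParams.put("value.deserializer", StringDeserializer.class);
>>>>>         kafkaParams.put("group.id", "spark-nifi-consumer");
>>>>>
>>>>>         JavaInputDStream<ConsumerRecord<String, String>> stream =
>>>>>             KafkaUtils.createDirectStream(
>>>>>                 jssc,
>>>>>                 LocationStrategies.PreferConsistent(),
>>>>>                 ConsumerStrategies.<String, String>Subscribe(
>>>>>                     Arrays.asList("nifi-out"), kafkaParams));
>>>>>
>>>>>         // Each micro-batch: extract the record values and apply the
>>>>>         // stream-processing logic.
>>>>>         stream.map(ConsumerRecord::value)
>>>>>               .foreachRDD(rdd -> rdd.foreach(value -> {
>>>>>                   // processing goes here
>>>>>               }));
>>>>>
>>>>>         jssc.start();
>>>>>         jssc.awaitTermination();
>>>>>     }
>>>>> }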
>>>>>
>>>>> I know I did not directly answer your question on how to make it work,
>>>>> but hopefully this provides an approach that will be a better long-term
>>>>> solution. There may be something I am missing from your initial
>>>>> question.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>> shashi.vish123@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am looking for a way to make use of Spark Streaming with NiFi. I
>>>>>> have seen a couple of posts where a Site-to-Site TCP connection is used
>>>>>> for a Spark Streaming application, but I think it would be good if I
>>>>>> could launch Spark Streaming from a custom NiFi processor.
>>>>>>
>>>>>> PublishKafka will publish messages into Kafka, and then the NiFi Spark
>>>>>> Streaming processor will read from the Kafka topic.
>>>>>>
>>>>>> I can launch a Spark Streaming application from a custom NiFi
>>>>>> processor using the Spark launcher API, but the biggest challenge is
>>>>>> that it will create a Spark streaming context for each flow file, which
>>>>>> can be a costly operation (see the sketch below).
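>>>>>>
>>>>>> Roughly what I mean, launching from the processor with the Spark
>>>>>> launcher API (the jar path and class name are just placeholders):
>>>>>>
>>>>>> import org.apache.spark.launcher.SparkAppHandle;
>>>>>> import org.apache.spark.launcher.SparkLauncher;
>>>>>>
>>>>>> // Called per flow file -- so every trigger spins up a whole new
>>>>>> // driver and streaming context, which is the expensive part.
>>>>>> SparkAppHandle handle = new SparkLauncher()
>>>>>>     .setAppResource("/path/to/streaming-job.jar")
>>>>>>     .setMainClass("com.example.StreamingJob")
>>>>>>     .setMaster("yarn")
>>>>>>     .startApplication();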
>>>>>>
>>>>>> Does anyone suggest storing the Spark streaming context in a controller
>>>>>> service, so that it is created once and shared? Or is there a better
>>>>>> approach for running a Spark Streaming application with NiFi?
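>>>>>>
>>>>>> For example, something like this hypothetical controller service that
>>>>>> owns a single long-lived handle, so the job is launched once when the
>>>>>> service is enabled rather than once per flow file (all names here are
>>>>>> made up for illustration):
>>>>>>
>>>>>> import org.apache.nifi.annotation.lifecycle.OnDisabled;
>>>>>> import org.apache.nifi.annotation.lifecycle.OnEnabled;
>>>>>> import org.apache.nifi.controller.AbstractControllerService;
>>>>>> import org.apache.nifi.controller.ConfigurationContext;
>>>>>> import org.apache.spark.launcher.SparkAppHandle;
>>>>>> import org.apache.spark.launcher.SparkLauncher;
>>>>>>
>>>>>> public class SparkStreamingService extends AbstractControllerService {
>>>>>>
>>>>>>     private volatile SparkAppHandle handle;
>>>>>>
>>>>>>     @OnEnabled
>>>>>>     public void startJob(ConfigurationContext context) throws Exception {
>>>>>>         // Launch the streaming job once, when the service is enabled.
>>>>>>         handle = new SparkLauncher()
>>>>>>             .setAppResource("/path/to/streaming-job.jar")
>>>>>>             .setMainClass("com.example.StreamingJob")
>>>>>>             .setMaster("yarn")
>>>>>>             .startApplication();
>>>>>>     }
>>>>>>
>>>>>>     @OnDisabled
>>>>>>     public void stopJob() {
>>>>>>         if (handle != null) {
>>>>>>             handle.stop();
>>>>>>         }
>>>>>>     }
>>>>>> }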
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Shashi
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>>> twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>>>
>>>>
>>> --
>>> Thanks,
>>> Andrew
>>>
>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>> twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>
>>
>>
>
>
> --
> Thanks,
> Andrew
>
> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>
