nifi-users mailing list archives

From Shashi Vishwakarma <shashi.vish...@gmail.com>
Subject Re: Spark Streaming with Nifi
Date Thu, 08 Jun 2017 19:32:50 GMT
Thanks Andrew.

Things are pretty clear now. At a low level, I need to write a piece of Java
code that creates a JSON structure similar to a NiFi provenance event and
sends it to another store.
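
For example, a rough sketch of what I have in mind (field names are
illustrative, loosely modeled on NiFi's ProvenanceEventDTO; Jackson is
assumed for the JSON serialization, and the target store is left abstract):

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    public class SparkProvenanceEvent {
        public static void main(String[] args) throws Exception {
            // Illustrative fields only -- loosely modeled on NiFi's
            // ProvenanceEventDTO, not a real NiFi provenance record.
            Map<String, Object> event = new HashMap<>();
            event.put("eventId", UUID.randomUUID().toString());
            event.put("eventTime", System.currentTimeMillis());
            event.put("eventType", "SPARK_TRANSFORM");    // custom type, not a NiFi enum
            event.put("componentType", "Spark");
            event.put("componentId", "my-streaming-job"); // hypothetical job identifier
            event.put("details", "micro-batch processed");

            String json = new ObjectMapper().writeValueAsString(event);
            // Ship 'json' off to the external store (HBase, a pub/sub topic, etc.).
            System.out.println(json);
        }
    }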

I was under the impression that NiFi had the flexibility to update the
provenance store via API calls.

Thanks
Shashi


On Thu, Jun 8, 2017 at 12:37 PM, Andrew Psaltis <psaltis.andrew@gmail.com>
wrote:

> Hi Shashi,
> At this time you cannot write a provenance event into the NiFi provenance
> repository, which is stored locally on the node that is processing the data.
> The repository is internal to NiFi; that is why I was suggesting creating a
> "Spark Provenance Event" that you write to the same external store, so that
> you can have all the data in one place. However, the data coming from Spark
> will certainly be different. More information on the provenance repository
> usage can be found here [1] and the design here [2].
>
> Hope that helps.
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data_provenance
> [2] https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
>
> Thanks,
> Andrew
>
> On Thu, Jun 8, 2017 at 6:50 AM, Shashi Vishwakarma <
> shashi.vish123@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> Regarding creating the Spark provenance event,
>>
>> "let's call it a Spark Provenance Event -- in this you can populate as
>> much data as you have and write that to a similar data store."
>>
>> Is there any way I can write my Spark provenance event to the NiFi
>> provenance store with some EventId?
>>
>> I have a ReportingTask which sends events to another application, but it
>> relies on the NiFi provenance store. I am thinking that the Spark job will
>> emit a provenance event which will be written into the NiFi provenance
>> store, and the reporting task will send that on to the other application.
>>
>> Apologies if my use case is still unclear.
>>
>> Thanks
>> Shashi
>>
>>
>>
>> On Thu, Jun 8, 2017 at 3:23 AM, Andrew Psaltis <psaltis.andrew@gmail.com>
>> wrote:
>>
>>> Hi Shashi,
>>> Regarding your upgrade question, I may have confused things. When
>>> emitting a "provenance" event from your Spark Streaming job, it will not
>>> be the exact same event as the one emitted by NiFi. I was referencing the
>>> code in the previous email to give insight into the details NiFi does
>>> provide. In your Spark application you will not have all of the
>>> information needed to populate a NiFi provenance event. Therefore, for
>>> your Spark code you can come up with a new event, let's call it a Spark
>>> Provenance Event -- in this you can populate as much data as you have and
>>> write it to a similar data store. For example, you would want a
>>> timestamp, the component can be Spark, and any other data you need to
>>> emit. Basically, you will be combining the NiFi provenance data with your
>>> custom Spark provenance data to create a complete picture.
>>>
>>> As far as the lineage goes, again, your Spark Streaming code will be
>>> executing outside of NiFi, and you will have to write the lineage into
>>> some other store, perhaps Atlas; then you can have the lineage for both
>>> NiFi and Spark. This [1] is an example NiFi reporting task that sends
>>> lineage data to Atlas; you could extend this concept to work with Spark
>>> as well.
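>>>
>>> The skeleton of such a reporting task looks roughly like this -- a hedged
>>> sketch against the nifi-api ReportingTask interface; the class name, the
>>> batch size, and the forwarding of each event to the external store are
>>> placeholders:
>>>
>>>     import java.io.IOException;
>>>     import java.util.List;
>>>
>>>     import org.apache.nifi.provenance.ProvenanceEventRecord;
>>>     import org.apache.nifi.reporting.AbstractReportingTask;
>>>     import org.apache.nifi.reporting.ReportingContext;
>>>
>>>     public class LineageForwardingTask extends AbstractReportingTask {
>>>
>>>         private long lastEventId = 0L; // resume point between trigger runs
>>>
>>>         @Override
>>>         public void onTrigger(final ReportingContext context) {
>>>             try {
>>>                 final List<ProvenanceEventRecord> events =
>>>                         context.getEventAccess().getProvenanceEvents(lastEventId, 1000);
>>>                 for (final ProvenanceEventRecord event : events) {
>>>                     // Forward componentId, eventType, timestamps, etc. to
>>>                     // the external lineage store (Atlas, HBase, ...) --
>>>                     // left abstract here.
>>>                     lastEventId = Math.max(lastEventId, event.getEventId() + 1);
>>>                 }
>>>             } catch (final IOException ioe) {
>>>                 getLogger().error("Failed to read provenance events", ioe);
>>>             }
>>>         }
>>>     }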
>>>
>>> Hopefully this helps clarify some things; sorry if my previous email was
>>> not completely clear.
>>>
>>> Thanks
>>> Andrew
>>>
>>> On Wed, Jun 7, 2017 at 4:31 PM, Shashi Vishwakarma <
>>> shashi.vish123@gmail.com> wrote:
>>>
>>>> Thanks a lot Andrew. This is something I was looking for.
>>>>
>>>> I have two questions at this point, keeping in mind that I have to
>>>> generate provenance events.
>>>>
>>>> 1. How will I manage upgrades if I generate custom provenance events and
>>>> the NiFi community makes significant changes to the NiFi provenance
>>>> structure?
>>>>
>>>> 2. How do I get lineage information?
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>>
>>>> On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <
>>>> psaltis.andrew@gmail.com> wrote:
>>>>
>>>>> Hi Shashi,
>>>>> Your assumption is correct -- you would want to send a "provenance"
>>>>> event from your Spark job. You can see the structure of the provenance
>>>>> events in NiFi here [1].
>>>>>
>>>>> Regarding the flow, if you are waiting on the Spark Streaming code to
>>>>> compute some value before you continue, you could construct it perhaps
>>>>> this way:
>>>>>
>>>>>
>>>>> [inline image: flow diagram -- NiFi publishes to Kafka, Spark Streaming
>>>>> consumes and computes, results go back to Kafka, and NiFi consumes them]
>>>>>
>>>>> Hopefully that helps to clarify it a little. In essence, if you are
>>>>> waiting on results from the Spark Streaming computation before
>>>>> continuing, you would use Kafka for the output results from Spark
>>>>> Streaming and then consume that in NiFi and carry on with your
>>>>> processing.
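>>>>>
>>>>> As a rough sketch, the Spark side of that loop might look like the
>>>>> following (assuming the spark-streaming-kafka-0-10 connector; the broker
>>>>> address and topic names are made up):
>>>>>
>>>>>     import java.util.Collections;
>>>>>     import java.util.HashMap;
>>>>>     import java.util.Map;
>>>>>     import java.util.Properties;
>>>>>
>>>>>     import org.apache.kafka.clients.consumer.ConsumerRecord;
>>>>>     import org.apache.kafka.clients.producer.KafkaProducer;
>>>>>     import org.apache.kafka.clients.producer.ProducerRecord;
>>>>>     import org.apache.kafka.common.serialization.StringDeserializer;
>>>>>     import org.apache.spark.SparkConf;
>>>>>     import org.apache.spark.streaming.Durations;
>>>>>     import org.apache.spark.streaming.api.java.JavaInputDStream;
>>>>>     import org.apache.spark.streaming.api.java.JavaStreamingContext;
>>>>>     import org.apache.spark.streaming.kafka010.ConsumerStrategies;
>>>>>     import org.apache.spark.streaming.kafka010.KafkaUtils;
>>>>>     import org.apache.spark.streaming.kafka010.LocationStrategies;
>>>>>
>>>>>     public class NiFiLoop {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             JavaStreamingContext jssc = new JavaStreamingContext(
>>>>>                     new SparkConf().setAppName("nifi-loop"), Durations.seconds(10));
>>>>>
>>>>>             Map<String, Object> kafkaParams = new HashMap<>();
>>>>>             kafkaParams.put("bootstrap.servers", "broker:9092");
>>>>>             kafkaParams.put("key.deserializer", StringDeserializer.class);
>>>>>             kafkaParams.put("value.deserializer", StringDeserializer.class);
>>>>>             kafkaParams.put("group.id", "spark-nifi");
>>>>>
>>>>>             // Consume the topic NiFi publishes to...
>>>>>             JavaInputDStream<ConsumerRecord<String, String>> in =
>>>>>                     KafkaUtils.createDirectStream(jssc,
>>>>>                             LocationStrategies.PreferConsistent(),
>>>>>                             ConsumerStrategies.<String, String>Subscribe(
>>>>>                                     Collections.singletonList("from-nifi"), kafkaParams));
>>>>>
>>>>>             // ...compute, then write results back for NiFi's ConsumeKafka.
>>>>>             in.map(ConsumerRecord::value).foreachRDD(rdd -> rdd.foreachPartition(part -> {
>>>>>                 Properties p = new Properties();
>>>>>                 p.put("bootstrap.servers", "broker:9092");
>>>>>                 p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>>>>>                 p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>>>>>                 try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
>>>>>                     part.forEachRemaining(v -> producer.send(new ProducerRecord<>("to-nifi", v)));
>>>>>                 }
>>>>>             }));
>>>>>
>>>>>             jssc.start();
>>>>>             jssc.awaitTermination();
>>>>>         }
>>>>>     }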
>>>>>
>>>>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-client-dto/src/main/java/org/apache/nifi/web/api/dto/provenance/ProvenanceEventDTO.java
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
>>>>> shashi.vish123@gmail.com> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> I am trying to understand this in a bit more detail. Essentially, I
>>>>>> will have to write some custom code in my Spark Streaming job to
>>>>>> construct a provenance event and send it to some store, like HBase or a
>>>>>> pub/sub system, to be consumed by others.
>>>>>>
>>>>>> Is that correct?
>>>>>>
>>>>>> If yes, how do I execute the other processors which are present in the
>>>>>> pipeline?
>>>>>>
>>>>>> Ex:
>>>>>>
>>>>>> NiFi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>>>>>>
>>>>>> Thanks
>>>>>> Shashi
>>>>>>
>>>>>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <
>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Shashi,
>>>>>>> Thanks for the explanation. I have a better understanding of what you
>>>>>>> are trying to accomplish. Although Spark Streaming is micro-batch, you
>>>>>>> would not want to keep launching jobs for each batch. Think of it as
>>>>>>> the Spark scheduler having a while loop in which it executes your job,
>>>>>>> then sleeps for X amount of time based on the interval you configure.
>>>>>>>
>>>>>>> Perhaps a better way would be to do the following:
>>>>>>> 1. Use the S2S ProvenanceReportingTask to send provenance information
>>>>>>> from your NiFi instance to a second instance or cluster.
>>>>>>> 2. In the second NiFi instance/cluster (the one receiving the
>>>>>>> provenance data), write the data into, say, HBase, Solr, or system X.
>>>>>>> 3. In your Spark Streaming job, write a "provenance" event into the
>>>>>>> same data store -- obviously this will not have all the fields that a
>>>>>>> true NiFi provenance record does, but you can come close. See the
>>>>>>> sketch below.
>>>>>>>
>>>>>>> With this you would then have all provenance data in an external
>>>>>>> system that you can query to understand the whole system.
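>>>>>>>
>>>>>>> A minimal sketch of step 3, assuming HBase as the store (the table,
>>>>>>> column family, and row-key scheme are made up):
>>>>>>>
>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>>     import org.apache.hadoop.hbase.TableName;
>>>>>>>     import org.apache.hadoop.hbase.client.Connection;
>>>>>>>     import org.apache.hadoop.hbase.client.ConnectionFactory;
>>>>>>>     import org.apache.hadoop.hbase.client.Put;
>>>>>>>     import org.apache.hadoop.hbase.client.Table;
>>>>>>>     import org.apache.hadoop.hbase.util.Bytes;
>>>>>>>
>>>>>>>     public class ProvenanceWriter {
>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>             Configuration conf = HBaseConfiguration.create();
>>>>>>>             try (Connection conn = ConnectionFactory.createConnection(conf);
>>>>>>>                  Table table = conn.getTable(TableName.valueOf("provenance"))) {
>>>>>>>                 // Row key: component + timestamp, so NiFi and Spark
>>>>>>>                 // events sort together per component.
>>>>>>>                 Put put = new Put(Bytes.toBytes("Spark:" + System.currentTimeMillis()));
>>>>>>>                 put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("componentType"),
>>>>>>>                         Bytes.toBytes("Spark"));
>>>>>>>                 put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("eventType"),
>>>>>>>                         Bytes.toBytes("SPARK_TRANSFORM"));
>>>>>>>                 put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("details"),
>>>>>>>                         Bytes.toBytes("micro-batch processed"));
>>>>>>>                 table.put(put);
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }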
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrew
>>>>>>>
>>>>>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>>>>>
>>>>>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Andrew.
>>>>>>>>
>>>>>>>> I agree that decoupling components is a good solution from a
>>>>>>>> long-term perspective. My current data pipeline in NiFi is designed
>>>>>>>> for batch processing, which I am trying to convert into a streaming
>>>>>>>> model.
>>>>>>>>
>>>>>>>> One of the processors in the data pipeline invokes a Spark job; once
>>>>>>>> the job finishes, control is returned to the NiFi processor, which in
>>>>>>>> turn generates a provenance event for the job. This provenance event
>>>>>>>> is important for us.
>>>>>>>>
>>>>>>>> Keeping the batch-model architecture in mind, I want to design a
>>>>>>>> Spark Streaming based model in which the NiFi Spark Streaming
>>>>>>>> processor will process a micro-batch and the job status will be
>>>>>>>> returned to NiFi with a provenance event. Then I can capture that
>>>>>>>> provenance data for my reports.
>>>>>>>>
>>>>>>>> Essentially, I will be using NiFi for capturing provenance events,
>>>>>>>> where the actual processing will be done by the Spark Streaming job.
>>>>>>>>
>>>>>>>> Do you see this approach as logical?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Shashi
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Shashi,
>>>>>>>>> I'm sure there is a way to make this work. However, my first
>>>>>>>>> question is why you would want to. By design, a Spark Streaming
>>>>>>>>> application should always be running and consuming data from some
>>>>>>>>> source, hence the notion of streaming. Tying Spark Streaming to NiFi
>>>>>>>>> would ultimately result in a more coupled and fragile architecture.
>>>>>>>>> Perhaps a different way to think about it would be to set things up
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>>>>>
>>>>>>>>> With this you can do what you are doing today -- using NiFi to
>>>>>>>>> ingest, transform, make routing decisions, and feed data into Kafka.
>>>>>>>>> In essence you would be using NiFi to do all the preparation of the
>>>>>>>>> data for Spark Streaming. Kafka would serve the purpose of a buffer
>>>>>>>>> between NiFi and Spark Streaming. Finally, Spark Streaming would
>>>>>>>>> ingest data from Kafka and do what it is designed for -- stream
>>>>>>>>> processing. Having a decoupled architecture like this also allows
>>>>>>>>> you to manage each tier separately; thus you can tune, scale,
>>>>>>>>> develop, and deploy each separately.
>>>>>>>>>
>>>>>>>>> I know I did not directly answer your question on how to make it
>>>>>>>>> work, but hopefully this provides an approach that will be a better
>>>>>>>>> long-term solution. There may be something I am missing in your
>>>>>>>>> initial questions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I am looking for a way to make use of Spark Streaming with NiFi. I
>>>>>>>>>> have seen a couple of posts where a SiteToSite TCP connection is
>>>>>>>>>> used for the Spark Streaming application, but I am thinking it
>>>>>>>>>> would be good if I could launch Spark Streaming from a custom NiFi
>>>>>>>>>> processor.
>>>>>>>>>>
>>>>>>>>>> PublishKafka will publish messages into Kafka, and then a NiFi
>>>>>>>>>> Spark Streaming processor will read from the Kafka topic.
>>>>>>>>>>
>>>>>>>>>> I can launch the Spark Streaming application from a custom NiFi
>>>>>>>>>> processor using the Spark launcher API (sketched below), but the
>>>>>>>>>> biggest challenge is that it will create a Spark streaming context
>>>>>>>>>> for each flow file, which can be a costly operation.
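>>>>>>>>>>
>>>>>>>>>> (For reference, a rough sketch of the launcher call I mean; the jar
>>>>>>>>>> path and main class are placeholders:)
>>>>>>>>>>
>>>>>>>>>>     import org.apache.spark.launcher.SparkLauncher;
>>>>>>>>>>
>>>>>>>>>>     public class LaunchStreamingJob {
>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>             // Doing this once per flow file starts a whole new Spark
>>>>>>>>>>             // application (and streaming context) each time -- the
>>>>>>>>>>             // costly part.
>>>>>>>>>>             Process spark = new SparkLauncher()
>>>>>>>>>>                     .setAppResource("/path/to/streaming-job.jar") // placeholder
>>>>>>>>>>                     .setMainClass("com.example.StreamingJob")     // placeholder
>>>>>>>>>>                     .setMaster("yarn")
>>>>>>>>>>                     .launch();
>>>>>>>>>>             spark.waitFor();
>>>>>>>>>>         }
>>>>>>>>>>     }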
>>>>>>>>>>
>>>>>>>>>> Does anyone have a suggestion for storing the Spark streaming
>>>>>>>>>> context in a controller service, or a better approach for running a
>>>>>>>>>> Spark Streaming application with NiFi?
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>> Shashi
>
> --
> Thanks,
> Andrew
>
> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>
