nifi-users mailing list archives

From Andrew Psaltis <psaltis.and...@gmail.com>
Subject Re: Spark Streaming with Nifi
Date Thu, 08 Jun 2017 19:34:34 GMT
Ah -- sorry if I had confused matters earlier. Feel free to reach out if you
hit a sticking point.

Thanks,
Andrew

On Thu, Jun 8, 2017 at 3:32 PM, Shashi Vishwakarma <shashi.vish123@gmail.com
> wrote:

> Thanks Andrew.
>
> Things are pretty clear now. At a low level, I need to write a piece of
> Java code that creates a JSON structure similar to a NiFi provenance event
> and sends it to another store.
>
> I was under the impression that NiFi had the flexibility to update the
> provenance store via API calls.
>
> Thanks
> Shashi
>
>
> On Thu, Jun 8, 2017 at 12:37 PM, Andrew Psaltis <psaltis.andrew@gmail.com>
> wrote:
>
>> Hi Shashi,
>> At this time you cannot write a provenance event into the NiFi provenance
>> repository, which is stored locally on the node that is processing the data.
>> The repository is internal to NiFi; that is why I was suggesting you create
>> a "Spark Provenance Event" that you write to the same external store, so
>> that you have all the data in one place. The data coming from Spark will,
>> however, certainly be different. More information on the provenance
>> repository usage can be found here [1] and its design here [2].
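>>
>> For illustration only, a record written to that external store might look
>> something like this (all field names are made up for the example, not part
>> of any NiFi API):
>>
>>   {
>>     "eventTime": "2017-06-08T19:30:00Z",
>>     "eventType": "SPARK_PROCESS",
>>     "componentType": "Spark",
>>     "componentId": "my-streaming-job",
>>     "details": "processed one micro-batch from topic nifi-out"
>>   }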
>>
>> Hope that helps.
>>
>> [1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data_provenance
>> [2] https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
>>
>> Thanks,
>> Andrew
>>
>> On Thu, Jun 8, 2017 at 6:50 AM, Shashi Vishwakarma <
>> shashi.vish123@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Regarding creating a Spark provenance event,
>>>
>>> *"let's call it a Spark Provenance Event -- in this you can populate as
>>> much data as you have and write that to a similar data store."*
>>>
>>> Is there any way I can write my Spark provenance event to the NiFi
>>> provenance store with some EventId?
>>>
>>> I have a ReportingTask which sends events to another application, but it
>>> relies on the NiFi provenance store. I am thinking that the Spark job
>>> will emit a provenance event which will be written into the NiFi
>>> provenance store, and the reporting task will then send it on to the
>>> other application.
>>>
>>> Apologies if my use case is still unclear.
>>>
>>> Thanks
>>> Shashi
>>>
>>>
>>>
>>> On Thu, Jun 8, 2017 at 3:23 AM, Andrew Psaltis <psaltis.andrew@gmail.com
>>> > wrote:
>>>
>>>> Hi Shashi,
>>>> Regarding your upgrade question, I may have confused things. The
>>>> "provenance" event emitted from your Spark Streaming job will not be the
>>>> exact same event as the one emitted from NiFi. I was referencing the code
>>>> in the previous email to give insight into the details NiFi does provide.
>>>> In your Spark application you will not have all of the information needed
>>>> to populate a NiFi provenance event. Therefore, for your Spark code you
>>>> can come up with a new event -- let's call it a Spark Provenance Event --
>>>> in this you can populate as much data as you have and write that to a
>>>> similar data store. For example, you would want a timestamp, the
>>>> component can be Spark, and any other data you need to emit. Basically,
>>>> you will be combining the NiFi provenance data with your custom Spark
>>>> provenance data to create a complete picture.
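>>>>
>>>> As a rough sketch, such an event could be a small class you define
>>>> yourself (entirely made up here -- not part of any NiFi or Spark API):
>>>>
>>>>   public class SparkProvenanceEvent {
>>>>       private long eventTime;        // epoch millis of the micro-batch
>>>>       private String componentType;  // e.g. "Spark"
>>>>       private String componentId;    // e.g. the streaming job/app name
>>>>       private String eventType;      // e.g. "SPARK_PROCESS"
>>>>       private String details;        // anything else you need to emit
>>>>       // getters/setters omitted
>>>>   }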
>>>>
>>>> As far as the lineage goes, again your Spark Streaming code will be
>>>> executing outside of NiFi, so you will have to write this into some
>>>> other store, perhaps Atlas, and then you can have the lineage for both
>>>> NiFi and Spark. This [1] is an example NiFi reporting task that sends
>>>> lineage data to Atlas; you could extend this concept to work with Spark
>>>> as well.
>>>>
>>>> Hopefully this helps clarify some things; sorry if my previous email
>>>> was not completely clear.
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Wed, Jun 7, 2017 at 4:31 PM, Shashi Vishwakarma <
>>>> shashi.vish123@gmail.com> wrote:
>>>>
>>>>> Thanks a lot, Andrew. This is just what I was looking for.
>>>>>
>>>>> I have two questions at this point, keeping in mind that I have to
>>>>> generate a provenance event.
>>>>>
>>>>> 1. How will I manage upgrades? What if I generate custom provenance
>>>>> events and the NiFi community makes significant changes to the NiFi
>>>>> provenance structure?
>>>>>
>>>>> 2. How do I get lineage information?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>>
>>>>> On Tue, Jun 6, 2017 at 7:38 PM, Andrew Psaltis <
>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>
>>>>>> Hi Shashi,
>>>>>> Your assumption is correct -- you would want to send a "provenance"
>>>>>> event from your Spark job. You can see the structure of the provenance
>>>>>> events in NiFi here [1].
>>>>>>
>>>>>> Regarding the flow, if you are waiting on the Spark Streaming code to
>>>>>> compute some value before you continue, you could perhaps construct it
>>>>>> this way:
>>>>>>
>>>>>>
>>>>>> [inline image: flow diagram -- NiFi --> Kafka --> Spark Streaming --> Kafka --> NiFi]
>>>>>>
>>>>>> Hopefully that helps to clarify it a little. In essence, if you are
>>>>>> waiting on results from the Spark Streaming computation before
>>>>>> continuing, you would use Kafka for the output results from Spark
>>>>>> Streaming and then consume that in NiFi and carry on with your
>>>>>> processing.
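>>>>>>
>>>>>> If it helps, the output step on the Spark side could look roughly like
>>>>>> this (the topic name and broker address are placeholders, and in a
>>>>>> real job you would reuse a pooled producer rather than creating one
>>>>>> per partition):
>>>>>>
>>>>>>   // imports: org.apache.kafka.clients.producer.*,
>>>>>>   //          org.apache.kafka.common.serialization.StringSerializer,
>>>>>>   //          java.util.Properties
>>>>>>   // results is the JavaDStream<String> your computation produced
>>>>>>   results.foreachRDD(rdd -> rdd.foreachPartition(records -> {
>>>>>>       Properties props = new Properties();
>>>>>>       props.put("bootstrap.servers", "broker:9092");
>>>>>>       props.put("key.serializer", StringSerializer.class.getName());
>>>>>>       props.put("value.serializer", StringSerializer.class.getName());
>>>>>>       try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>>>>>>           while (records.hasNext()) {
>>>>>>               // each result lands on the topic NiFi consumes from
>>>>>>               producer.send(new ProducerRecord<>("spark-results", records.next()));
>>>>>>           }
>>>>>>       }
>>>>>>   }));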
>>>>>>
>>>>>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-client-dto/src/main/java/org/apache/nifi/web/api/dto/provenance/ProvenanceEventDTO.java
>>>>>>
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, Jun 5, 2017 at 9:46 AM, Shashi Vishwakarma <
>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I am trying to understand this in a bit more detail. Essentially, I
>>>>>>> will have to write some custom code in my Spark Streaming job to
>>>>>>> construct a provenance event and send it to some store, like HBase or
>>>>>>> a pub/sub system, to be consumed by others.
>>>>>>>
>>>>>>> Is that correct?
>>>>>>>
>>>>>>> If yes, how do I execute the other processors which are present in
>>>>>>> the pipeline? For example:
>>>>>>>
>>>>>>> NiFi --> Kafka --> Spark Streaming --> Processor 1 --> Processor 2
>>>>>>>
>>>>>>> Thanks
>>>>>>> Shashi
>>>>>>>
>>>>>>> On Mon, Jun 5, 2017 at 12:36 AM, Andrew Psaltis <
>>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Shashi,
>>>>>>>> Thanks for the explanation. I have a better understanding of what
>>>>>>>> you are trying to accomplish. Although Spark Streaming is
>>>>>>>> micro-batch, you would not want to keep launching jobs for each
>>>>>>>> batch. Think of it as the Spark scheduler having a while loop in
>>>>>>>> which it executes your job and then sleeps for X amount of time
>>>>>>>> based on the interval you configure.
>>>>>>>>
>>>>>>>> Perhaps a better way would be to do the following:
>>>>>>>> 1. Use the S2S ProvenanceReportingTask to send provenance
>>>>>>>> information from your NiFi instance to a second instance or cluster.
>>>>>>>> 2. In the second NiFi instance/cluster (the one receiving the
>>>>>>>> provenance data), write the data into, say, HBase or Solr or
>>>>>>>> system X.
>>>>>>>> 3. In your Spark Streaming job, write a "provenance" event into the
>>>>>>>> same data store -- obviously this will not have all the fields that
>>>>>>>> a true NiFi provenance record does, but you can come close (see the
>>>>>>>> sketch below).
>>>>>>>>
>>>>>>>> With this, you would then have all provenance data in an external
>>>>>>>> system that you can query to understand the whole flow.
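>>>>>>>>
>>>>>>>> For step 3, a rough sketch of the Spark side, assuming HBase and
>>>>>>>> made-up table/column names (imports from org.apache.hadoop.hbase.*,
>>>>>>>> its client package, and org.apache.hadoop.hbase.util.Bytes):
>>>>>>>>
>>>>>>>>   Configuration conf = HBaseConfiguration.create();
>>>>>>>>   try (Connection conn = ConnectionFactory.createConnection(conf);
>>>>>>>>        Table table = conn.getTable(TableName.valueOf("provenance"))) {
>>>>>>>>       // row key prefixes Spark events and keeps them time-ordered
>>>>>>>>       Put put = new Put(Bytes.toBytes("spark-" + System.currentTimeMillis()));
>>>>>>>>       put.addColumn(Bytes.toBytes("event"), Bytes.toBytes("componentType"),
>>>>>>>>                     Bytes.toBytes("Spark"));
>>>>>>>>       put.addColumn(Bytes.toBytes("event"), Bytes.toBytes("details"),
>>>>>>>>                     Bytes.toBytes("processed one micro-batch"));
>>>>>>>>       table.put(put);
>>>>>>>>   }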
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> P.S. sorry if this is choppy or not well formed, on mobile.
>>>>>>>>
>>>>>>>> On Sun, Jun 4, 2017 at 17:46 Shashi Vishwakarma <
>>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Andrew.
>>>>>>>>>
>>>>>>>>> I agree that decoupling the components is a good solution from a
>>>>>>>>> long-term perspective. My current data pipeline in NiFi is designed
>>>>>>>>> for batch processing, which I am trying to convert into a streaming
>>>>>>>>> model.
>>>>>>>>>
>>>>>>>>> One of the processors in the data pipeline invokes a Spark job;
>>>>>>>>> once the job finishes, control returns to the NiFi processor, which
>>>>>>>>> in turn generates a provenance event for the job. This provenance
>>>>>>>>> event is important for us.
>>>>>>>>>
>>>>>>>>> Keeping the batch-model architecture in mind, I want to design a
>>>>>>>>> Spark Streaming based model in which a NiFi Spark Streaming
>>>>>>>>> processor will process each micro-batch and the job status will be
>>>>>>>>> returned to NiFi along with a provenance event. Then I can capture
>>>>>>>>> that provenance data for my reports.
>>>>>>>>>
>>>>>>>>> Essentially, I will be using NiFi to capture provenance events
>>>>>>>>> while the actual processing is done by the Spark Streaming job.
>>>>>>>>>
>>>>>>>>> Does this approach seem logical to you?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Shashi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jun 4, 2017 at 3:10 PM, Andrew Psaltis <
>>>>>>>>> psaltis.andrew@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Shashi,
>>>>>>>>>> I'm sure there is a way to make this work. However, my first
>>>>>>>>>> question is: why would you want to? By design, a Spark Streaming
>>>>>>>>>> application should always be running and consuming data from some
>>>>>>>>>> source -- hence the notion of streaming. Tying Spark Streaming to
>>>>>>>>>> NiFi would ultimately result in a more coupled and fragile
>>>>>>>>>> architecture. Perhaps a different way to think about it would be
>>>>>>>>>> to set things up like this:
>>>>>>>>>>
>>>>>>>>>> NiFi --> Kafka <-- Spark Streaming
>>>>>>>>>>
>>>>>>>>>> With this you can do what you are doing today -- using NiFi to
>>>>>>>>>> ingest, transform, make routing decisions, and feed data into
>>>>>>>>>> Kafka. In essence, you would be using NiFi to do all the
>>>>>>>>>> preparation of the data for Spark Streaming. Kafka would serve as
>>>>>>>>>> a buffer between NiFi and Spark Streaming. Finally, Spark
>>>>>>>>>> Streaming would ingest data from Kafka and do what it is designed
>>>>>>>>>> for -- stream processing. Having a decoupled architecture like
>>>>>>>>>> this also allows you to manage each tier separately, so you can
>>>>>>>>>> tune, scale, develop, and deploy each independently.
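>>>>>>>>>>
>>>>>>>>>> For reference, the Spark Streaming side of that picture could be
>>>>>>>>>> sketched like this with the spark-streaming-kafka-0-10
>>>>>>>>>> integration (topic, group id, and batch interval are just
>>>>>>>>>> placeholders):
>>>>>>>>>>
>>>>>>>>>>   // imports: java.util.*, org.apache.spark.SparkConf,
>>>>>>>>>>   //          org.apache.spark.streaming.*,
>>>>>>>>>>   //          org.apache.spark.streaming.api.java.*,
>>>>>>>>>>   //          org.apache.spark.streaming.kafka010.*,
>>>>>>>>>>   //          org.apache.kafka.clients.consumer.ConsumerRecord,
>>>>>>>>>>   //          org.apache.kafka.common.serialization.StringDeserializer
>>>>>>>>>>   SparkConf conf = new SparkConf().setAppName("nifi-fed-stream");
>>>>>>>>>>   JavaStreamingContext jssc =
>>>>>>>>>>       new JavaStreamingContext(conf, Durations.seconds(5));
>>>>>>>>>>
>>>>>>>>>>   Map<String, Object> kafkaParams = new HashMap<>();
>>>>>>>>>>   kafkaParams.put("bootstrap.servers", "broker:9092");
>>>>>>>>>>   kafkaParams.put("key.deserializer", StringDeserializer.class);
>>>>>>>>>>   kafkaParams.put("value.deserializer", StringDeserializer.class);
>>>>>>>>>>   kafkaParams.put("group.id", "spark-consumer");
>>>>>>>>>>
>>>>>>>>>>   JavaInputDStream<ConsumerRecord<String, String>> stream =
>>>>>>>>>>       KafkaUtils.createDirectStream(jssc,
>>>>>>>>>>           LocationStrategies.PreferConsistent(),
>>>>>>>>>>           ConsumerStrategies.<String, String>Subscribe(
>>>>>>>>>>               Collections.singletonList("nifi-out"), kafkaParams));
>>>>>>>>>>
>>>>>>>>>>   stream.map(ConsumerRecord::value).print();  // your processing here
>>>>>>>>>>   jssc.start();
>>>>>>>>>>   jssc.awaitTermination();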
>>>>>>>>>>
>>>>>>>>>> I know I did not directly answer your question on how to make it
>>>>>>>>>> work, but hopefully this provides an approach that will be a
>>>>>>>>>> better long-term solution. There may be something I am missing in
>>>>>>>>>> your initial questions.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 3, 2017 at 10:43 PM, Shashi Vishwakarma <
>>>>>>>>>> shashi.vish123@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I am looking for a way to make use of Spark Streaming with NiFi.
>>>>>>>>>>> I have seen a couple of posts where a site-to-site TCP connection
>>>>>>>>>>> is used for the Spark Streaming application, but I think it would
>>>>>>>>>>> be good if I could launch Spark Streaming from a NiFi custom
>>>>>>>>>>> processor.
>>>>>>>>>>>
>>>>>>>>>>> PublishKafka would publish messages into Kafka, and then a NiFi
>>>>>>>>>>> Spark Streaming processor would read from the Kafka topic.
>>>>>>>>>>>
>>>>>>>>>>> I can launch a Spark Streaming application from a custom NiFi
>>>>>>>>>>> processor using the Spark launcher API, but the biggest challenge
>>>>>>>>>>> is that it would create a Spark streaming context for each flow
>>>>>>>>>>> file, which can be a costly operation.
>>>>>>>>>>>
>>>>>>>>>>> Would anyone suggest storing the Spark streaming context in a
>>>>>>>>>>> controller service? Or is there a better approach for running a
>>>>>>>>>>> Spark Streaming application with NiFi?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Shashi
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks,
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Andrew
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Andrew
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Andrew
>>
>
>


-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
