nifi-users mailing list archives

From Vitaly Krivoy <Vitaly_Kri...@jhancock.com>
Subject RE: Ingestion from databases: pure NiFi vs Kylo with Sqoop
Date Mon, 06 Aug 2018 16:11:45 GMT
Boris,

Thank you for your feedback.
I think I can now answer my own question after digging a little deeper into Kylo's documentation.
While you probably don't have to deploy your infrastructure this way, the Kylo documentation
implies that NiFi is deployed on the edge node. Given the recommended reliance for data ingestion
on Kylo's own GetTableData custom processor, instead of a combination of the GenerateTableFetch
and QueryDatabaseTable processors deployed on a stand-alone NiFi node and a NiFi cluster,
this guarantees inferior performance relative to Spark-driven, Sqoop-based ingestion.
Why the documentation doesn't discuss the relative advantages of the different possible topologies
is another matter.

From: Boris Tyukin <boris@boristyukin.com>
Sent: Saturday, August 04, 2018 5:06 PM
To: users@nifi.apache.org
Subject: Re: Ingestion from databases: pure NiFi vs Kylo with Sqoop

Vitaly,

The best way is to try it yourself and build a simple process to prove your case.

I got excited about Kylo at first, but quickly realized I could do everything I needed with NiFi.
I did not really care about Kylo's fancy UI, but I did love a lot of things - the integration
with Spark and Sqoop, templates for pipelines, centralized monitoring, etc. But at the same
time, it is someone else's product, lagging behind NiFi, with tons of other dependencies and
packages, built by that company.

I do believe you don't have to use Sqoop if you don't want it - you can build your own templates
in Kylo, which would be just a NiFi flow with parameters, and use JDBC SQL processors instead.

Now, you will be missing a lot of Sqoop's cool features. One example is direct database
connectors (Oracle, for example), which give much better performance. Another is handling of
changing time zones, etc.

Until recently, NiFi could not ingest a table concurrently - with Sqoop I can run 32 mappers
and it will break a table into 32 pieces and ingest them to HDFS in parallel.
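That split step can be sketched roughly like this (a simplified illustration with made-up numbers, not Sqoop's actual code): Sqoop takes the min and max of the --split-by column and divides that range into one contiguous slice per mapper, each slice becoming one mapper's WHERE clause.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide [lo, hi] into num_mappers contiguous slices, roughly the way
    Sqoop partitions an integer --split-by column across its mappers."""
    step = (hi - lo + 1) / num_mappers
    bounds = [lo + round(i * step) for i in range(num_mappers)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]

# Each slice would drive one mapper's query, e.g.
# SELECT * FROM orders WHERE id >= 1 AND id <= 31250
ranges = split_ranges(1, 1_000_000, 32)
print(len(ranges))               # 32
print(ranges[0], ranges[-1])     # (1, 31250) (968751, 1000000)
```

The slices are non-overlapping and cover the whole key range, so the 32 mappers together read the full table exactly once.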

NiFi has a similar ability now, but I think until NiFi 1.6 you had to use primary keys or
something like that. I think this has been improved recently, and the GenerateTableFetch processor
can do a lot, like breaking a table into pieces and also supporting incremental loads.
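GenerateTableFetch is configured in the NiFi UI rather than in code, but the paged SQL it emits downstream looks roughly like this (a hypothetical sketch with made-up table and column names, not NiFi's exact output):

```python
def page_queries(table, order_col, row_count, partition_size):
    """Generate paged SELECT statements in the general style of NiFi's
    GenerateTableFetch processor (a simplified illustration only)."""
    queries = []
    for offset in range(0, row_count, partition_size):
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"LIMIT {partition_size} OFFSET {offset}"
        )
    return queries

# A 100k-row table with a 25k partition size yields 4 page queries,
# which downstream ExecuteSQL instances can run concurrently.
for q in page_queries("orders", "id", 100_000, 25_000):
    print(q)
```

Each generated page can be routed to a different node in a NiFi cluster, which is what gives the concurrent-ingest behavior discussed above.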

Speaking of incrementals, I also wanted to build my own framework around incremental loads,
with my own control table, auditing, and logging. I did not use Sqoop's incremental load feature,
but some devs love it.
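A minimal sketch of that control-table pattern (hypothetical table and column names; SQLite is used only to keep the example self-contained) - a control table stores the high-water mark per source table, and each load fetches only rows past it, then advances it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT);
    INSERT INTO orders VALUES (1, '2018-08-01'), (2, '2018-08-03'), (3, '2018-08-05');
    -- control table: one high-water mark per source table
    CREATE TABLE load_control (table_name TEXT PRIMARY KEY, last_loaded TEXT);
    INSERT INTO load_control VALUES ('orders', '2018-08-02');
""")

def incremental_load(con, table):
    """Fetch only rows newer than the stored watermark, then advance it."""
    (watermark,) = con.execute(
        "SELECT last_loaded FROM load_control WHERE table_name = ?", (table,)
    ).fetchone()
    rows = con.execute(
        f"SELECT * FROM {table} WHERE updated_at > ? ORDER BY id", (watermark,)
    ).fetchall()
    if rows:
        new_mark = max(r[1] for r in rows)
        con.execute(
            "UPDATE load_control SET last_loaded = ? WHERE table_name = ?",
            (new_mark, table),
        )
    return rows

print(incremental_load(con, "orders"))   # rows 2 and 3 only (after the watermark)
print(incremental_load(con, "orders"))   # [] - watermark has advanced
```

Auditing and logging would hang off the same control table (e.g. a row count and timestamp per run), which is the part Sqoop's built-in incremental mode does not give you.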

So if you do not care about all the cool Sqoop features and its high performance, and just
need to ingest data, you will be fine using the NiFi processors.


Boris

On Fri, Aug 3, 2018, 15:28 Vitaly Krivoy <Vitaly_Krivoy@jhancock.com> wrote:
We are considering using Kylo on top of NiFi. It is my understanding that while Kylo manages
both NiFi and Spark, its designers decided to utilize Sqoop from Spark in order to ingest
data from relational databases. I am also aware that it is possible to drive Sqoop from
NiFi using one of the processors which can run scripts. Why would the Kylo designers rely on Sqoop
rather than on NiFi? It's possible to set up a stand-alone NiFi instance and a NiFi cluster
to do parallel database access. Sqoop achieves parallel extraction from databases by
relying on the power of MapReduce. We are a Hortonworks on Azure shop, so we already have infrastructure
for both approaches. Does anyone have any feedback on why one approach would be preferable to
the other?

STATEMENT OF CONFIDENTIALITY The information contained in this email message and any attachments
may be confidential and legally privileged and is intended for the use of the addressee(s)
only. If you are not an intended recipient, please: (1) notify me immediately by replying
to this message; (2) do not use, disseminate, distribute or reproduce any part of the message
or any attachment; and (3) destroy all copies of this message and any attachments.
