nifi-users mailing list archives

From Provenzano Nicolas <nicolas.provenz...@gfi.fr>
Subject RE: Nifi vs Sqoop
Date Thu, 10 Nov 2016 16:07:51 GMT
Hi Matt, 

It fully answers my question.

Thanks and regards,

Nicolas

-----Original Message-----
From: Matt Burgess [mailto:mattyb149@apache.org]
Sent: Thursday, November 10, 2016 15:32
To: users@nifi.apache.org
Subject: Re: Nifi vs Sqoop

Nicolas,

The Max Value Columns property of QueryDatabaseTable is what tells the processor to fetch
only the new rows. In your case you would put "lastmodificationdate" as the Max Value Column.
The first time the processor is triggered, it will execute "SELECT * FROM myTable" and fetch
all the rows (as it does not yet know about "new" vs "old" rows), keeping track of the maximum
value observed for the Max Value Column. The next time the processor is triggered, it will
execute "SELECT * FROM myTable WHERE lastModificationDate > the_max_value_seen_so_far", so
only rows whose value for that column is greater than the current maximum are returned. The
maximum is then updated again, and so on.
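
To make that concrete, here is a sketch of the statements the processor issues in your
scenario (a paraphrase of the behavior above, not the processor's literal output; quoting
and timestamp formats depend on your database, and the example date is made up):

    -- First trigger: no state yet, so a full fetch
    SELECT * FROM myTable

    -- Suppose the largest lastModificationDate returned was '2016-11-09 10:00:00'.
    -- Every subsequent trigger then issues:
    SELECT * FROM myTable WHERE lastModificationDate > '2016-11-09 10:00:00'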

Does this answer your question (about QueryDatabaseTable)? If not, please let me know.

If your source table is large and/or you'd like to parallelize the fetching of rows from the
table, consider the GenerateTableFetch processor [1] instead. Rather than _executing_ SQL
like QueryDatabaseTable does, GenerateTableFetch _generates_ SQL: it emits a number of flow
files, each containing a SQL statement that grabs a page of X rows from the table. If you
supply a Max Value Column here, it too will perform incremental fetches after the initial
full one. These flow files can be distributed throughout your cluster (using a
RemoteProcessGroup pointing back to the same cluster, and an Input Port to receive the flow
files), creating a parallel distributed fetch capability like Sqoop's. From a scaling
perspective, Sqoop uses MapReduce, so it scales with the size of your Hadoop cluster;
GenerateTableFetch scales with the size of your NiFi cluster. You might choose NiFi or Sqoop
based on the volume and velocity of your data.
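
As a rough illustration, assuming a generic database type and a hypothetical Partition Size
of 10000, the generated flow files would each carry one statement along these lines (the
exact paging syntax, e.g. LIMIT/OFFSET, depends on the Database Type you select):

    SELECT * FROM myTable ORDER BY lastModificationDate LIMIT 10000 OFFSET 0
    SELECT * FROM myTable ORDER BY lastModificationDate LIMIT 10000 OFFSET 10000
    SELECT * FROM myTable ORDER BY lastModificationDate LIMIT 10000 OFFSET 20000

Each statement can then be executed by an ExecuteSQL processor on whichever node the flow
file lands, which is what gives you the parallel fetch.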

Regards,
Matt

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.GenerateTableFetch/index.html

On Wed, Nov 9, 2016 at 4:37 AM, Provenzano Nicolas <nicolas.provenzano@gfi.fr> wrote:
> Hi all,
>
> I have the following requirements:
>
> · I need to load a full SQL table at day 1,
> · and then incrementally load new data (using a change data capture mechanism).
>
> Initially, I was thinking of using Sqoop to do it.
>
> Looking at Nifi and especially the QueryDatabaseTable processor, I'm
> wondering if I could use Nifi instead.
>
> Has someone already compared the two for this, and what were the outcomes?
>
> I can't see, however, how to configure QueryDatabaseTable to handle
> the new rows (for example, looking at a "lastmodificationdate" field
> and taking only the rows for which lastModificationDate > lastRequestDate)?
>
> Thanks in advance
>
> BR
>
> Nicolas