nifi-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vos, Walter" <walter....@ns.nl>
Subject RE: List/Fetch pattern for QueryDatabaseTable
Date Fri, 17 Aug 2018 08:21:04 GMT
Hi Matt,

Thanks for that very thorough explanation. I'll be sure to present this to my dev's in order
to come to a way of working. When it comes to ListDatabaseTables, how can I use this when
I don't want to query all tables from the database? Let's say there's some 30 tables in the
database and I only want to query 8 of them. I realize there's a parameter called 'Table Name
Pattern' in the processor, but how do I put together a pattern when the tables I'm targeting
have very different names? Also, it seems a bit silly to open a database connections just
to list the specific tables I'm looking to query (which I already know the names of). Is there
a processor I can use where I can just fill out this list myself?

Perhaps these are very basic questions, by the way. Everybody here is learning how to work
with NiFi and from an admin perspective I'm trying to provide some guiding on best practices
for common flows.

Kind regards,

Walter

-----Oorspronkelijk bericht-----
Van: Matt Burgess [mailto:mattyb149@apache.org]
Verzonden: donderdag 16 augustus 2018 15:55
Aan: users@nifi.apache.org
Onderwerp: Re: List/Fetch pattern for QueryDatabaseTable

Walter,

If you're looking to distribute database fetching among a cluster, then GenerateTableFetch
is the right choice (over QueryDatabaseTable).
As of NiFi 1.2.0 (via NIFI-2881 [1]), GenerateTableFetch accepts incoming flow files, the
capability was added in response to exactly the use case you outlined, distributed fetch of
multiple tables via ListDatabaseTables. You still want your source processor to run on the
Primary Node Only, otherwise all nodes get the same source data and as you said, you end up
with duplicate data.

QueryDatabaseTable does not accept incoming connections, but you can use ExecuteSQL to actually
do the fetching.  To distributed fetching of tables among the cluster, I recommend the following
flow:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> GenerateTableFetch
-> ExecuteSQL

Each node on the cluster will get flow files for the various tables in the database, then
GenerateTableFetch will generate the SQL to fetch "pages" based on the Partition Size property,
then ExecuteSQL will execute the statements. You can use multiple concurrent tasks for ExecuteSQL
to perform the fetching concurrently for the SQL statements, and the RPG->Input port part
will let you do the tables in parallel.  If instead you want to fully distribute the SQL execution
(fetching various pages from various tables), you could move the
RPG->Input Port after GTF:

ListDatabaseTables (on Primary Node only) -> GenerateTableFetch -> RPG
-> Input Port -> ExecuteSQL

In this flow, the Primary Node will do all the work of generating all the SQL for all the
pages, then will distributed the SQL among the cluster. So each node may be grabbing different
pages from the same table, etc.  Depending on how much work it takes to generate the SQL,
this may not be as performant as the first flow.  Alternatively you can distribute the SQL
generation and the execution:

ListDatabaseTables (on Primary Node only) -> RPG -> Input Port -> GenerateTableFetch
-> RPG -> Input Port -> ExecuteSQL

This might be overkill but does "fully" parallelize the work. In addition, as mentioned, you
can set multiple concurrent tasks for ExecuteSQL (but not GenerateTableFetch) to achieve concurrency
for fetching. One thing to watch out for in all cases is the Max Connections property for
the DBCPConnectionPool. Each node will get its own pool, but depending on how much is going
through GTF and ExecuteSQL, you may run out of connections (which will slow your
throughput) or if Max Connections is high, you may exhaust all connections at the server,
just something to keep in mind when configuring the flow.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-2881
On Thu, Aug 16, 2018 at 7:55 AM Vos, Walter <walter.vos@ns.nl> wrote:
>
> Hi,
>
>
>
> I’m trying to find a good strategy for distributing work among a cluster when we’re
fetching data from a database. My developers are currently doing GenerateTableFetch and executing
it only on the primary node because “otherwise we end up with duplicate data”. A little
googling on my end and I found out about the List/Fetch pattern. All the examples are for
SFTP though.
>
>
>
> I’m wondering what a good configuration might be if you’re looking to use this pattern
for fetching from a database. I’ve found GenerateTableFetch, and I can certainly use this,
but since we’re querying multiple tables (but not all tables in the DB!) I’m hoping to
use something like ListDatabaseTables before that, so that GenerateTableFetch can be done
on the whole cluster and then QueryDatabaseTable as well.
>
>
>
> So one option is Multiple GenerateTableFetch processors > Funnel > RPG // Input
port > QueryDatabaseTable. I’m wondering if there’s also a good way to go this route:
ListDatabaseTables > RPG // Input port > GenerateTableFetch > QueryDatabaseTable.
I want to distribute as much work as possible within the cluster.
>
>
>
> Kind regards,
>
>
>
> Walter
>
>
> ________________________________
>
> Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik door)
de geadresseerde. De e-mail kan persoonlijke of vertrouwelijke informatie bevatten. Openbaarmaking,
vermenigvuldiging, verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en eventuele
bijlagen) aan derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde
bent, wordt u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de hoogte
te brengen en de e-mail (en eventuele bijlagen) te vernietigen.
>
> Informatie vennootschap

________________________________

Deze e-mail, inclusief eventuele bijlagen, is uitsluitend bestemd voor (gebruik door) de geadresseerde.
De e-mail kan persoonlijke of vertrouwelijke informatie bevatten. Openbaarmaking, vermenigvuldiging,
verspreiding en/of verstrekking van (de inhoud van) deze e-mail (en eventuele bijlagen) aan
derden is uitdrukkelijk niet toegestaan. Indien u niet de bedoelde geadresseerde bent, wordt
u vriendelijk verzocht degene die de e-mail verzond hiervan direct op de hoogte te brengen
en de e-mail (en eventuele bijlagen) te vernietigen.

Informatie vennootschap<http://www.ns.nl/emaildisclaimer>
Mime
View raw message