nifi-users mailing list archives

From Andy LoPresto <alopre...@apache.org>
Subject Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Date Mon, 24 Oct 2016 18:45:33 GMT
Excellent information here. Thanks Lee and Peter.

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 24, 2016, at 6:57 AM, Peter Wicks (pwicks) <pwicks@micron.com> wrote:
> 
> Prabhu,
> 
> Lee mentioned making sure you have good indexes, but I would caution you on this point.
> If you have a unique constraint then SQL Server will build an index on it automatically,
> but I would suggest dropping all other indexes that aren't related to data integrity. Each
> time SQL Server updates a column that is indexed, it has to update that index as well, which
> adds a lot of overhead.
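> 
> For example (hypothetical index and table names), each nonessential index you drop is one
> less structure SQL Server has to maintain on every insert:
> 
>     DROP INDEX [IX_MyTable_SomeColumn] ON [dbo].[MyTable];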
> 
> You might be thinking that you need these indexes for user queries, though. To work around
> this I often see the use of a staging table. This table has no indexes beyond the absolute
> minimum to ensure data integrity, and sometimes even these are removed and data integrity/duplicate
> removal is handled through SQL or a stored procedure. A periodic job moves all data from this
> staging table into a final table. If you execute the copy and the truncate in a single
> transaction, it allows you to do this safely:
> 
> BEGIN TRANSACTION;
> INSERT INTO [Some_Final_Table] SELECT * FROM [Staging_Table_With_Exact_Same_schema];
> TRUNCATE TABLE [Staging_Table_With_Exact_Same_schema];
> COMMIT TRANSACTION;
> 
> If you do it this way you can keep the indexes you need for user access while still allowing
> maximum data throughput to SQL Server.
> 
> I’ve seen a lot of comments online about batch sizes around 500 being optimal, but of course
> this will vary with the system configuration of both your NiFi server and the SQL Server.
> I have had issues getting good performance out of PutSQL even with the above. I don’t think
> this is the fault of the processor, but more due to the volume of data and JDBC batch row
> processing not really being designed for this kind of volume. In my case I was trying to push
> about 10M rows over a longer time period, but was still running into trouble. After working
> on the issue for a while I found that a database-specific loader was needed. I am loading to
> Teradata, so I wrote up a Teradata FastLoad processor. In your case the MS SQL Server JDBC
> Driver includes a `SQLServerBulkCopy` loader: https://msdn.microsoft.com/en-us/library/mt221490%28v=sql.110%29.aspx.
> Unfortunately, this would require writing code, either through a scripted processor or as a
> whole new processor.
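> 
> If you do go that route, a rough sketch of what such code might look like (hypothetical file,
> table, and connection details; assumes a recent mssql-jdbc driver, which ships
> `SQLServerBulkCopy` and `SQLServerBulkCSVFileRecord`):
> 
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Types;
>     import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
>     import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;
>     import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
> 
>     public class BulkLoadToStaging {
>         public static void main(String[] args) throws Exception {
>             try (Connection conn = DriverManager.getConnection(
>                     "jdbc:sqlserver://localhost;databaseName=MyDb;user=myUser;password=myPassword")) {
>                 // Read the pipe-delimited .dat file; false = no header row.
>                 SQLServerBulkCSVFileRecord source =
>                         new SQLServerBulkCSVFileRecord("input.dat", "UTF-8", "|", false);
>                 // Column positions and types must match the staging table.
>                 source.addColumnMetadata(1, "data1", Types.INTEGER, 0, 0);
>                 source.addColumnMetadata(2, "data2", Types.INTEGER, 0, 0);
>                 source.addColumnMetadata(3, "data3", Types.INTEGER, 0, 0);
>                 source.addColumnMetadata(4, "data4", Types.INTEGER, 0, 0);
> 
>                 SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
>                 options.setBatchSize(5000); // tune per the batch-size testing above
> 
>                 try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
>                     bulkCopy.setBulkCopyOptions(options);
>                     bulkCopy.setDestinationTableName("dbo.Staging_Table");
>                     bulkCopy.writeToServer(source); // streams all rows in one bulk operation
>                 }
>             }
>         }
>     }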
> 
> Since writing a custom processor may be more than you want to jump into right now, you should
> probably take a look at `bcp`. I didn’t catch whether you are on Windows or a Unix platform,
> but if you are on Windows I’d check out the command-line Bulk Copy Program for MS SQL:
> https://msdn.microsoft.com/en-us/library/ms162802.aspx. Using it would allow you to prepare
> your data into an intermediary format, like CSV, first and then send it all at once through
> `bcp`.
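> 
> A hypothetical invocation (database, table, file, and server names are all placeholders):
> 
>     bcp MyDb.dbo.Staging_Table in C:\data\input.dat -S localhost -T -c -t "|" -b 5000
> 
> Here -c selects character mode, -t sets the field terminator, -b sets the batch size, and -T
> uses Windows (trusted) authentication.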
> 
> 
> Thanks,
>   Peter Wicks
> 
> From: Lee Laim [mailto:lee.laim@gmail.com]
> Sent: Monday, October 24, 2016 7:17 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
> 
> Hello Prabhu,
> 
> 50 minutes is a good start! Now we have to determine where the next bottleneck is - check
> to see where the flow files are queueing. You can also check the "average task duration"
> statistic for each processor. I suspect the bottleneck is at PutSQL and will carry this
> assumption forward.
> 
> There are several knobs you can adjust at the assumed PutSQL bottleneck:
> 1. Increase the run duration so the PutSQL processor keeps the thread for 2 seconds before
> releasing it.
> 2. Set Support Fragmented Transactions to false. This removes constraints that take time to
> check.
> 3. Change the batch size systematically and observe the throughput changes. I'd move up in
> increments of 100.
> 4*. Increase the number of concurrent tasks for the bottleneck processor to 3 or higher.
> Increase it systematically and observe whether you get more flow files through. You can
> increase the max timer driven threads of the NiFi instance in the NiFi Flow Settings (top
> right of the canvas). You can set the max to 25, but you are truly limited by hardware here.
> Consider a more powerful system to manage this flow, especially with the time constraint you
> have; it is often easier to throw more hardware at the problem than to debug.
> Other areas:
> 5. On the output of the last SplitText processor, set a back pressure object threshold of 1.
> This will slow (temporarily stop) the first SplitText processor and reduce the number of
> overall flow files to manage. It also reduces NiFi's demand on the CPU threads.
> 6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk access (see the
> snippet just after this list).
> 7. Check connection/load on the SQL server.
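> 
> As a sketch for item 6, in conf/nifi.properties (20000 is the default; the larger value is
> only an illustration - size it to your available heap):
> 
>     # keep more flow files in memory per queue before swapping to disk
>     nifi.queue.swap.threshold=50000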
> 
> To address your queries, I used the same expression you provided: (.+)[|](.+)[|](.+)[|](.+)
> You could use an ExecuteStreamCommand processor to 'extract text', but in this case, with
> small flow files, it won't offer much gain.
> 
> *With an i5 processor, you have 4 CPU threads to process flow files, manage NiFi, read/write
> to disk, and handle all other non-NiFi processes. Moving to an i7 or Xeon with hyper-threading
> will give NiFi more resources to really get work done. While clustering is great for
> increasing throughput, I wouldn't recommend clustering a set of smaller i5 systems, as there
> is added communication overhead that must be handled by the CPU. It can be done, but there
> are easier ways to increase throughput at this point in development.
> 
> Hope this helps. Also, if there is anything I stated that is contrary to what others have
> observed, please chime in.
> 
> Thanks,
> Lee
> 
> 
> On Thu, Oct 20, 2016 at 6:02 PM, prabhu Mahendran <prabhuu161994@gmail.com> wrote:
> Lee,
> 
> I have tried your suggested flow, which is able to insert the data into SQL Server in
> 50 minutes, but that still takes a long time.
> 
> ==> Your query: You might be processing the entire dat file (instead of a single row) for
> each record.
> 
>   How can I process the entire dat file into SQL Server?
> 
> ==> Your query: Without any new optimizations you'll need ~25 threads and sufficient memory
> to feed the threads.
> 
>   My processors run with only 10 threads even after setting concurrent tasks; how do I
> increase this to 25 threads?
> 
> If you ran a quick test, please share the regex you used.
> 
> Is there any other processor with functionality like ExtractText?
> 
> Thanks
> 
> 
> 
> On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim <lee.laim@gmail.com> wrote:
> Prabhu,
> 
> In order to move 3M rows in 10 minutes, you'll need to process 5000 rows/second.
> During your 4 hour run, you were processing ~200 rows/second.
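> (That is, 3,000,000 rows / 600 seconds = 5,000 rows/second, while 3,000,000 rows / 14,400
> seconds ≈ 200 rows/second.)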
> 
> Without any new optimizations you'll need ~25 threads and sufficient memory to feed the
> threads. I agree with Mark; you should be able to get far more than 200 rows/second.
> 
> I ran a quick test using your ExtractText regex on similar data and was able to process over
> 100,000 rows/minute through the ExtractText processor. The input data was a single row of 4
> fields delimited by the "|" symbol.
> 
> You might be processing the entire dat file (instead of a single row) for each record.
> Can you check the flow file attributes and content going into ExtractText?
> 
> 
> Here is the flow with some notes:
> 
> 1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10 bytes)
> 
> 2. SplitText -> SplitText (to break the 3M rows down to manageable chunks of 10,000 lines
> per flow file, then split again to 1 line per flow file)
> 
> 3. ExtractText to extract the 4 fields
> 
> 4. ReplaceText to generate JSON (you can alternatively use AttributesToJSON here)
> 
> 5. ConvertJSONtoSQL
> 
> 6. PutSQL (this should be the true bottleneck; index the DB well and use many threads)
> 
> If my assumptions are incorrect, please let me know.
> 
> Thanks,
> Lee
> 
> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <Kevin.Verhoeven@ds-iq.com> wrote:
> I’m not clear on how much data you are processing; does the data (.dat) file have 3,00,000
> (i.e., 300,000) rows?
> 
> Kevin
> 
> From: prabhu Mahendran [mailto:prabhuu161994@gmail.com]
> Sent: Wednesday, October 19, 2016 2:05 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
> 
> Mark,
> 
> Thanks for the response.
> 
> My sample input data (.dat) is like below:
> 
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
> 
> In ExtractText, I have added only the inputrow property in addition to the default
> properties, as in the screenshot below.
> 
> <image001.png>
> In ReplaceText, I just replace the value with:
> 
> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
> <image002.png>
> 
> 
> There are no bulletins indicating back pressure on the processors.
> 
> Can I know the prerequisites needed to move the 3,00,000 (300,000) rows into SQL Server
> within 10-20 minutes?
> How many CPUs are needed?
> How much heap and permgen do we need to set to move that data into SQL Server?
> 
> Thanks
> 
> 
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <markap14@hotmail.com> wrote:
> Prabhu,
> 
> Thanks for the details. All of this seems fairly normal. Given that you have only a single
> core, I don't think multiple concurrent tasks will help you. Can you share your configuration
> for ExtractText and ReplaceText? Depending on the regexes being used, they can be extremely
> expensive to evaluate. The regex that you mentioned in the other email -
> "(.+)[|](.+)[|](.+)[|](.+)" - is in fact extremely expensive. Any time that you have ".*" or
> ".+" in your regex, it is going to be extremely expensive, especially with longer FlowFile
> content.
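> 
> A cheaper pattern for pipe-delimited fields (a suggested alternative) swaps each ".+" for a
> negated character class:
> 
>     ([^|]+)[|]([^|]+)[|]([^|]+)[|]([^|]+)
> 
> Because "[^|]+" can never consume a "|", the engine cannot backtrack across field boundaries,
> so matching stays roughly linear in the length of the line.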
> 
> Also, do you see any bulletins indicating that the provenance repository is applying
> backpressure? Given that you are splitting your FlowFiles into individual lines, the
> provenance repository may be under a lot of pressure.
> 
> Another thing to check is how much garbage collection is occurring. This can certainly
> destroy your performance quickly. You can get this information by going to the "Summary
> Table" in the top-right of the UI and then clicking the "System Diagnostics" link in the
> bottom-right corner of that Summary Table.
> 
> Thanks
> -Mark
> 
> 
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <prabhuu161994@gmail.com> wrote:
> 
> Mark,
> 
> Thanks for your response.
> 
> Please find my responses to your questions below.
> 
> ==> The first processor that you see that exhibits poor performance is ExtractText, correct?
>                              Yes, ExtractText exhibits poor performance.
> 
> ==> How big is your Java heap?
>                             I have set 1 GB for the Java heap.
> 
> ==> Do you have back pressure configured on the connection between ExtractText and ReplaceText?
>                            There is no back pressure configured between ExtractText and ReplaceText.
> 
> ==> When you say that you specify concurrent tasks, what are you configuring the concurrent
> tasks to be?
>                           I have set concurrent tasks to 2 for the ExtractText processor due
> to its slower processing rate. This is specified in the Concurrent Tasks text box.
> 
> ==> Have you changed the maximum number of concurrent tasks available to your dataflow?
>                          No, I haven't changed it.
> 
> ==> How many CPUs are available on this machine?
>                         Only a single CPU is available in this machine: a Core i5 @ 2.20GHz.
> 
> ==> Are these the only processors in your flow, or do you have other dataflows going on in
> the same instance as NiFi?
>                        Yes, these are the only processors in the running workflow, and no
> other dataflows are running.
> 
> Thanks
> 
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <markap14@hotmail.com> wrote:
> Prabhu,
> 
> Certainly, the performance that you are seeing, taking 4-5 hours to move 3M rows into
> SQL Server, is far from ideal, but the good news is that it is also far from typical. You
> should be able to see far better results.
> 
> To help us understand what is limiting the performance, and to make sure that we understand
> what you are seeing, I have a series of questions that would help us to understand what is
> going on.
> 
> The first processor that you see that exhibits poor performance is ExtractText, correct?
> Can you share the configuration that you have for that processor?
> 
> How big is your Java heap? This is configured in conf/bootstrap.conf; by default it is
> configured as:
> java.arg.2=-Xms512m
> java.arg.3=-Xmx512m
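> 
> If the machine has memory to spare, both values can be raised; for example (sizes here are
> illustrative, not a recommendation):
> 
>     java.arg.2=-Xms2g
>     java.arg.3=-Xmx2g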
> 
> Do you have backpressure configured on the connection between ExtractText and ReplaceText?
> 
> Also, when you say that you specify concurrent tasks, what are you configuring the concurrent
> tasks to be? Have you changed the maximum number of concurrent tasks available to your
> dataflow? By default, NiFi will use only 10 threads max. How many CPUs are available on this
> machine?
> 
> And finally, are these the only processors in your flow, or do you have other dataflows
> going on in the same instance as NiFi?
> 
> Thanks
> -Mark
> 
> 
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <prabhuu161994@gmail.com> wrote:
> 
> Hi All,
> 
> I have tried to perform the below operation.
> 
> dat file(input)-->JSON-->SQL-->SQLServer
> 
> 
> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL.
> 
> My input file (.dat) has 3,00,000 (300,000) rows.
> 
> Objective: Move the data from the '.dat' file into SQL Server.
> 
> I am able to store the data in SQL Server using the combination of processors above, but it
> takes almost 4-5 hrs to move the complete data into SQL Server.
> 
> The combination of SplitTexts reads the data quickly, but ExtractText takes a long time to
> match the given data against the user-defined expression. Even when the input is 107 MB, it
> sends outputs only in KB sizes, and the ReplaceText processor likewise processes data only
> in KB sizes.
> 
> This slow processing is what makes it take so long to get the data into SQL Server.
> 
> 
> The ExtractText, ReplaceText, and ConvertJSONToSQL processors send outgoing flow files in
> kilobytes only.
> 
> If I specify concurrent tasks for ExtractText, ReplaceText, and ConvertJSONToSQL, they drive
> CPU and disk usage to 100%.
> 
> It is just 30 MB of data, but the processors take 6 hrs to move it into SQL Server.
> 
> The problems faced are:
> 
>        Almost 6 hrs are taken to move the 3 lakh (300,000) rows into SQL Server.
>        ExtractText and ReplaceText take a long time to process the data (they send output
> flow files in KB sizes only).
> Can anyone help me solve the below requirement?
> 
> I need to reduce the time taken by the processors to move the lakhs of rows into SQL Server.
> 
> 
> 
> If I have done anything wrong, please help me to do it right.
> 

