nifi-users mailing list archives

From Lee Laim <lee.l...@gmail.com>
Subject Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?
Date Wed, 26 Oct 2016 13:58:06 GMT
Prabhu,

I attached a template to illustrate using 4 ExecuteStreamCommand(ESC)
processors to wrap 'sed'.  In this case, ESC does not appear to have more
throughput than ExtractText.

On an i7 laptop with an SSD and 16 GB of RAM, I was able to process
950,000 records/5 min through ExtractText (~100 MB of records), and
70,000 records/5 min through the ESC processors.

Under very heavy load, I did receive warnings that the
flow would slow down because the provenance repo was not keeping up.

I also checked concurrent tasks, as Kevin mentioned, and monitored the
processor/disk/RAM/JVM metrics closely.

Thanks,
Lee


On Tue, Oct 25, 2016 at 10:05 AM, Kevin Verhoeven <Kevin.Verhoeven@ds-iq.com> wrote:

> Prabhu,
>
>
>
> How many Concurrent Tasks have you set on ExtractText? You might try
> increasing Concurrent Tasks from 1 to 4, for example, and see if throughput
> improves. Depending on how much you increase it, you may have to decrease
> concurrency somewhere else.
>
>
>
> Kevin
>
>
>
> *From:* prabhu Mahendran [mailto:prabhuu161994@gmail.com]
> *Sent:* Tuesday, October 25, 2016 2:22 AM
>
> *To:* users@nifi.apache.org
> *Subject:* Re: How to increase the processing speed of the ExtractText
> and ReplaceText Processor?
>
>
>
> Hi All,
>
> Thanks for your information.
>
> I have just found that the bottleneck is in the ExtractText processor only, not
> in PutSQL, which works fine.
>
> Please look at the screenshot below: ExtractText passes only 40-50 KB of the
> queue at a time, so it takes a long time for 23 MB of data.
> [image: Inline image 1]
>
> I have tried a NiFi cluster (one master and two slaves); it also takes 30
> minutes to load 30 MB of data into SQL Server.
>
> PutSQL is able to move the data quickly; only ExtractText takes a long time.
>
> In ExtractText I just give a regex, but ExecuteStreamCommand needs a command
> path and delimiter. How can I give a regex in the ExecuteStreamCommand
> processor?
> And one more thing: I have to extract the data according to a
> regex (e.g. input: (.+)|(.+)|(.+)|(.+)) in the ReplaceText processor (e.g.
> input.1, input.2, input.3).
>
> I am able to perform bulk insertion in SQL Server by using GetFile (bulk
> insert query) into PutSQL:
>
> https://msdn.microsoft.com/en-us/library/ms175915.aspx
>
>
>
> but this bulk insertion doesn't depend on NiFi; it runs inside SQL Server.
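The SQL Server-side statement prabhu refers to (via the MSDN link above) looks roughly like the one built below; the table name and file path are hypothetical, chosen only for illustration:

```python
# A sketch of the kind of BULK INSERT statement described above. The table
# name and file path are made up; the file must be readable by the SQL Server
# instance itself, which is why this step "doesn't depend on NiFi".
table = "dbo.Records"
path = r"C:\data\input.dat"

sql = (
    f"BULK INSERT {table} FROM '{path}' "
    "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n')"
)
print(sql)
```

The field terminator matches the `|`-delimited sample rows discussed later in the thread.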
>
> I need to speed up the ExtractText (bottleneck) processing.
>
>
>
>
>
>
>
>
> On Mon, Oct 24, 2016 at 6:46 PM, Lee Laim <lee.laim@gmail.com> wrote:
>
> Hello Prabhu,
>
>
>
> 50 minutes is a good start! Now we have to *determine where the next
> bottleneck is* - check to see where the flow files are queueing.  You can
> also check the "average task duration" statistic for each processor.  I
> suspect the bottleneck is at PutSQL and will carry this assumption
> forward.
>
>
>
> There are several knobs you can adjust at the assumed PutSQL bottleneck:
>
> 1. Increase the run duration so the PutSQL processor keeps running for *2
> seconds* before releasing the thread.
> 2. Set *Fragmented Transactions to false*.  This removes constraints that
> take time to check.
> 3. Change the *batch size* systematically and observe how throughput
> changes. I'd move up in increments of 100.
> 4. *Increase the number of concurrent tasks* for the bottleneck processor
> to 3 or higher.  Increase systematically to observe whether you get more flow
> files through.  You can increase the *max timer driven threads* of the
> NiFi instance in the *NiFi Flow Settings* (top right of the canvas).  You
> *can* set the max to 25, but you are truly limited by hardware here. Consider
> a more powerful system to manage this flow, especially with the time
> constraint you have. It is often easier to throw more hardware at the
> problem than to debug.
>
> Other areas:
>
> 5. On the output of the last SplitText processor, set *back pressure
> object threshold = 1*.  This will slow (temporarily stop) the first SplitText
> processor and reduce the overall number of flow files to manage.  It
> also reduces NiFi's demand on the CPU threads.
> 6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk
> access.
> 7. Check connection/load on the SQL server.
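The batch-size intuition behind knob 3 can be sketched outside NiFi; this is an illustrative Python sketch of the round-trip arithmetic, not PutSQL's actual implementation:

```python
# Why PutSQL batch size matters: each batch is one round trip to the
# database instead of one round trip per row. The rows and the commented-out
# insert are hypothetical placeholders.
rows = [("1", "2", "3", "4")] * 1000

def batches(seq, size):
    """Yield consecutive chunks of `size` items from `seq`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

round_trips = 0
for batch in batches(rows, 100):
    # cursor.executemany("INSERT INTO dbo.Records VALUES (?, ?, ?, ?)", batch)
    round_trips += 1

print(round_trips)  # 10 round trips instead of 1000 single-row inserts
```

Moving the batch size up in increments of 100, as suggested, shrinks the round-trip count proportionally until other limits (memory, transaction size) take over.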
>
>
>
> To address your queries, I used the same expression you provided:
> (.+)[|](.+)[|](.+)[|](.+)
>
> You *can* use an ExecuteStreamCommand processor to 'extract text', but in
> this case, with small flow files, it won't offer much gain.
>
>
>
> With an i5 processor, you have 4 CPU threads to process flow files,
> manage NiFi, read/write to disk, and handle all other non-NiFi processes.
> Moving to an i7 or Xeon with hyper-threading will give NiFi more resources
> to really get work done.  While clustering is great for increasing
> throughput, I wouldn't recommend clustering on a set of smaller i5 systems,
> as there is added communication overhead that will need to be handled by
> the CPU.  It can be done, but there are easier ways to increase throughput
> at this point in development.
>
>
>
> Hope this helps.  Also, if there is anything I stated that is contrary to
> what others have observed, please chime in.
>
>
>
> Thanks,
>
> Lee
>
>
>
>
>
> On Thu, Oct 20, 2016 at 6:02 PM, prabhu Mahendran <prabhuu161994@gmail.com>
> wrote:
>
> Lee,
>
> I have tried your suggested flow, which is able to insert the data into
> SQL Server in 50 minutes, but that still takes a long time.
>
> *==>* your query:
> *You might be processing the entire dat file (instead of a single row) for
> each record.*
>   How can I process the entire dat file into SQL Server?
>
>
> *==>* your query: *Without any new optimizations you'll need ~25 threads and
> sufficient memory to feed the threads.*
>   My processors run in only 10 threads even after setting concurrent tasks.
> How do I increase this to 25 threads?
>
> If you tried a quick test, please share what regex you used.
>
> Is there any other processor with functionality like ExtractText?
>
> Thanks
>
>
>
> On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim <lee.laim@gmail.com> wrote:
>
> Prabu,
>
>
>
> In order to move 3M rows in 10 minutes, you'll need to process 5000
> rows/second.
>
> During your 4 hour run, you were processing ~200 rows/second.
>
>
>
> Without any new optimizations you'll need ~25 threads and sufficient
> memory to feed the threads.  I agree with Mark and you should be able to
> get far more than 200 rows/second.
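The arithmetic behind these figures, as a quick check:

```python
# Target vs. observed throughput from the thread: 3M rows in 10 minutes
# versus 3M rows in the reported 4-hour run.
rows = 3_000_000

target_rate = rows / (10 * 60)        # 10-minute goal -> rows per second
observed_rate = rows / (4 * 60 * 60)  # 4-hour run -> rows per second

print(target_rate)                        # 5000.0 rows/second needed
print(round(observed_rate))               # ~208 rows/second observed
print(round(target_rate / observed_rate)) # ~24x gap, hence the ~25 threads
```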
>
>
>
> I ran a quick test using your ExtractText regex on similar data, and I was able
> to process over 100,000 rows/minute through the ExtractText processor.
> The input data was a single row of 4 fields delimited by the "|" symbol.
>
>
>
> *You might be processing the entire dat file (instead of a single row) for
> each record.*
>
> *Can you check the flow file attributes and content going into
> ExtractText?  *
>
>
>
>
>
> Here is the flow with some notes:
>
>
>
> 1.GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10
> bytes)
>
>
>
> 2 SplitText -> SplitText  (to break the 3M rows down to manageable chunks
> of 10,000 lines per flow file, then split again to 1 line per flow file)
>
>
>
> 3. ExtractText to extract the 4 fields
>
>
>
> 4. ReplaceText to generate json (You can alternatively use
> AttributesToJson here)
>
>
>
> 5. ConvertJSONtoSQL
>
>
>
> 6. PutSQL - (This should be the true bottleneck; index the DB well and use
> many threads)
>
>
>
> If my assumptions are incorrect, please let me know.
>
>
>
> Thanks,
>
> Lee
>
>
>
> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <
> Kevin.Verhoeven@ds-iq.com> wrote:
>
> I’m not clear on how much data you are processing. Does the data (.dat)
> file have 3,00,000 rows?
>
>
>
> Kevin
>
>
>
> *From:* prabhu Mahendran [mailto:prabhuu161994@gmail.com]
> *Sent:* Wednesday, October 19, 2016 2:05 AM
> *To:* users@nifi.apache.org
> *Subject:* Re: How to increase the processing speed of the ExtractText
> and ReplaceText Processor?
>
>
>
> Mark,
>
> Thanks for the response.
>
> My Sample input data(.dat) like below..,
>
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
>
> In ExtractText, I have added only the inputrow property in addition to the
> default properties, as in the screenshot below.
>
> [image: Inline image 1]
> In ReplaceText,
>
> I just replace the value like {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
> [image: Inline image 2]
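Outside NiFi, the ExtractText + ReplaceText combination described above amounts to the following; this is a Python sketch of the behavior, not NiFi's actual implementation:

```python
import json
import re

# One input row, as in the sample data above.
row = "1|2|3|4"

# ExtractText applies the regex and stores each capture group in a
# flow-file attribute named inputrow.1 .. inputrow.4.
match = re.match(r"(.+)[|](.+)[|](.+)[|](.+)", row)
attrs = {f"inputrow.{i}": g for i, g in enumerate(match.groups(), start=1)}

# ReplaceText then substitutes those attributes into the JSON template.
record = {f"data{i}": attrs[f"inputrow.{i}"] for i in range(1, 5)}
print(json.dumps(record))  # {"data1": "1", "data2": "2", "data3": "3", "data4": "4"}
```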
>
>
> There are no bulletins indicating back pressure on the processors.
>
> Can I know the prerequisites needed to move the 3,00,000 rows into SQL Server
> in 10-20 minutes?
> How many CPUs are needed?
> How much heap size and perm gen size do we need to set to move that data
> into SQL Server?
>
> Thanks
>
>
>
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <markap14@hotmail.com> wrote:
>
> Prabhu,
>
>
>
> Thanks for the details. All of this seems fairly normal. Given that you
> have only a single core,
>
> I don't think multiple concurrent tasks will help you. Can you share your
> configuration for ExtractText
>
> and ReplaceText? Depending on the regex'es being used, they can be
> extremely expensive to evaluate.
>
> The regex that you mentioned in the other email -
> "(.+)[|](.+)[|](.+)[|](.+)" is in fact extremely expensive.
>
> Any time that you have ".*" or ".+" in your regex, it is going to be
> extremely expensive, especially with
>
> longer FlowFile content.
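Mark's point can be demonstrated with an ordinary regex engine. The negated-character-class alternative below is a common mitigation I'm adding for illustration; it is not something proposed in the thread:

```python
import re

# (.+) forces heavy backtracking: each group first swallows the rest of the
# line and then gives characters back until the next '|' is found. A negated
# character class ([^|]+) can never cross a delimiter, so it avoids that work.
line = "1|2|3|4"

greedy = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")
no_backtrack = re.compile(r"([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)")

assert greedy.match(line).groups() == ("1", "2", "3", "4")
assert no_backtrack.match(line).groups() == ("1", "2", "3", "4")
print("both patterns match, but the second does far less work on long lines")
```

Both patterns extract the same four fields here; the difference shows up as CPU time on longer lines, which is exactly where ExtractText was reported to stall.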
>
>
>
> Also, do you see any bulletins indicating that the provenance repository
> is applying backpressure? Given
>
> that you are splitting your FlowFiles into individual lines, the
> provenance repository may be under a lot
>
> of pressure.
>
>
>
> Another thing to check, is how much garbage collection is occurring. This
> can certainly destroy your performance
>
> quickly. You can get this information by going to the "Summary Table" in
> the top-right of the UI and then clicking the
>
> "System Diagnostics" link in the bottom-right corner of that Summary Table.
>
>
>
> Thanks
>
> -Mark
>
>
>
>
>
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <prabhuu161994@gmail.com>
> wrote:
>
>
>
> Mark,
>
> Thanks for your response.
>
> Please find the response for your questions.
>
> ==>The first processor that you see that exhibits poor performance is
> ExtractText, correct?
>                              Yes, ExtractText exhibits poor performance.
>
> ==>How big is your Java heap?
>                             I have set 1 GB for the Java heap.
>
> ==>Do you have back pressure configured on the connection between
> ExtractText and ReplaceText?
>                            There is no back pressure configured between
> ExtractText and ReplaceText.
>
> ==>When you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be?
>                           I have set concurrent tasks to 2 for the
> ExtractText processor due to its slower processing rate, specified in the
> Concurrent Tasks text box.
>
> ==>Have you changed the maximum number of concurrent tasks available to
> your dataflow?
>                          No, I haven't changed it.
>
> ==>How many CPU's are available on this machine?
>                         Only a single CPU is available in this machine, a
> Core i5 processor @ 2.20 GHz.
>
> ==> Are these the only processors in your flow, or do you have other
> dataflows going on in the same instance as NiFi?
>                        Yes, these are the only processors in the workflow
> that is running; no other instances are running.
>
> Thanks
>
>
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <markap14@hotmail.com> wrote:
>
> Prabhu,
>
>
>
> Certainly, the performance that you are seeing, taking 4-5 hours to move
> 3M rows into SQLServer is far from
>
> ideal, but the good news is that it is also far from typical. You should
> be able to see far better results.
>
>
>
> To help us understand what is limiting the performance, and to make sure
> that we understand what you are seeing,
>
> I have a series of questions that would help us to understand what is
> going on.
>
>
>
> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
>
> Can you share the configuration that you have for that processor?
>
>
>
> How big is your Java heap? This is configured in conf/bootstrap.conf; by
> default it is configured as:
>
> java.arg.2=-Xms512m
>
> java.arg.3=-Xmx512m
>
>
>
> Do you have backpressure configured on the connection between ExtractText
> and ReplaceText?
>
>
>
> Also, when you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks
>
> to be? Have you changed the maximum number of concurrent tasks available
> to your dataflow? By default, NiFi will
>
> use only 10 threads max. How many CPU's are available on this machine?
>
>
>
> And finally, are these the only processors in your flow, or do you have
> other dataflows going on in the
>
> same instance as NiFi?
>
>
>
> Thanks
>
> -Mark
>
>
>
>
>
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <prabhuu161994@gmail.com>
> wrote:
>
>
>
> Hi All,
>
> I have tried to perform the below operation.
>
> dat file(input)-->JSON-->SQL-->SQLServer
>
>
> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->
> ConvertJsonToSQL-->PutSQL.
>
> My Input File(.dat)-->3,00,000 rows.
>
> *Objective:* Move the data from '.dat' file into SQLServer.
>
> I am able to store the data in SQL Server by using the combination of the
> above processors, but it takes almost 4-5 hours to move the complete data into SQL Server.
>
> The combination of SplitTexts reads the data quickly, but ExtractText
> takes a long time to match the given data against the user-defined expression. The
> input is 107 MB, but it sends outputs only KB at a time, and even the ReplaceText
> processor processes data only KB at a time.
>
> This slow processing leads to more time taken to move the
> data into SQL Server.
>
>
> ExtractText, ReplaceText, and ConvertJSONtoSQL send outgoing flow
> files in kilobytes only.
>
> If I specify concurrent tasks for ExtractText, ReplaceText, and ConvertJSONtoSQL,
> they occupy 100% of CPU and disk usage.
>
> It is just 30 MB of data, but the processors take 6 hours to move it into
> SQL Server.
>
> The problems faced are:
>
>    1.        Almost 6 hours taken to move the 3 lakh rows into SQL Server.
>    2.        ExtractText and ReplaceText take a long time to process the
>    data (they send output flow files in KB sizes only).
>
> Can anyone help me solve the *requirement* below?
>
> I need to reduce the time taken by the processors to move the
> lakhs of rows into SQL Server.
>
> If I have done anything wrong, please help me do it right.
>
