nifi-users mailing list archives

From "Mohit" <mohit.j...@open-insights.co.in>
Subject RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Date Mon, 02 Apr 2018 14:26:43 GMT
Mark,

 

Error:- 

ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d]
failed to process due to org.apache.nifi.serialization.record.util.IllegalTypeConversionException:
Cannot convert value mohit of type class java.lang.String because no compatible types exist
in the UNION for field name; rolling back session: Cannot convert value mohit of type class
java.lang.String because no compatible types exist in the UNION for field name

 

I have a file with only one record :-  mohit,25

Just to check how it works, I’ve given it a deliberately incorrect schema (int for the string field):

{"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}

 

It doesn’t route the record to the 'invalid' relationship. Instead, it keeps the FlowFile in the queue
upstream of the ValidateRecord processor.
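
The bulletin above is consistent with the schema as written: the value mohit lands in the field name, whose union allows only null or int. A minimal sketch of that kind of check follows. It is plain Python written for illustration, not NiFi's actual conversion logic, and the helper names (fits_type, first_compatible_type, check) as well as the "swapped back" schema are assumptions.

```python
# Illustrative only: a simplified union check, not NiFi's real implementation.

def fits_type(raw, avro_type):
    """Return True if the raw CSV string could be read as avro_type."""
    if avro_type == "null":
        return raw is None or raw == ""
    if avro_type == "string":
        return isinstance(raw, str)
    if avro_type == "int":
        try:
            int(raw)
            return True
        except (TypeError, ValueError):
            return False
    return False


def first_compatible_type(raw, union):
    """Return the first union member the value fits, or None if none fit."""
    for avro_type in union:
        if fits_type(raw, avro_type):
            return avro_type
    return None


def check(record, schema):
    """Map each field to the union member its value fits (None = no match)."""
    return {
        field: first_compatible_type(record[field], union)
        for field, union in schema.items()
    }


record = {"name": "mohit", "age": "25"}

# Schema from the message: int for name, string for age.
wrong_schema = {"name": ["null", "int"], "age": ["null", "string"]}
# Assumed intended schema: the two types swapped back.
right_schema = {"name": ["null", "string"], "age": ["null", "int"]}

wrong_result = check(record, wrong_schema)
right_result = check(record, right_schema)
print(wrong_result)
print(right_result)
```

With the schema from the message, no union member fits the name field, which matches the "no compatible types exist in the UNION for field name" text in the bulletin.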

 

Mohit

 

 

From: Mark Payne <markap14@hotmail.com> 
Sent: 02 April 2018 19:53
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

 

What is the error that you're seeing? 

 





On Apr 2, 2018, at 10:22 AM, Mohit <mohit.jain@open-insights.co.in> wrote:

 

Hi Mark, 

 

I tried the ValidateRecord processor. It converts the flowfile if it is valid, but if
the records are not valid, it is not passing them to the invalid relationship. Instead it keeps on
throwing bulletins and keeps the flowfile in the queue.

 

Any suggestion?

 

Mohit

 

From: Mark Payne <markap14@hotmail.com>
Sent: 02 April 2018 19:02
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

 

Mohit,

 

You can certainly dial back that number of Concurrent Tasks. Setting that to something like
10 is a pretty big number. Setting it to a thousand means that you'll likely starve out other
processors that are waiting on a thread and will generally perform a lot worse because you
have 1,000 different threads competing with each other to try to pull the next FlowFile.

 

You can use the ValidateRecord processor and configure a schema that indicates what you expect
the data to look like. Then you can route any invalid records to one route and valid records
to another route. This will ensure that all data that goes to the 'valid' relationship is routed
one way and any other data is routed to the 'invalid' relationship.
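
The routing described above can be pictured as a single pass that partitions records by whether they match the expected schema, without splitting the input into one file per record. The sketch below is plain Python, not NiFi code; the partition_records function, the simplified schema format (field name mapped to a Python parser), and the second sample row are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical, simplified schema: field name -> callable that parses the raw string.
EXPECTED_SCHEMA = {"name": str, "age": int}


def partition_records(csv_text, schema):
    """Split CSV rows into (valid, invalid) lists in one pass over the input."""
    valid, invalid = [], []
    fields = list(schema)
    for row in csv.reader(io.StringIO(csv_text)):
        try:
            if len(row) != len(fields):
                raise ValueError("wrong number of columns")
            # Parsing raises ValueError when a value does not fit its field type.
            parsed = {f: schema[f](v) for f, v in zip(fields, row)}
            valid.append(parsed)
        except ValueError:
            invalid.append(row)
    return valid, invalid


# The second row is made-up sample data with a non-numeric age.
data = "mohit,25\nmark,not-a-number\n"
valid, invalid = partition_records(data, EXPECTED_SCHEMA)
print(valid)
print(invalid)
```

The point of the sketch is only that bad records can be separated out while the good ones stay together in one batch.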

 

Thanks

-Mark

 

 






On Apr 2, 2018, at 9:22 AM, Mohit <mohit.jain@open-insights.co.in> wrote:

 

Hi Mark,

 

The main intention behind this flow is to track bad records. The records which don’t get
converted should be tracked somewhere. For that purpose I’m using the Split-Merge approach.

Meanwhile, I’m able to improve the performance by increasing ‘Concurrent Tasks’
to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per second, which, though not optimal,
is much better than 45-50 records per second.

 

Is there any other improvement I can do?

 

Mohit

 

From: Mark Payne <markap14@hotmail.com>
Sent: 02 April 2018 18:30
To: users@nifi.apache.org
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.

 

Mohit, 

 

I agree that 45-50 records per second is quite slow. I'm not very familiar with the implementation
of ConvertCSVToAvro, but it may well be that it must perform some sort of initialization for
each FlowFile that it receives, which would explain why it's fast for a single incoming FlowFile
and slow for a large number.

Additionally, when you start splitting the data like that, you're generating a lot more FlowFiles,
which means a lot more updates to both the FlowFile Repository and the Provenance Repository.
As a result, you're basically taxing the NiFi framework far more than if you keep the data as a
single FlowFile. On my laptop, though, I would expect more than 45-50 FlowFiles per second
through most processors, but I don't know what kind of hardware you are running on.

 

In general, though, it is best to keep data together instead of splitting it apart. Since
ConvertCSVToAvro can handle many CSV records, is there a reason to split the data to begin with?
Also, I would recommend you look at using the Record-based processors [1][2] such as ConvertRecord
instead of the ConvertABCtoXYZ processors, as those are older processors and often don't work as
well. The Record-oriented processors often allow you to keep data together as a single FlowFile
throughout your entire flow, which makes the performance far better and makes the flow much
easier to design.

 

Thanks

-Mark

 

 

 

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi

[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries

 







On Apr 2, 2018, at 8:49 AM, Mohit <mohit.jain@open-insights.co.in> wrote:

 

Hi,

 

I’m trying to capture bad records from the ConvertCSVToAvro processor. For that, I’m using
two SplitText processors in a row, first to create chunks and then to put each record in its own flow file.

 

My flow is: ListFile -> FetchFile -> SplitText(10000 records) -> SplitText(1 record)
-> ConvertCSVToAvro -> *(further processing)

 

I have a 10 MB file with 15 columns per row and 64000 records. The normal flow (without SplitText)
completes in a few seconds. But when I’m using the above flow, the ConvertCSVToAvro processor
works drastically slower (45-50 rec/sec).

I’m not able to work out what I’m doing wrong in the flow.
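
A rough back-of-the-envelope calculation, using only the figures in this message, shows how many FlowFiles the two SplitText steps create and how long 64000 records take at the observed rate. The variable names are illustrative, and the count ignores the original FlowFile and anything created downstream.

```python
import math

records = 64_000      # records in the 10 MB file
chunk_size = 10_000   # first SplitText: records per chunk

# First SplitText produces chunks; second produces one FlowFile per record.
chunks = math.ceil(records / chunk_size)
split_flowfiles = chunks + records


def seconds_at(rate_per_second):
    """Time to push every single-record FlowFile through at the given rate."""
    return records / rate_per_second


fastest = seconds_at(50)  # upper end of the observed 45-50 rec/sec
slowest = seconds_at(45)  # lower end of the observed 45-50 rec/sec

print(f"chunks: {chunks}, FlowFiles created by splitting: {split_flowfiles}")
print(f"time at 45-50 rec/sec: {fastest / 60:.1f} to {slowest / 60:.1f} minutes")
```

So the split flow handles tens of thousands of FlowFiles where the unsplit flow handles one, and at 45-50 rec/sec the conversion step alone takes over twenty minutes.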

 

I’m using NiFi 1.5.0.

 

Any quick input would be appreciated.

 

 

 

Thanks,

Mohit

 

