nifi-users mailing list archives

From Venkat Williams <venkat.willi...@gmail.com>
Subject Re: Create a ConvertCSVToJSON Processor
Date Tue, 06 Jun 2017 15:49:48 GMT
Thanks, Mark, for helping me build a template and test Convert CSV to JSON
processing.

I want to know whether it is possible to emit transformed records to the
next processor as they are produced, rather than waiting for the full
file to be processed and keeping the entire result in a single FlowFile.

Input:
id,topic,hits
Rahul,scala,120
Nikita,spark,80
Mithun,spark,1
myself,cca175,180

Actual Output:
[{"id":"Rahul","topic":"scala","hits":120},{"id":"Nikita","topic":"spark","hits":80},{"id":"Mithun","topic":"spark","hits":1},{"id":"myself","topic":"cca175","hits":180}]

Expected output (multiple FlowFiles, like a split result):
{"id":"Rahul","topic":"scala","hits":120}
{"id":"Nikita","topic":"spark","hits":80}
{"id":"Mithun","topic":"spark","hits":1}
{"id":"myself","topic":"cca175","hits":180}

By doing this I can avoid the heap/OutOfMemory issues that are so common
(scenario: NiFi limited to 1 GB of RAM, processing 5 GB of input data).
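
To make this concrete, here is a minimal standalone sketch of the
streaming idea (plain Java; the naive comma split is a placeholder and
is not RFC-4180 aware, and this is not the actual processor code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class StreamingCsvToJson {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"));
             PrintWriter out = new PrintWriter("output.json")) {
            String[] header = in.readLine().split(",");  // naive split, placeholder only
            String line;
            while ((line = in.readLine()) != null) {
                String[] values = line.split(",");
                StringBuilder json = new StringBuilder("{");
                for (int i = 0; i < header.length; i++) {
                    if (i > 0) json.append(',');
                    json.append('"').append(header[i]).append("\":\"")
                        .append(i < values.length ? values[i] : "").append('"');
                }
                out.println(json.append('}'));  // emit each record immediately
            }
        }
    }
}

Each record leaves the loop as soon as it is parsed, so memory stays
flat no matter how large the input file is.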

Regards,
Venkat

On Tue, Jun 6, 2017 at 8:32 PM, Mark Payne <markap14@hotmail.com> wrote:

> Hi Venkat,
>
> I just published a blog post [1] on running SQL in NiFi. The post walks
> through creating a CSV Record Reader,
> running SQL over the data, and then writing the results in JSON. This may
> be helpful to you. In your case,
> you may want to just use the ConvertRecord processor, rather than
> QueryRecord, but the concepts of creating
> the Record Reader and Writer are the same. This post references another
> post [2] that I wrote a week or two ago
> that gives a bit more detail on how to actually create the reader and
> writer.
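>
> For example, with QueryRecord the SQL lives in a processor property; a
> query along the lines of SELECT * FROM FLOWFILE WHERE hits > 100 would
> filter your sample records (FLOWFILE is the table name QueryRecord
> exposes). With ConvertRecord no SQL is needed at all.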
>
> The CSV Reader uses Apache Commons CSV, so it will support RFC-4180,
> embedded newlines, escaped
> double-quotes, etc.
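>
> As a rough illustration (plain Commons CSV, not NiFi's internal code),
> reading an RFC-4180 file looks something like this:
>
> import java.io.FileReader;
> import java.io.Reader;
> import org.apache.commons.csv.CSVFormat;
> import org.apache.commons.csv.CSVRecord;
>
> public class Rfc4180Example {
>     public static void main(String[] args) throws Exception {
>         try (Reader in = new FileReader("input.csv")) {
>             // RFC4180 handles quoted values, embedded newlines, and "" escapes
>             for (CSVRecord record : CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in)) {
>                 System.out.println(record.toMap()); // header name -> value
>             }
>         }
>     }
> }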
>
> I hope this helps give some direction on how to handle this in NiFi.
>
> Thanks
> -Mark
>
> [1] https://blogs.apache.org/nifi/entry/real-time-sql-on-event
> [2] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
>
>
> On Jun 6, 2017, at 9:52 AM, Venkat Williams <venkat.williams@gmail.com>
> wrote:
>
> Hi Joe Witt,
>
> Thanks for your response.
>
> I have heard and read about these record readers but haven't quite
> understood how to use them with some test data or a template. It would
> be great if you could help me find a working example or flow.
>
> I want to know whether these implementations support RFC-4180
> <https://tools.ietf.org/html/rfc4180> formatted CSV files and handle
> edge cases like embedded newlines in a field value and escaped double
> quotes.
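>
> For instance, a single record like the following (a comma inside a
> quoted value, an escaped double quote, and an embedded newline) should
> still parse as one row:
>
> name,comment
> "Smith, John","She said ""hi""
> and left"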
>
> Thanks in advance for your help.
>
> Regards,
> Venkat
>
> On Tue, Jun 6, 2017 at 7:07 PM, Joe Witt <joe.witt@gmail.com> wrote:
>
>> Venkat
>>
>> I think you'll want to take a closer look at the Apache NiFi 1.2.0
>> release's support for record readers and record writers. It handles
>> schema-aware parsing/transformation and more for formats like CSV,
>> JSON, and Avro, can be easily extended, and supports scripted readers
>> and writers written right there in the UI. As it is new, examples are
>> still emerging, but we can certainly help you along.
>>
>> Thanks
>> Joe
>>
>> On Tue, Jun 6, 2017 at 3:12 AM, Venkat Williams
>> <venkat.williams@gmail.com> wrote:
>> > Hi,
>> >
>> > I want to contribute this processor's implementation code to the NiFi
>> > project.
>> >
>> > Requirements:
>> >
>> > 1) Convert CSV files to a standard/canonical JSON format
>> >    a. One JSON object/document per row in the input CSV
>> >    b. The format should encode the data as JSON fields and values
>> >    c. JSON field names should be the original column headers, with any
>> >       invalid characters handled properly
>> >    d. Values should be kept unaltered
>> > 2) Optionally, be able to specify an expected header used to
>> >    validate/reject input CSVs
>> > 3) Support both tab- and comma-delimited files
>> >    a. Auto-detection based on the header row is easy
>> >    b. Allow the operator to specify the delimiter as a way to override
>> >       the auto-detect logic
>> > 4) Handle arbitrarily large files
>> >    a. Should handle CSV files of any length (achieved using batching)
>> > 5) Handle errors gracefully
>> >    a. File failures
>> >    b. Row failures
>> > 6) Support RFC-4180 formatted CSV files and be sure to handle edge
>> >    cases like embedded newlines in a field value and escaped double
>> >    quotes
>> >
>> > Example:
>> >
>> > Input CSV:
>> >
>> > user,source_ip,source_country,destination_ip,url,timestamp
>> > Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,2017-02-22T14:46:24-05:00
>> >
>> > Desired output JSON:
>> >
>> > {"user":"Venkat","source_ip":"192.168.0.1","source_country":"IN","destination_ip":"23.246.97.82","url":"http://www.google.com","timestamp":"2017-02-22T14:46:24-05:00"}
>> >
>> > Implementation:
>> >
>> > 1) Reviewed the existing CSV libraries that can be used to transform a
>> > CSV record into a JSON document while supporting the RFC-4180 standard
>> > (embedded newlines in field values and escaped quotes). Found that the
>> > OpenCSV, FastCSV, and Univocity libraries do this job most effectively.
>> >
>> > 2) Selected the Univocity CSV library, as it alone let me do most of
>> > the validations that are part of my requirements. In performance
>> > testing with arbitrarily large 5 GB and 10 GB files, it gave better
>> > results than any of the others.
>> >
>> > 3) Processed CSV records are emitted immediately rather than after the
>> > complete file has been processed; a configurable number in the
>> > processor controls how many records accumulate before each emit (see
>> > the sketch after this list). With this approach I could process 5 GB
>> > of CSV records using 1 GB of NiFi RAM, which is the most attractive
>> > feature of this whole implementation for handling large files. (This
>> > is a common limitation in processors like SplitText, SplitXml, etc.,
>> > which wait until the whole file is processed and store the resulting
>> > FlowFiles in an ArrayList inside the processor, causing heap-size/
>> > OutOfMemory issues.)
>> >
>> > 4) Handled file errors and record errors gracefully using user-defined
>> > configurations and processor routes.
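>> >
>> > As a rough illustration of the streaming parse with univocity-parsers
>> > (a sketch only, not the actual processor code):
>> >
>> > import java.io.FileReader;
>> > import java.util.LinkedHashMap;
>> > import java.util.Map;
>> > import com.univocity.parsers.csv.CsvParser;
>> > import com.univocity.parsers.csv.CsvParserSettings;
>> >
>> > public class UnivocityStreamingExample {
>> >     public static void main(String[] args) throws Exception {
>> >         CsvParserSettings settings = new CsvParserSettings();
>> >         settings.setHeaderExtractionEnabled(true); // first row is the header
>> >         CsvParser parser = new CsvParser(settings);
>> >         parser.beginParsing(new FileReader("input.csv"));
>> >         String[] headers = parser.getContext().headers();
>> >         String[] row;
>> >         while ((row = parser.parseNext()) != null) {
>> >             Map<String, String> record = new LinkedHashMap<>();
>> >             for (int i = 0; i < headers.length; i++) {
>> >                 record.put(headers[i], i < row.length ? row[i] : null);
>> >             }
>> >             System.out.println(record); // convert to JSON and emit right away,
>> >                                         // flushing every N records instead of
>> >                                         // holding the whole file in memory
>> >         }
>> >         parser.stopParsing();
>> >     }
>> > }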
>> >
>> > Can anyone suggest how to proceed: should I open a new issue, or use
>> > an existing one? (I can't find any existing issue that matches this
>> > requirement.)
