nifi-users mailing list archives

From Venkat Williams <venkat.willi...@gmail.com>
Subject Re: Create a ConvertCSVToJSON Processor
Date Wed, 07 Jun 2017 01:04:23 GMT
Thanks, Mark, for the valuable input.

SplitRecord is the way to handle multiline records, and NIFI-3921 helps us
avoid needing a schema when we can use the CSV header row itself as the schema.

Is anyone working on the NIFI-3921 issue? If not, I can take it up.

Regards,
Venkat

On Tue, Jun 6, 2017 at 10:06 PM, Mark Payne <markap14@hotmail.com> wrote:

> Venkat,
>
> If you do need to split the data up, there is now a SplitRecord processor
> that you can use to accomplish that with the readers and writers.
> So that won't have problems with CSV fields that span multiple lines.
>
> Unfortunately, at this time the writer does require that a schema registry
> be used to designate the schema. For most cases, this is fairly
> easy to do, but it is a step that we should be able to skip altogether.
> There already exists a JIRA [1] to update the readers/writers so that
> the Record Writer can just inherit the schema that is provided by the
> Record Reader. Once this has been done, the CSV Reader should
> be able to create the schema based on the CSV Header, and then pass that
> along to the record writer.
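>
> To illustrate the idea, deriving the field names from the header is
> essentially all the reader needs to do. A rough standalone sketch using
> Apache Commons CSV (not the actual NiFi internals):
>
>     import java.io.FileReader;
>     import java.io.Reader;
>     import org.apache.commons.csv.CSVFormat;
>     import org.apache.commons.csv.CSVParser;
>
>     public class HeaderToSchema {
>         public static void main(String[] args) throws Exception {
>             try (Reader in = new FileReader("input.csv");
>                  CSVParser parser = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in)) {
>                 // The header map's keys are the column names, which would
>                 // become the field names of the inherited record schema.
>                 System.out.println(parser.getHeaderMap().keySet());
>             }
>         }
>     }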
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-3921
>
>
> On Jun 6, 2017, at 12:12 PM, Venkat Williams <venkat.williams@gmail.com>
> wrote:
>
> Hi Joe and Mark,
>
> Thanks a lot for your prompt response.
>
> I wasn't able to consider SplitText because CSV record field values
> can spill onto the next line with embedded newlines, escaped
> double-quotes, etc. So I had to rule out any logic based on splitting lines.
>
> Another question: is it possible to convert CSV data to JSON without
> specifying any schema, just by treating the CSV file's first row as the
> header and building the schema internally from it? If I don't specify a
> schema registry, I get an error that the 'schema access strategy' is invalid.
>
> Thanks,
> Venkat
>
> On Tue, Jun 6, 2017 at 9:29 PM, Joe Witt <joe.witt@gmail.com> wrote:
>
>> Venkat,
>>
>> The only heap issues that could be considered common are if you're doing
>> 'SplitText' and trying to go from files of hundreds of thousands or
>> millions of lines to single-line outputs in a single processor.  You can
>> easily overcome that by doing a two-phase split, where the first
>> processor cuts the file into, say, 1000-line chunks and the next one does
>> single-line chunks.  That said, the record approach doesn't have that
>> problem at all, so the only cause for memory issues there would be a
>> single record so large that it takes up all the memory, which doesn't
>> appear likely for your examples.
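>>
>> For example, the two-phase configuration would look something like this
>> (using SplitText's Line Split Count property; the exact chunk size is
>> just a suggestion):
>>
>>   SplitText #1: Line Split Count = 1000   (file -> 1000-line chunks)
>>   SplitText #2: Line Split Count = 1      (chunk -> single-line FlowFiles)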
>>
>> Thanks
>>
>> On Tue, Jun 6, 2017 at 11:49 AM, Venkat Williams
>> <venkat.williams@gmail.com> wrote:
>> > Thanks, Mark, for helping me build a template and test the
>> > Convert-CSV-to-JSON processing.
>> >
>> > I want to know whether it is possible to emit transformed records to the
>> > next processor as they are produced, rather than waiting for the full
>> > file to be processed and keeping the entire result in a single FlowFile.
>> >
>> > Input:
>> > id,topic,hits
>> > Rahul,scala,120
>> > Nikita,spark,80
>> > Mithun,spark,1
>> > myself,cca175,180
>> >
>> > Actual Output:
>> > [{"id":"Rahul","topic":"scala","hits":120},{"id":"Nikita","t
>> opic":"spark","hits":80},{"id":"Mithun","topic":"spark","hit
>> s":1},{"id":"myself","topic":"cca175","hits":180}]
>> >
>> > Expected output (multiple FlowFiles, like a split result):
>> > {"id":"Rahul","topic":"scala","hits":120}
>> > {"id":"Nikita","topic":"spark","hits":80}
>> > {"id":"Mithun","topic":"spark","hits":1}
>> > {"id":"myself","topic":"cca175","hits":180}
>> >
>> > By doing this I can overcome the heap/out-of-memory issues which are so
>> > common (scenario: NiFi limited to 1 GB of RAM, wanting to process 5 GB
>> > of input data).
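>> >
>> > Conceptually, what I'm after looks like this standalone sketch (using
>> > Apache Commons CSV and Jackson for illustration, not actual NiFi
>> > processor code):
>> >
>> >     import java.io.FileReader;
>> >     import java.io.Reader;
>> >     import com.fasterxml.jackson.databind.ObjectMapper;
>> >     import org.apache.commons.csv.CSVFormat;
>> >     import org.apache.commons.csv.CSVParser;
>> >     import org.apache.commons.csv.CSVRecord;
>> >
>> >     public class CsvToJsonLines {
>> >         public static void main(String[] args) throws Exception {
>> >             ObjectMapper mapper = new ObjectMapper();
>> >             try (Reader in = new FileReader("input.csv");
>> >                  CSVParser parser = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in)) {
>> >                 // One JSON line per CSV record; nothing accumulates in memory.
>> >                 for (CSVRecord rec : parser) {
>> >                     System.out.println(mapper.writeValueAsString(rec.toMap()));
>> >                 }
>> >             }
>> >         }
>> >     }
>> >
>> > (Note that every value comes out as a JSON string; producing "hits":120
>> > as a number would need a schema or type inference.)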
>> >
>> > Regards,
>> > Venkat
>> >
>> > On Tue, Jun 6, 2017 at 8:32 PM, Mark Payne <markap14@hotmail.com>
>> wrote:
>> >>
>> >> Hi Venkat,
>> >>
>> >> I just published a blog post [1] on running SQL in NiFi. The post walks
>> >> through creating a CSV Record Reader, running SQL over the data, and
>> >> then writing the results in JSON. This may be helpful to you. In your
>> >> case, you may want to just use the ConvertRecord processor, rather than
>> >> QueryRecord, but the concepts of creating the Record Reader and Writer
>> >> are the same. This post references another post [2] that I wrote a week
>> >> or two ago that gives a bit more detail on how to actually create the
>> >> reader and writer.
>> >>
>> >> The CSV Reader uses Apache Commons CSV, so it will support RFC-4180,
>> >> embedded newlines, escaped
>> >> double-quotes, etc.
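>> >>
>> >> If you want to convince yourself of the edge-case handling, a quick
>> >> standalone test (illustrative only, not NiFi code) might look like:
>> >>
>> >>     import java.io.StringReader;
>> >>     import org.apache.commons.csv.CSVFormat;
>> >>     import org.apache.commons.csv.CSVRecord;
>> >>
>> >>     public class Rfc4180Check {
>> >>         public static void main(String[] args) throws Exception {
>> >>             // One field with an embedded newline, one with a doubled
>> >>             // (escaped) quote, per RFC-4180.
>> >>             String csv = "id,comment\r\n"
>> >>                        + "1,\"line one\nline two\"\r\n"
>> >>                        + "2,\"she said \"\"hi\"\"\"\r\n";
>> >>             for (CSVRecord rec : CSVFormat.RFC4180.withFirstRecordAsHeader()
>> >>                                                   .parse(new StringReader(csv))) {
>> >>                 System.out.println(rec.get("id") + " -> " + rec.get("comment"));
>> >>             }
>> >>         }
>> >>     }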
>> >>
>> >> I hope this helps give some direction in how to handle this in NiFi.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >> [1] https://blogs.apache.org/nifi/entry/real-time-sql-on-event
>> >> [2] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
>> >>
>> >>
>> >> On Jun 6, 2017, at 9:52 AM, Venkat Williams <venkat.williams@gmail.com> wrote:
>> >>
>> >> Hi Joe Witt,
>> >>
>> >> Thanks for your response.
>> >>
>> >> I have heard and read about these record readers but haven't quite
>> >> worked out how to use them with some test data or a template. It would
>> >> be great if you could help me find a working example or flow.
>> >>
>> >> I want to know whether these implementations support RFC-4180-formatted
>> >> CSV files and handle edge cases like embedded newlines in a field
>> >> value and escaped double quotes.
>> >>
>> >> Thanks in advance for your help.
>> >>
>> >> Regards,
>> >> Venkat
>> >>
>> >> On Tue, Jun 6, 2017 at 7:07 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> >>>
>> >>> Venkat
>> >>>
>> >>> I think you'll want to take a closer look at the Apache NiFi 1.2.0
>> >>> release's support for record readers and record writers.  It handles
>> >>> schema-aware parsing/transformation and more for things like CSV,
>> >>> JSON, and Avro, can be easily extended, and supports scripted readers
>> >>> and writers written right there through the UI.  As it is new, examples
>> >>> are still emerging, but we can certainly help you along.
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Tue, Jun 6, 2017 at 3:12 AM, Venkat Williams
>> >>> <venkat.williams@gmail.com> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > I want to contribute this processor implementation code to the NiFi
>> >>> > project.
>> >>> >
>> >>> > Requirements:
>> >>> >
>> >>> > 1) Convert CSV files to a standard/canonical JSON format
>> >>> >    a. One JSON object/document per row in the input CSV
>> >>> >    b. The format should encode the data as JSON fields and values
>> >>> >    c. JSON field names should be the original column headers, with any
>> >>> >       invalid characters handled properly
>> >>> >    d. Values should be kept unaltered
>> >>> > 2) Optionally, be able to specify an expected header used to
>> >>> >    validate/reject input CSVs
>> >>> > 3) Support both tab- and comma-delimited files
>> >>> >    a. Auto-detection based on the header row is easy
>> >>> >    b. Allow the operator to specify the delimiter as a way to override
>> >>> >       the auto-detect logic
>> >>> > 4) Handle arbitrarily large files
>> >>> >    a. Should handle CSV files of any length (achieved via batching)
>> >>> > 5) Handle errors gracefully
>> >>> >    a. File failures
>> >>> >    b. Row failures
>> >>> > 6) Support RFC-4180-formatted CSV files, handling edge cases like
>> >>> >    embedded newlines in a field value and escaped double quotes
>> >>> >
>> >>> > Example:
>> >>> >
>> >>> > Input CSV:
>> >>> >
>> >>> > user,source_ip,source_country,destination_ip,url,timestamp
>> >>> > Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,2017-02-22T14:46:24-05:00
>> >>> >
>> >>> > Desired output JSON:
>> >>> >
>> >>> > {"user":"Venkat","source_ip":"192.168.0.1","source_country":"IN","destination_ip":"23.246.97.82","url":"http://www.google.com","timestamp":"2017-02-22T14:46:24-05:00"}
>> >>> >
>> >>> > Implementation:
>> >>> >
>> >>> > 1) Reviewed all the existing CSV libraries that can be used to
>> >>> >    transform a CSV record into a JSON document while supporting the
>> >>> >    RFC-4180 standard (embedded newlines in field values, escaped
>> >>> >    quotes). Found that the OpenCSV, FastCSV, and Univocity libraries
>> >>> >    can do this job most effectively.
>> >>> > 2) Selected the Univocity CSV library, as it alone covers most of the
>> >>> >    validations in my requirements. In performance testing with
>> >>> >    arbitrarily large 5 GB and 10 GB files, it gave better results than
>> >>> >    any of the others.
>> >>> > 3) Processed CSV records are emitted immediately rather than after the
>> >>> >    complete file has been processed; a configurable number in the
>> >>> >    processor controls how many records to buffer before emitting (see
>> >>> >    the sketch after this list). With this approach I could process
>> >>> >    5 GB of CSV data using 1 GB of NiFi RAM, which is the most
>> >>> >    effective/attractive feature of this whole implementation for
>> >>> >    handling large files. (This is a common limitation in processors
>> >>> >    like SplitText, SplitXML, etc., which wait for the whole file to be
>> >>> >    processed and store the resulting FlowFiles in an ArrayList within
>> >>> >    the processor, causing heap/out-of-memory issues.)
>> >>> > 4) Handled file errors and record errors gracefully using user-defined
>> >>> >    configurations and processor routes.
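>> >>> >
>> >>> > A simplified sketch of the streaming/batching approach, using the
>> >>> > Univocity parser API (illustrative only, not the exact processor
>> >>> > code):
>> >>> >
>> >>> >     import java.io.FileReader;
>> >>> >     import java.util.ArrayList;
>> >>> >     import java.util.List;
>> >>> >     import com.univocity.parsers.csv.CsvParser;
>> >>> >     import com.univocity.parsers.csv.CsvParserSettings;
>> >>> >
>> >>> >     public class StreamingCsvBatcher {
>> >>> >         public static void main(String[] args) throws Exception {
>> >>> >             CsvParserSettings settings = new CsvParserSettings();
>> >>> >             settings.setHeaderExtractionEnabled(true); // first row = header
>> >>> >             settings.detectFormatAutomatically();      // comma vs. tab
>> >>> >             CsvParser parser = new CsvParser(settings);
>> >>> >             parser.beginParsing(new FileReader("big.csv"));
>> >>> >
>> >>> >             int batchSize = 1000; // configurable emit threshold
>> >>> >             List<String[]> batch = new ArrayList<>();
>> >>> >             String[] row;
>> >>> >             while ((row = parser.parseNext()) != null) {
>> >>> >                 batch.add(row);
>> >>> >                 if (batch.size() >= batchSize) {
>> >>> >                     emit(batch);   // convert to JSON, hand off downstream
>> >>> >                     batch.clear(); // keep memory flat for huge files
>> >>> >                 }
>> >>> >             }
>> >>> >             if (!batch.isEmpty()) {
>> >>> >                 emit(batch);
>> >>> >             }
>> >>> >         }
>> >>> >
>> >>> >         // Placeholder for writing a batch of records out as JSON.
>> >>> >         static void emit(List<String[]> batch) {
>> >>> >             System.out.println("emitting " + batch.size() + " records");
>> >>> >         }
>> >>> >     }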
>> >>> >
>> >>> > Can anyone suggest how to proceed: should I open a new issue, or is
>> >>> > there an existing issue I should use? (I can't find any that matches
>> >>> > this requirement.)
>> >>
>> >>
>> >>
>> >
>>
>
>
>
