nifi-users mailing list archives

From Boris Tyukin <bo...@boristyukin.com>
Subject Re: AVRO is the only output format with ExecuteSQL
Date Tue, 07 Aug 2018 18:08:12 GMT
now this is really slick! thanks Mark for educating me!

On Tue, Aug 7, 2018 at 1:15 PM Mark Payne <markap14@hotmail.com> wrote:

> Boris,
>
> Using a Record-based processor does not mean that you need to define a
> schema upfront. Defining one is necessary if the source itself cannot
> provide a schema. However, since the processor is pulling structured data
> and the schema can be inferred from the database, you wouldn't need to. As
> Matt was saying, your Record Writer can simply be configured to Inherit
> Record Schema. It can then write the schema to the "avro.schema" attribute
> or you can choose "Do Not Write Schema". This would still allow the data
> to be written in JSON, CSV, etc.
>
> You could also have the Record Writer choose to write the schema using the
> "avro.schema" attribute, as mentioned above, and then have any downstream
> processors read the schema from this attribute. This would allow you to use
> any record-oriented processors you'd like without having to define the
> schema yourself, if you don't want to.
>
> Thanks
> -Mark
>
>
>
> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <boris@boristyukin.com> wrote:
>
> thanks for all the responses! it means I am not the only one interested in
> this topic.
>
> A record-aware version would be really nice, but a lot of times I do not
> want to use record-based processors since I need to define a schema for
> input/output upfront and just want to run a SQL query and get whatever
> results come back. It just adds an extra step that can break and has to
> be supported.
>
> Similar to Kafka processors, it is nice to have an option of record-based
> processor vs. message oriented processor. But if one processor can do it
> all, it is even better :)
>
>
> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb149@apache.org> wrote:
>
>> I'm definitely interested in supporting a record-aware version as well
>> (I wrote the Jira up last year [1] but haven't gotten around to
>> implementing it), however I agree with Peter's comment on the Jira.
>> Since ExecuteSQL is an oft-touched processor, if we had two processors
>> that only differed in how the output is formatted, it could be harder
>> to maintain (bugs to be fixed in two places, e.g.). I think we should
>> add an optional RecordWriter property to ExecuteSQL, and the
>> documentation would reflect that if it is not set, the output will be
>> Avro with embedded schema as it has always been. If the RecordWriter
>> is set, either the schema can be hardcoded, or they can use "Inherit
>> Record Schema" even though there's no reader, and that would mimic the
>> current behavior where the schema is inferred from the database
>> columns and used for the writer. There is precedent for this pattern
>> in the SiteToSite reporting tasks.
>>
>> To Bryan's point about history, Avro at the time was the most
>> descriptive of the solutions because it maintains the schema and
>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
>> readers/writers, as Bryan said, you pretty much had to split,
>> transform, merge. We just need to make that processor (and others with
>> specific input/output formats) "record-aware" for better performance.
>>
>> Regards,
>> Matt
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-4517
>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbende@gmail.com> wrote:
>> >
>> > I would also add that the pattern of splitting to 1 record per flow
>> > file was common before the record processors existed, and generally
>> > this can/should be avoided now in favor of processing/manipulating
>> > records in place, and keeping them together in large batches.
>> >
>> >
>> >
>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <aperepel@gmail.com> wrote:
>> > > Careful, that makes too much sense, Joe ;)
>> > >
>> > >
>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.witt@gmail.com> wrote:
>> > >>
>> > >> i think we just need to make an ExecuteSqlRecord processor.
>> > >>
>> > >> thanks
>> > >>
>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mikerthomsen@gmail.com> wrote:
>> > >>>
>> > >>> My guess is that it is due to the fact that Avro is the only record
>> > >>> type that can match sql pretty closely feature to feature on data types.
>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <boris@boristyukin.com>
>> > >>> wrote:
>> > >>>>
>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
>> > >>>> processor only returns AVRO formatted data. All community examples
>> > >>>> I've seen then convert AVRO to json and pretty much all of them then
>> > >>>> split json to multiple flows.
>> > >>>>
>> > >>>> I found myself doing the same thing over and over and over again.
>> > >>>>
>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
>> > >>>> so much? And why everyone continues doing this 3 step pattern rather
>> > >>>> than providing users with an option to output json instead and another
>> > >>>> option to output one flowfile or multiple (one per record).
>> > >>>>
>> > >>>> thanks
>> > >>>> Boris
>>
>
>
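
The idea discussed in this thread (infer the schema from the database columns, carry it in an "avro.schema" attribute, and let the writer pick JSON or CSV) can be sketched outside NiFi. The following is only an illustration in Python, not NiFi code: it uses the standard-library sqlite3 module as a stand-in database, and the SQL_TO_AVRO mapping and the infer_schema, write_json, and write_csv helpers are hypothetical names invented for this example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical mapping from SQLite declared types to Avro primitive types.
SQL_TO_AVRO = {"INTEGER": "long", "REAL": "double", "TEXT": "string"}


def infer_schema(cursor, table):
    """Build an Avro-style record schema from the table's column metadata."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = cursor.execute(f"PRAGMA table_info({table})").fetchall()
    fields = [
        {"name": name, "type": ["null", SQL_TO_AVRO.get(ctype.upper(), "string")]}
        for _cid, name, ctype, *_rest in cols
    ]
    return {"type": "record", "name": table, "fields": fields}


def write_json(schema, rows):
    """Serialize rows as a JSON array, using field names from the schema."""
    names = [f["name"] for f in schema["fields"]]
    return json.dumps([dict(zip(names, row)) for row in rows])


def write_csv(schema, rows):
    """Serialize rows as CSV with a header taken from the schema."""
    names = [f["name"] for f in schema["fields"]]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
rows = cur.execute("SELECT id, name FROM users").fetchall()

# The schema is derived from the database, never written by hand.
schema = infer_schema(cur, "users")
# Stand-in for a flowfile attribute that downstream steps could read.
attributes = {"avro.schema": json.dumps(schema)}

json_out = write_json(schema, rows)
csv_out = write_csv(schema, rows)
```

The same inferred schema drives both writers, which is the point made above: the output format becomes a writer choice, and nobody has to define the schema upfront.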
