nifi-users mailing list archives

From Boris Tyukin <bo...@boristyukin.com>
Subject Re: AVRO is the only output format with ExecuteSQL
Date Mon, 13 Aug 2018 12:34:09 GMT
Matt, you are awesome! 15 files changed and 3k lines of code - man, do not
tell me you did that in just a few days :)

since it has not been merged into master yet, can I just use your
personal branch to build all of NiFi? or is it better to cherry-pick your
commit into master? I would like to try it out

Boris

On Fri, Aug 10, 2018 at 4:55 PM Matt Burgess <mattyb149@apache.org> wrote:

> Boris et al,
>
> I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
> under NIFI-4517, in case anyone wants to play around with it :)
>
> Regards,
> Matt
>
> [1] https://github.com/apache/nifi/pull/2945
> On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin <boris@boristyukin.com> wrote:
> >
> > Matt, you rock!! thank you!!
> >
> > On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess <mattyb149@gmail.com> wrote:
> >>
> >> Sounds good, it makes the underlying code a bit more complicated but I
> >> see from y’all’s points that a “separate” processor is a better user
> >> experience. I’m knee deep in it as we speak, hope to have a PR up in a few
> >> days.
> >>
> >> Thanks,
> >> Matt
> >>
> >>
> >> On Aug 7, 2018, at 5:07 PM, Andrew Grande <aperepel@gmail.com> wrote:
> >>
> >> I'd really like to see the Record suffix on the processor for
> >> discoverability, as already mentioned.
> >>
> >> Andrew
> >>
> >> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess <mattyb149@apache.org> wrote:
> >>>
> >>> Yeah that's definitely doable, most of the logic for writing a
> >>> ResultSet to a Flow File is localized (currently to JdbcCommon but
> >>> also in ResultSetRecordSet), so I wouldn't think it would be too much
> >>> of a refactor. What are folks' thoughts on whether to add a Record Writer
> >>> property to the existing ExecuteSQL or subclass it to a new processor
> >>> called ExecuteSQLRecord? The former is more consistent with how the
> >>> SiteToSite reporting tasks work, but this is a processor. The latter
> >>> is more consistent with the way we've done other record processors,
> >>> and the benefit there is that we don't have to add a bunch of
> >>> documentation to fields that will be ignored (such as the Use Avro
> >>> Logical Types property which we wouldn't need in an ExecuteSQLRecord).
> >>> Having said that, we will want to offer the same options in the Avro
> >>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
> >>>
> >>> Thanks,
> >>> Matt
> >>>
> >>> [1] https://issues.apache.org/jira/browse/NIFI-5405
> >>>
> >>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto <alopresto@apache.org> wrote:
> >>> >
> >>> > Matt,
> >>> >
> >>> > Would extending the core ExecuteSQL processor with an
> >>> > ExecuteSQLRecord processor also work? I wonder about discoverability if
> >>> > only one processor is present and in other places we explicitly name the
> >>> > processors which handle records as such. If the ExecuteSQL processor
> >>> > handled all the SQL logic, and the ExecuteSQLRecord processor just
> >>> > delegated most of the processing in its #onTrigger() method to super, do
> >>> > you foresee any substantial difficulties? It might require some refactoring
> >>> > of the parent #onTrigger() to service methods.
> >>> >
> >>> >
> >>> > Andy LoPresto
> >>> > alopresto@apache.org
> >>> > alopresto.apache@gmail.com
> >>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >>> >
> >>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande <aperepel@gmail.com> wrote:
> >>> >
> >>> > As a side note, one has to have a serious justification _not_ to use
> >>> > record-based processors. The benefits, including performance, are too
> >>> > numerous to call out here.
> >>> >
> >>> > Andrew
> >>> >
> >>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne <markap14@hotmail.com> wrote:
> >>> >>
> >>> >> Boris,
> >>> >>
> >>> >> Using a Record-based processor does not mean that you need to define a
> >>> >> schema upfront. This is necessary if the source itself cannot provide a
> >>> >> schema. However, since it is pulling structured data and the schema can
> >>> >> be inferred from the database, you wouldn't need to. As Matt was saying,
> >>> >> your Record Writer can simply be configured to Inherit Record Schema. It
> >>> >> can then write the schema to the "avro.schema" attribute or you can
> >>> >> choose "Do Not Write Schema". This would still allow the data to be
> >>> >> written in JSON, CSV, etc.
> >>> >>
> >>> >> You could also have the Record Writer choose to write the schema using
> >>> >> the "avro.schema" attribute, as mentioned above, and then have any
> >>> >> down-stream processors read the schema from this attribute. This would
> >>> >> allow you to use any record-oriented processors you'd like without
> >>> >> having to define the schema yourself, if you don't want to.
> >>> >>
> >>> >> Thanks
> >>> >> -Mark
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin <boris@boristyukin.com> wrote:
> >>> >>
> >>> >> thanks for all the responses! it means I am not the only one interested
> >>> >> in this topic.
> >>> >>
> >>> >> Record-aware version would be really nice, but a lot of times I do not
> >>> >> want to use record-based processors since I need to define a schema for
> >>> >> input/output upfront and just want to run SQL query and get whatever
> >>> >> results back. It just adds an extra step that will be subject to
> >>> >> break/support.
> >>> >>
> >>> >> Similar to Kafka processors, it is nice to have an option of
> >>> >> record-based processor vs. message oriented processor. But if one
> >>> >> processor can do it all, it is even better :)
> >>> >>
> >>> >>
> >>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess <mattyb149@apache.org> wrote:
> >>> >>>
> >>> >>> I'm definitely interested in supporting a record-aware version as well
> >>> >>> (I wrote the Jira up last year [1] but haven't gotten around to
> >>> >>> implementing it), however I agree with Peter's comment on the Jira.
> >>> >>> Since ExecuteSQL is an oft-touched processor, if we had two processors
> >>> >>> that only differed in how the output is formatted, it could be harder
> >>> >>> to maintain (bugs to be fixed in two places, e.g.). I think we should
> >>> >>> add an optional RecordWriter property to ExecuteSQL, and the
> >>> >>> documentation would reflect that if it is not set, the output will be
> >>> >>> Avro with embedded schema as it has always been. If the RecordWriter
> >>> >>> is set, either the schema can be hardcoded, or they can use "Inherit
> >>> >>> Record Schema" even though there's no reader, and that would mimic the
> >>> >>> current behavior where the schema is inferred from the database
> >>> >>> columns and used for the writer. There is precedent for this pattern
> >>> >>> in the SiteToSite reporting tasks.
> >>> >>>
> >>> >>> To Bryan's point about history, Avro at the time was the most
> >>> >>> descriptive of the solutions because it maintains the schema and
> >>> >>> datatypes with the data, unlike JSON, CSV, etc. Also before the record
> >>> >>> readers/writers, as Bryan said, you pretty much had to split,
> >>> >>> transform, merge. We just need to make that processor (and others with
> >>> >>> specific input/output formats) "record-aware" for better performance.
> >>> >>>
> >>> >>> Regards,
> >>> >>> Matt
> >>> >>>
> >>> >>> [1] https://issues.apache.org/jira/browse/NIFI-4517
> >>> >>> On Tue, Aug 7, 2018 at 9:20 AM Bryan Bende <bbende@gmail.com> wrote:
> >>> >>> >
> >>> >>> > I would also add that the pattern of splitting to 1 record per flow
> >>> >>> > file was common before the record processors existed, and generally
> >>> >>> > this can/should be avoided now in favor of processing/manipulating
> >>> >>> > records in place, and keeping them together in large batches.
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> > On Tue, Aug 7, 2018 at 9:10 AM, Andrew Grande <aperepel@gmail.com> wrote:
> >>> >>> > > Careful, that makes too much sense, Joe ;)
> >>> >>> > >
> >>> >>> > >
> >>> >>> > > On Tue, Aug 7, 2018, 8:45 AM Joe Witt <joe.witt@gmail.com> wrote:
> >>> >>> > >>
> >>> >>> > >> i think we just need to make an ExecuteSqlRecord processor.
> >>> >>> > >>
> >>> >>> > >> thanks
> >>> >>> > >>
> >>> >>> > >> On Tue, Aug 7, 2018, 8:41 AM Mike Thomsen <mikerthomsen@gmail.com> wrote:
> >>> >>> > >>>
> >>> >>> > >>> My guess is that it is due to the fact that Avro is the only record type
> >>> >>> > >>> that can match sql pretty closely feature to feature on data types.
> >>> >>> > >>> On Tue, Aug 7, 2018 at 8:33 AM Boris Tyukin <boris@boristyukin.com>
> >>> >>> > >>> wrote:
> >>> >>> > >>>>
> >>> >>> > >>>> I've been wondering since I started learning NiFi why ExecuteSQL
> >>> >>> > >>>> processor only returns AVRO formatted data. All community examples I've seen
> >>> >>> > >>>> then convert AVRO to json and pretty much all of them then split json to
> >>> >>> > >>>> multiple flows.
> >>> >>> > >>>>
> >>> >>> > >>>> I found myself doing the same thing over and over and over again.
> >>> >>> > >>>>
> >>> >>> > >>>> Since everyone is doing it, is there a strong reason why AVRO is liked
> >>> >>> > >>>> so much? And why everyone continues doing this 3 step pattern rather than
> >>> >>> > >>>> providing users with an option to output json instead and another option to
> >>> >>> > >>>> output one flowfile or multiple (one per record).
> >>> >>> > >>>>
> >>> >>> > >>>> thanks
> >>> >>> > >>>> Boris
> >>> >>
> >>> >>
> >>> >
>
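
The design the thread converges on (a parent ExecuteSQL that owns the SQL
logic, with ExecuteSQLRecord overriding only the output step, and the schema
taken from the database columns rather than declared up front) can be
illustrated with a minimal sketch. It is written in Python rather than NiFi's
Java, runs against an in-memory SQLite table, and is not the code from the PR.
Every class, method, and writer name below is hypothetical and chosen only for
illustration; the parent's default output is a plain list of dicts standing in
for Avro.

```python
import csv
import io
import json
import sqlite3


class ExecuteSQL:
    """Hypothetical stand-in for the parent processor: owns all SQL logic."""

    def __init__(self, connection):
        self.connection = connection

    def on_trigger(self, query):
        # Run the query and take the column names from the result-set
        # metadata, so the caller never declares a schema up front.
        cursor = self.connection.execute(query)
        columns = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
        # Only the output step is delegated to an overridable method.
        return self.write_result(columns, rows)

    def write_result(self, columns, rows):
        # Stand-in for the fixed default output (Avro in the real processor);
        # a list of dicts keeps the sketch free of any Avro library.
        return [dict(zip(columns, row)) for row in rows]


class ExecuteSQLRecord(ExecuteSQL):
    """Hypothetical record-aware subclass: overrides only the output step."""

    def __init__(self, connection, record_writer):
        super().__init__(connection)
        self.record_writer = record_writer

    def write_result(self, columns, rows):
        # The pluggable writer receives the inferred column names.
        return self.record_writer(columns, rows)


def json_writer(columns, rows):
    """Hypothetical writer: serialize all rows as one JSON array."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


def csv_writer(columns, rows):
    """Hypothetical writer: serialize all rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def demo_connection():
    """Build a small in-memory table to query (illustrative data only)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    connection.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")]
    )
    return connection
```

Calling ExecuteSQLRecord(demo_connection(), csv_writer).on_trigger("SELECT id,
name FROM users ORDER BY id") returns all rows as a single CSV string with a
header line, and swapping in json_writer changes only the output format. The
query handling lives in one place, which addresses the maintenance concern
raised earlier in the thread about fixing bugs in two processors.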
