freemarker-dev mailing list archives

From Siegfried Goeschl <siegfried.goes...@gmail.com>
Subject Re: freemarker-generator: Improving the input documents concept
Date Sun, 01 Mar 2020 08:47:48 GMT
Hi Daniel,

Please see my comments below

Thanks in advance, 

Siegfried Goeschl


> On 29.02.2020, at 21:02, Daniel Dekany <daniel.dekany@gmail.com> wrote:
> 
>> 
>> I try to provide a useful name even when the content is coming from an
>> URL
> 
> 
> When is it recommended to rely on that though? Because utilizing that means
> that renaming a data source file can break the process, even if you call
> freemarker-cli with the up-to-date file name. And whether that happens depends
> on what you (or another random colleague!) have dug inside the templates.
> So I guess we had better just not support this. Less code and fewer things to
> document too.
> 

Actually it is not recommended, but we have had named data sources for less than 24 hours

> 
>> I think we have a different understanding what a "Document" / "Datasource
>> / DataSource" should do
> 
> 
> Thing is, eventually (most certainly pre-1.0, as it influences
> architecture), certain needs will have to be addressed, somehow. Then we will
> see what "things" we really need. For now I thought we need "things" that
> are much more than paths, and encapsulate the "how to load the data"
> aspect. I called them data sources, but maybe we should call them "data
> loaders" to free up data sources for the more primitive thing. Some
> needs/doubts to address, *later*: Is it really the best approach for users
> to load/parse data sources programmatically (that code is written in FTL,
> inside the templates)? Also, is the template the right place for doing
> that? When multiple templates (or just multiple template *runs* of
> the same template, each generating a different output file) need common
> data, they shouldn't load it again and again. Also, different topic, can we
> handle the case "transparently" enough when the data is not coming from a
> file?

This is a command line tool where we have little idea what the user will do with it or how they will abuse it

* How does a "data loader" know that it is responsible for loading a file?
* What should a "CSV data loader" do: parse the file into a list of records or stream them one by one?
* How do we handle the case where multiple potential data loaders exist for a single file?

I'm leaning towards building blocks where the user controls the work to be done, even if it requires one or two extra lines of FTL code
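
To make the "building block" idea a bit more concrete, here is a minimal Java sketch of a dumb, lazy data source. It is only an illustration under my own assumptions: the class name `LazyDataSource`, its accessors and the `Supplier`-based loader are hypothetical and not the current freemarker-generator API.

```java
import java.util.function.Supplier;

// Hypothetical sketch, not the freemarker-generator API.
final class LazyDataSource {
    private final String name;
    private final String group;
    private final Supplier<String> loader; // knows how to fetch the raw content

    LazyDataSource(String name, String group, Supplier<String> loader) {
        this.name = name;
        this.group = group;
        this.loader = loader;
    }

    String getName() {
        return name;
    }

    String getGroup() {
        return group;
    }

    // Nothing is read until somebody asks for the content.
    // No parsing happens here: the caller decides whether the text
    // goes to a CSV tool, a JSON tool or anything else.
    // This sketch does not cache, so each call loads again.
    String getText() {
        return loader.get();
    }
}
```

The data source itself never decides how the content gets parsed; that stays with the template, which hands the text to whatever tool it wants.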


> 
>> The joy of programming - I did not intend to use "name:group" together
>> with wildcards :-)
> 
> 
> For a CLI tool, I guess we agree that it should work. So maybe, like this
> (here logs and foos meant to be "groups"):
> --data-source logs file1.log file2.log fileN.log   --data-source foos
> foo1.csv foo2.csv fooN.csv  --data-source bar bar.xlsx
> 
> It so happens that here you don't really have a good control about the
> number of files associated to the name, so, maybe yet another reason to not
> differentiate names and groups.
> 
>> I disagree here - I think a name would be used more often. I added
>> the "group" as an afterthought since some grouping could be useful
> 
> 
> We do agree in that. What I said is that the *syntax* should be so that the
> group comes first. It's still optional. Like this:
> --data-source group:name /somewhere
> --data-source name /somewhere

That comes down to personal preference, e.g. chown uses "owner[:group]"
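
For illustration, a minimal Java sketch of parsing the "name[:group]=location" form discussed for --datasource. The class `DataSourceSpec` is hypothetical, and both fallbacks (the "default" group, and using the location as the name when no name is given) are assumptions of this sketch rather than confirmed behaviour.

```java
// Hypothetical sketch; names are illustrative, not the real CLI code.
final class DataSourceSpec {
    final String name;
    final String group;
    final String location;

    private DataSourceSpec(String name, String group, String location) {
        this.name = name;
        this.group = group;
        this.location = location;
    }

    // Parses "name[:group]=location".
    // Splits on the first '=', so an unnamed location that itself contains '='
    // (e.g. a URL with a query string) would be misread by this sketch.
    static DataSourceSpec parse(String arg) {
        int eq = arg.indexOf('=');
        if (eq < 0) {
            // Assumption: without an explicit name, use the location as the name.
            return new DataSourceSpec(arg, "default", arg);
        }
        String key = arg.substring(0, eq);
        String location = arg.substring(eq + 1);
        int colon = key.indexOf(':');
        String name = colon < 0 ? key : key.substring(0, colon);
        // Assumption: a missing group falls back to "default".
        String group = colon < 0 ? "default" : key.substring(colon + 1);
        return new DataSourceSpec(name, group, location);
    }
}
```

Note that ":someGroup=logs/a.log" yields an empty name here, which is exactly the unintended case mentioned further down in the thread; flipping the order to "group:name" would only change the two lines that split the key.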

> 
> On Sat, Feb 29, 2020 at 7:34 PM Siegfried Goeschl <
> siegfried.goeschl@gmail.com> wrote:
> 
>> HI Daniel,
>> 
>> See my comments below
>> 
>> Thanks in advance,
>> 
>> Siegfried Goeschl
>> 
>> 
>>> On 29.02.2020, at 19:08, Daniel Dekany <daniel.dekany@gmail.com> wrote:
>>> 
>>> FREEMARKER-135 freemarker-generator-cli: Support user-supplied names for
>>> datasources
>>> 
>>> So, I can do this to have both a name and a group associated to a data
>>> source:
>>> --datasource someName:someGroup=somewhere/something
>> 
>> Correct
>> 
>>> Or if I only want a name, but not a group (or an ""  group actually -
>>> bug?), then:
>>> --datasource someName=somewhere/something
>> 
>> Correct
>> 
>>> 
>>> Or if only a group but not a name (or a "" name actually) then:
>>> --datasource :someGroup=somewhere/something
>> 
>> Mhmm, that would be unintended functionality from my side - current
>> approach is that every "Document" / "Datasource / DataSource" is named
>> 
>>> 
>>> A name must identify exactly 1 data source, while a group identifies a
>> list
>>> of data sources.
>> 
>> No, every "Document" / "Datasource / DataSource" has a name currently but
>> uniqueness is not enforced. Only if you want to get a "Document" /
>> "Datasource / DataSource" by its exact name do I check for exactly one
>> search hit and throw an exception otherwise. I try to provide a useful name
>> even when the content is coming from a URL or STDIN (and I will probably add
>> environment variables as "Document" / "Datasource / DataSource", e.g.
>> configuration in the cloud as JSON content passed as environment variable)
>> 
>>> 
>>> Does this idea, that a data source can be part of a group and then
>>> is also possibly identifiable by a name, come from a use case? I mean,
>>> it's possibly important somewhere, but if so, then it's strange that you
>>> can put something into only a single group. If we need this kind of thing,
>>> then perhaps you should just be allowed to associate the data source with a
>>> list of names (kind of like tagging), and then when the template wants to
>>> get something by name, it will tell there if it expects exactly one or a
>>> list of data sources. Then you don't need to introduce two terms in the
>>> documentation either (names and groups). Again, if we want this at all,
>>> instead of just going with a data source that itself gives a list. (And if
>>> not, how will we handle a data source that loads from a non-file source?)
>> 
>> I actually thought of implementing tagging but considered a "group"
>> sufficient.
>> 
>> * If you don't define anything everything goes into the "default" group
>> * For individual documents you can define a name and an optional group
>> 
>> I think we have a different understanding what a "Document" / "Datasource
>> / DataSource" should do
>> 
>> * It is dumb
>> * It is lazy since data is only loaded on demand
>> * There is no automagic like "oh, this is a JSON file, so let's go to the
>> JSON tool and create a map readily accessible in the data model"
>> 
>>> 
>>> Note that the current command line syntax doesn't work well with shell
>>> wildcard expansion. Like this:
>>> --datasource :someGroup=logs/*.log
>>> will try to expand ":someGroup=logs/*.log", and because it finds nothing
>>> (and because the rules of sh and the like are a mess), you will get the
>>> parameter value as is, without * expanded.
>> 
>> The joy of programming - I did not intend to use "name:group" together
>> with wildcards :-)
>> 
>>> 
>>> Also, I think the syntax with colon should be flipped, because in other
>>> places foo:bar usually means that foo is the bigger unit (the container),
>>> and bar is the smaller unit (the child).
>> 
>> I disagree here - I think a name would be used more often. I added
>> the "group" as an afterthought since some grouping could be useful
>> 
>>> 
>>> On Sat, Feb 29, 2020 at 5:03 PM Siegfried Goeschl <
>>> siegfried.goeschl@gmail.com> wrote:
>>> 
>>>> Hi Daniel,
>>>> 
>>>> I'm an enterprise developer - bad habits die hard :-)
>>>> 
>>>> So I closed the following tickets and merged the branches
>>>> 
>>>> 1) FREEMARKER-129 freemarker-generator: Merge "freemarker-cli" into
>>>> "freemarker-generator"
>>>> 2) FREEMARKER-134 freemarker-generator: Rename "Document" to "Datasource"
>>>> 3) FREEMARKER-135 freemarker-generator-cli: Support user-supplied names
>>>> for datasources
>>>> 
>>>> Thanks in advance,
>>>> 
>>>> Siegfried Goeschl
>>>> 
>>>> 
>>>>> On 29.02.2020, at 12:19, Daniel Dekany <daniel.dekany@gmail.com> wrote:
>>>>> 
>>>>> Yeah, and of course, you can merge that branch. You can even work on the
>>>>> master directly after all.
>>>>> 
>>>>> On Sat, Feb 29, 2020 at 12:17 PM Daniel Dekany <daniel.dekany@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> But, I do recognize the cattle use case (several "faceless" files with
>>>>>> common format/schema). Only, my idea is to push that complexity on the data
>>>>>> source. The "data source" concept shields the rest of the application from
>>>>>> the details of how the data is stored or retrieved. So, a data source might
>>>>>> load a bunch of log files from a directory, and present them as a single
>>>>>> big table, or like a list of tables, etc. So I want to deal with the cattle
>>>>>> use case, but the question is what part of the architecture will deal
>>>>>> with this complication, in other words, how do you box things. Why my
>>>>>> initial bet is to stuff that complication into the "data source"
>>>>>> implementation(s) is that data sources are inherently varied. Some return
>>>>>> a table-like thing, some have multiple named tables (worksheets in Excel),
>>>>>> some return a tree of nodes (XML), etc. So then, some might return a
>>>>>> list-of-list-of log records, or just a single list of log-records (put
>>>>>> together from daily log files). That way cattle don't add to conceptual
>>>>>> complexity. Now, you might be aware of cases where the cattle concept must
>>>>>> be more exposed than this, and then we can't box things like this. But this
>>>>>> is what I tried to express.
>>>>>> 
>>>>>> Regarding "output generators", and how that applies on the command line. I
>>>>>> think it's important that the common core between Maven and command-line is
>>>>>> as fat as possible. Ideally, they are just two syntaxes to set up the same
>>>>>> thing. Mostly at least. So, if you specify a template file to the CLI
>>>>>> application, in a way so that it causes it to process that template to
>>>>>> generate a single output, then there you have just defined an "output
>>>>>> generator" (even if it wasn't explicitly called like that in the command
>>>>>> line). If you specify 3 csv files to the CLI application, in a way so that
>>>>>> it causes it to generate 3 output files, then you have just defined 3
>>>>>> "output generators" there (there's at least one template specified there
>>>>>> too, but that wasn't an "output generator" itself, it was just an attribute
>>>>>> of the 3 output generators). If you specify 1 template, and 3 csv files, in
>>>>>> a way so that it will yield 4 output files (1 for the template, 3 for the
>>>>>> csv-s), then you have defined 4 output generators there. If you have a data
>>>>>> source that loads a list of 3 entities (say, 3 csv files, so it's a list of
>>>>>> tables then), and you have 2 templates, and you tell the CLI to execute
>>>>>> each template for each item in said data source, then you have just defined
>>>>>> 6 "output generators".
>>>>>> 
>>>>>> On Fri, Feb 28, 2020 at 11:08 AM Siegfried Goeschl <
>>>>>> siegfried.goeschl@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Daniel,
>>>>>>> 
>>>>>>> That all depends on your mental model and work you do, expectations,
>>>>>>> experience :-)
>>>>>>> 
>>>>>>> 
>>>>>>> __Document Handling__
>>>>>>> 
>>>>>>> *"But I think actually we have no good use case for list of documents
>>>>>>> that's passed at once to a single template run, so, we can just ignore
>>>>>>> that complication"*
>>>>>>> 
>>>>>>> In my case that's not a complication but my daily business - I'm
>>>>>>> regularly wading through access logs - yesterday probably a couple of
>>>>>>> hundred access logs across two staging sites to help track down some
>>>>>>> strange API gateway issues :-)
>>>>>>> 
>>>>>>> My gut feeling is (borrowing from
>>>>>>> https://medium.com/@Joachim8675309/devops-concepts-pets-vs-cattle-2380b5aab313
>>>>>>> )
>>>>>>> 
>>>>>>> 1. You have a few lovely named documents / templates - `pets`
>>>>>>> 2. You have tons of anonymous documents / templates to process -
>>>>>>> `cattle`
>>>>>>> 3. The "grey area" comes into play when mixing `pets & cattle`
>>>>>>> 
>>>>>>> `freemarker-cli` was built with 2) in mind and I want to cover 1) since
>>>>>>> it is equally important and common.
>>>>>>> 
>>>>>>> 
>>>>>>> __Template And Document Processing Modes__
>>>>>>> 
>>>>>>> IMHO it is important to answer the following question : "How many
>>>>>>> outputs do you get when rendering 2 templates and 3 datasources? Two,
>>>>>>> Three or Six?"
>>>>>>> 
>>>>>>> Your answer is influenced by your mental model / experience
>>>>>>> 
>>>>>>> * When wading through tons of CSV files, access logs, etc. the answer is
>>>>>>> "2"
>>>>>>> * When doing source code generation the obvious answer is "6"
>>>>>>> * Can't imagine a use case which results in "3" but I'm pretty sure we
>>>>>>> will encounter one
>>>>>>> 
>>>>>>> __Template and document mode probably shouldn't exist__
>>>>>>> 
>>>>>>> That's hard for me to fully understand - I definitely lack your insights
>>>>>>> & experience writing such tools :-)
>>>>>>> 
>>>>>>> Defining the `Output Generator` is the underlying model for the Maven
>>>>>>> plugin (and probably FMPP).
>>>>>>> 
>>>>>>> I'm not sure if this applies to command lines, at least not in the way I
>>>>>>> use them (or would like to use them)
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> 
>>>>>>> Siegfried Goeschl
>>>>>>> 
>>>>>>> PS: Can/shall I merge the PR to bring in `freemarker-cli`?
>>>>>>> 
>>>>>>> 
>>>>>>> On 28 Feb 2020, at 9:14, Daniel Dekany wrote:
>>>>>>> 
>>>>>>>> Yeah, "data source" is surely a too popular name, but for a reason.
>>>>>>>> Does anyone have other ideas?
>>>>>>>> 
>>>>>>>> As of naming data sources and such. One thing I was wondering about back
>>>>>>>> then is how to deal with list of documents given to a template, versus
>>>>>>>> exactly 1 document given to a template. But I think actually we have no
>>>>>>>> good use case for list of documents that's passed at once to a single
>>>>>>>> template run, so, we can just ignore that complication. A document has a
>>>>>>>> name, and that's always just a single document, not a collection, as far as
>>>>>>>> the template is concerned. (We can have multiple documents per run, but
>>>>>>>> those normally yield separate output generators, so it's still only one
>>>>>>>> document per template.) However, we can have data source types (document
>>>>>>>> types with old terminology) that collect together multiple data files. So
>>>>>>>> then that complexity is encapsulated into the data source type, and doesn't
>>>>>>>> complicate the overall architecture. That's another case when a data source
>>>>>>>> is not just a file. Like maybe there's a data source type that loads all
>>>>>>>> the CSV-s from a directory, into a single big table (I had such a case), or
>>>>>>>> even into a list of tables. Or, as I mentioned already, a data source is
>>>>>>>> maybe an SQL query on a JDBC data source (and we got the first term
>>>>>>>> clash... JDBC also calls them data sources).
>>>>>>>> 
>>>>>>>> Template and document mode probably shouldn't exist from user perspective
>>>>>>>> either, at least not as a global option that must apply to everything in a
>>>>>>>> run. They could just give the files that define the "output generators",
>>>>>>>> and some of them will be templates, some of them are data files, in which
>>>>>>>> case a template needs to be associated with them (and there can be a couple
>>>>>>>> of ways of doing that). And then again, there are the cases where you want
>>>>>>>> to create one output generator per entity from some data source.
>>>>>>>> 
>>>>>>>> On Fri, Feb 28, 2020 at 8:23 AM Siegfried Goeschl <
>>>>>>>> siegfried.goeschl@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Daniel,
>>>>>>>>> 
>>>>>>>>> See my comments below - and thanks for your patience and input :-)
>>>>>>>>> 
>>>>>>>>> *Renaming Document To DataSource*
>>>>>>>>> 
>>>>>>>>> Yes, makes sense. I tried to avoid it since I'm using javax.activation and
>>>>>>>>> its DataSource.
>>>>>>>>> 
>>>>>>>>> *Template And Document Mode*
>>>>>>>>> 
>>>>>>>>> Agreed - I think it is a valuable abstraction for the user but it is not
>>>>>>>>> an implementation concept :-)
>>>>>>>>> 
>>>>>>>>> *Document Without Symbolic Names*
>>>>>>>>> 
>>>>>>>>> Also agreed and it is going to change but I have not made up my mind yet
>>>>>>>>> about what exactly to implement.
>>>>>>>>> 
>>>>>>>>> Thanks in advance,
>>>>>>>>> 
>>>>>>>>> Siegfried Goeschl
>>>>>>>>> 
>>>>>>>>> On 28 Feb 2020, at 1:05, Daniel Dekany wrote:
>>>>>>>>> 
>>>>>>>>> A few quick thoughts on that:
>>>>>>>>> 
>>>>>>>>> - We should replace the "document" term with something more descriptive. It
>>>>>>>>> doesn't tell that it's some kind of input. Also, most of these inputs
>>>>>>>>> aren't something that people typically call documents. Like a csv file, or
>>>>>>>>> a database table, which is not even a file (OK we don't support such a thing
>>>>>>>>> at the moment). I think, maybe "data source" is a safe enough term. (It
>>>>>>>>> also rhymes with data model.)
>>>>>>>>> - You have separate "template" and "document" "mode", that applies to a
>>>>>>>>> whole run. I think such specialization won't be helpful. We could just say,
>>>>>>>>> on the conceptual level at least, that we need a set of "output
>>>>>>>>> generators". An output generator is an object (in the API) that specifies a
>>>>>>>>> template, a data-model (where the data-model is possibly populated with
>>>>>>>>> "documents"), and an output "sink" (a file path, or stdout), and can
>>>>>>>>> generate the output itself. A practical way of defining the output
>>>>>>>>> generators in a CLI application is via a bunch of files, each defining an
>>>>>>>>> output generator. Some of those files are maybe templates (that you can
>>>>>>>>> even detect from the file extension), or data files that we currently call
>>>>>>>>> "documents". They could freely mix inside the same run. I have also met a
>>>>>>>>> use case where you have a single table (single "document"), and each record
>>>>>>>>> in it yields an output file. That can also be described in some file
>>>>>>>>> format, or really in any other way, like directly in command line arguments,
>>>>>>>>> via API, etc.
>>>>>>>>> - You have multiple documents without associated symbolic names in some
>>>>>>>>> examples. Templates can't identify those then in a well maintainable way.
>>>>>>>>> The actual file name is often not a good identifier, can change over time,
>>>>>>>>> and you might not even have good control over it, like you already
>>>>>>>>> receive it as a parameter from somewhere else, or someone moves/renames
>>>>>>>>> the files that you need to read. Index is also not very good, but I have
>>>>>>>>> written about that earlier.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Feb 26, 2020 at 9:33 AM Siegfried Goeschl <
>>>>>>>>> siegfried.goeschl@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi folks,
>>>>>>>>> 
>>>>>>>>> still wrapping my head around it but assembled some thoughts here -
>>>>>>>>> https://gist.github.com/sgoeschl/b09b343a761b31a6c790d882167ff449
>>>>>>>>> 
>>>>>>>>> Thanks in advance,
>>>>>>>>> 
>>>>>>>>> Siegfried Goeschl
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 23 Feb 2020, at 23:14, Daniel Dekany <ddekany@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> What you are describing is more like the angle that FMPP took initially,
>>>>>>>>> where templates drive things, they generate the output for themselves (even
>>>>>>>>> multiple output files if they wish). By default the output file's name (and
>>>>>>>>> relative path) is deduced from the template name. There was also a global
>>>>>>>>> data-model, built in a configuration file (or equally, built via command
>>>>>>>>> line arguments, or both mixed), from which templates get whatever data they
>>>>>>>>> are interested in. Take a look at the figures here:
>>>>>>>>> http://fmpp.sourceforge.net/qtour.html. Later, this concept was generalized
>>>>>>>>> a bit more, because you could add XML files at the same place where you
>>>>>>>>> have the templates, and then you could associate transform templates to the
>>>>>>>>> XML files (based on path pattern and/or the XML document element). Now
>>>>>>>>> that's like what freemarker-generator had initially (data files drive
>>>>>>>>> output, and the template is there to transform it).
>>>>>>>>> 
>>>>>>>>> So I think the generic mental model would look like this:
>>>>>>>>> 
>>>>>>>>> 1. You got files that drive the process, let's call them *generator
>>>>>>>>> files* for now. Usually, each generator file yields an output file (but
>>>>>>>>> maybe even multiple output files, as you might have seen in the last
>>>>>>>>> figure). These generator files can be of many types, like XML, JSON, XLSX
>>>>>>>>> (as in the original freemarker-generator), and even templates (as is the
>>>>>>>>> norm in FMPP). If the file is not a template, then you got a set of
>>>>>>>>> transformer templates (-t CLI option) in a separate directory, which can be
>>>>>>>>> associated with the generator files based on name patterns, and even based
>>>>>>>>> on content (schema usually). If the generator file is a template (so that's
>>>>>>>>> a positional @Parameter CLI argument that happens to be an *.ftl, and is
>>>>>>>>> not a template file specified after the "-t" option), then you just
>>>>>>>>> Template.process(...) it, and it prints what the output will be.
>>>>>>>>> 2. You also have a set of variables, the global data-model, that
>>>>>>>>> contains commonly useful stuff, like what you now call parameters (CLI
>>>>>>>>> -Pname=value), but also maybe data loaded from JSON, XML, etc. Those data
>>>>>>>>> files aren't "generator files". Templates just use them if they need them.
>>>>>>>>> An important thing here is to reuse the same mechanism to read and parse
>>>>>>>>> those data files, which was used in templates when transforming generator
>>>>>>>>> files. So we need a common format for specifying how to load data files.
>>>>>>>>> That's maybe just FTL that #assigns to the variables, or maybe a more
>>>>>>>>> declarative format.
>>>>>>>>> 
>>>>>>>>> What I have described in the original post here was a less generic form of
>>>>>>>>> this, as I tried to be true to the original approach. I thought the
>>>>>>>>> proposal would be drastic enough as it is... :) There, the "main" document
>>>>>>>>> is the "generator file" from point 1, the "-t" template is the transform
>>>>>>>>> template for the "main" document, and the other named documents ("users",
>>>>>>>>> "groups") are a poor man's shared data-model from point 2 (together with
>>>>>>>>> -PName=value).
>>>>>>>>> 
>>>>>>>>> There's a further somewhat confusing thing to get right with the
>>>>>>>>> list-of-documents (`DocumentList`, `NamedDocumentLists`) thing though. In
>>>>>>>>> the model above, as per point 1, if you list multiple data files, each will
>>>>>>>>> generate a separate output file. So, if you need to take in a list of files
>>>>>>>>> to transform it to a single output file (or at least with a single transform
>>>>>>>>> template execution), then you have to be explicit about that, as that's not
>>>>>>>>> the default behavior anymore. But it's still absolutely possible. Imagine
>>>>>>>>> it as if a "list of XLSX-es" is itself a file format. You need some CLI
>>>>>>>>> (and Maven config, etc.) syntax to express that, but that shouldn't be a
>>>>>>>>> big deal.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sun, Feb 23, 2020 at 9:43 PM Siegfried Goeschl <
>>>>>>>>> siegfried.goeschl@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Daniel,
>>>>>>>>> 
>>>>>>>>> Good timing - I was looking at a similar problem from a different angle
>>>>>>>>> yesterday (see below)
>>>>>>>>> 
>>>>>>>>> Don't have enough time to answer your email in detail now - will do that
>>>>>>>>> tomorrow evening
>>>>>>>>> 
>>>>>>>>> Thanks in advance,
>>>>>>>>> 
>>>>>>>>> Siegfried Goeschl
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> === START
>>>>>>>>> # FreeMarker CLI Improvement
>>>>>>>>> ## Support Of Multiple Template Files
>>>>>>>>> Currently we support the following combinations
>>>>>>>>> 
>>>>>>>>> * Single template and no data files
>>>>>>>>> * Single template and one or more data files
>>>>>>>>> 
>>>>>>>>> But we can not support the following use case which is quite typical in
>>>>>>>>> the cloud
>>>>>>>>> 
>>>>>>>>> __Convert multiple templates with a single data file, e.g. copying a
>>>>>>>>> directory of configuration files using a JSON configuration file__
>>>>>>>>> 
>>>>>>>>> ## Implementation notes
>>>>>>>>> * When we copy a directory we can remove the `ftl` extension on the fly
>>>>>>>>> * We might need an `exclude` filter for the copy operation
>>>>>>>>> * Initially resolve to a list of template files and process one after
>>>>>>>>> another
>>>>>>>>> * Need to calculate the output file location and extension
>>>>>>>>> * We need to rename the existing command line parameters (see below)
>>>>>>>>> * Do we need multiple include and exclude filters?
>>>>>>>>> * Do we need file versus directory filters?
>>>>>>>>> 
>>>>>>>>> ### Command Line Options
>>>>>>>>> ```
>>>>>>>>> --input-encoding : Encoding of the documents
>>>>>>>>> --output-encoding : Encoding of the rendered template
>>>>>>>>> --template-encoding : Encoding of the template
>>>>>>>>> --output : Output file or directory
>>>>>>>>> --include-document : Include pattern for documents
>>>>>>>>> --exclude-document : Exclude pattern for documents
>>>>>>>>> --include-template : Include pattern for templates
>>>>>>>>> --exclude-template : Exclude pattern for templates
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> ### Command Line Examples
>>>>>>>>> ```text
>>>>>>>>> # Copy all FTL templates found in "ext/config" to the "/config" directory
>>>>>>>>> # using the data from "config.json"
>>>>>>>>> freemarker-cli -t ./ext/config --include-template *.ftl -o /config config.json
>>>>>>>>> freemarker-cli --template ./ext/config --include-template *.ftl --output /config config.json
>>>>>>>>> 
>>>>>>>>> # Basically the same using a named document "configuration"
>>>>>>>>> # It might make sense to expose "conf" directly in the FreeMarker data model
>>>>>>>>> # It might make sense to allow URIs for loading documents
>>>>>>>>> freemarker-cli -t ./ext/config/*.ftl -o /config -d configuration=config.json
>>>>>>>>> freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=config.json
>>>>>>>>> freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=file:///config.json
>>>>>>>>> 
>>>>>>>>> # Basically the same using an environment variable as named document
>>>>>>>>> freemarker-cli -t ./ext/config --include-template *.ftl -o /config -d configuration=env:///CONFIGURATION
>>>>>>>>> freemarker-cli --template ./ext/config --include-template *.ftl --output /config --document configuration=env:///CONFIGURATION
>>>>>>>>> ```
>>>>>>>>> === END
>>>>>>>>> 
>>>>>>>>> On 23.02.2020, at 16:37, Daniel Dekany <ddekany@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> Input documents is a fundamental concept in freemarker-generator, so we
>>>>>>>>> should think about that more, and probably refine/rework how it's done.
>>>>>>>>> 
>>>>>>>>> Currently it works like this, with CLI at least.
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> -t access-report.ftl
>>>>>>>>> somewhere/foo-access-log.csv
>>>>>>>>> 
>>>>>>>>> Then in access-report.ftl you have to do something like this:
>>>>>>>>> 
>>>>>>>>> <#assign doc = Documents.get(0)>
>>>>>>>>> ... process doc here
>>>>>>>>> 
>>>>>>>>> (The more idiomatic Documents[0] won't work. Actually, that led to a funny
>>>>>>>>> chain of coincidences: It returned the string "D", then CSVTool.parse(...)
>>>>>>>>> happily parsed that to a table with the single column "D", and 0 rows, and
>>>>>>>>> as there were 0 rows, the template didn't run into an error because
>>>>>>>>> row.myExpectedColumn refers to a missing column either, so the process
>>>>>>>>> finished with success. (: Pretty unlucky for sure. The root was
>>>>>>>>> unintentionally breaking a FreeMarker idiom though; eventually we will have
>>>>>>>>> to work on those too, but, different topic.)
>>>>>>>>> 
>>>>>>>>> However, actually multiple input documents can be passed in:
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> -t access-report.ftl
>>>>>>>>> somewhere/foo-access-log.csv
>>>>>>>>> somewhere/bar-access-log.csv
>>>>>>>>> 
>>>>>>>>> The above template will still work, though then you ignored all but the
>>>>>>>>> first document. So if you expect any number of input documents, you
>>>>>>>>> probably will have to do this:
>>>>>>>>> 
>>>>>>>>> <#list Documents.list as doc>
>>>>>>>>> ... process doc here
>>>>>>>>> </#list>
>>>>>>>>> 
>>>>>>>>> (The more idiomatic <#list Documents as doc> won't work; but again, those
>>>>>>>>> we will work out in a different thread.)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> So, what would be better, in my opinion. I start out from what I think are
>>>>>>>>> the common use cases, in decreasing order of frequency. Goal is to make
>>>>>>>>> those less error prone for the users, and simpler to express.
>>>>>>>>> 
>>>>>>>>> USE CASE 1
>>>>>>>>> 
>>>>>>>>> You have exactly 1 input document, which is therefore simply "the"
>>>>>>>>> document in the mind of the user. This is probably the typical use case,
>>>>>>>>> or at least the use case users typically start out from when starting the
>>>>>>>>> work.
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> -t access-report.ftl
>>>>>>>>> somewhere/foo-access-log.csv
>>>>>>>>> 
>>>>>>>>> Then `Documents.get(0)` is not very fitting. Most importantly, it's
>>>>>>>>> error prone, because if the user passed in more than 1 document
>>>>>>>>> (which can even happen totally accidentally, like if the user was
>>>>>>>>> lazy and used a wildcard that the shell expanded), the template
>>>>>>>>> will silently ignore the rest of the documents, and the single
>>>>>>>>> document processed will be practically picked randomly. The user
>>>>>>>>> might not notice that, and submit a bad report or such.
>>>>>>>>> 
>>>>>>>>> I think that in this use case the document should simply be
>>>>>>>>> referred to as `Document` in the template. When you have multiple
>>>>>>>>> documents there, referring to `Document` should be an error, saying
>>>>>>>>> that the template was made to process a single document only.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> USE CASE 2
>>>>>>>>> 
>>>>>>>>> You have multiple input documents, but each has a different role
>>>>>>>>> (different schema, maybe different file type). Like, you pass in
>>>>>>>>> users.csv and groups.csv. Each has a different schema, and so you
>>>>>>>>> want to access them differently, but in the same template.
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> [...]
>>>>>>>>> --named-document users somewhere/foo-users.csv
>>>>>>>>> --named-document groups somewhere/foo-groups.csv
>>>>>>>>> 
>>>>>>>>> Then in the template you could refer to them as:
>>>>>>>>> `NamedDocuments.users`, and `NamedDocuments.groups`.
>>>>>>>>> 
>>>>>>>>> Use Cases 1 and 2 can be unified into a coherent concept, where
>>>>>>>>> `Document` is just a shorthand for `NamedDocuments.main`. It's
>>>>>>>>> called "main" because that's "the" document the template is about,
>>>>>>>>> but then you had to add some helper documents, with symbolic names
>>>>>>>>> representing their role.
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> -t access-report.ftl
>>>>>>>>> --document-name=main somewhere/foo-access-log.csv
>>>>>>>>> --document-name=users somewhere/foo-users.csv
>>>>>>>>> --document-name=groups somewhere/foo-groups.csv
>>>>>>>>> 
>>>>>>>>> Here, `Document` still works in the template, and it refers to
>>>>>>>>> `somewhere/foo-access-log.csv`. (While omitting --document-name=main
>>>>>>>>> above would be cleaner, I couldn't figure out how to do that with
>>>>>>>>> Picocli. Anyway, for now the point is the concept, which is not
>>>>>>>>> specific to CLI.)
>>>>>>>>> 
>>>>>>>>> USE CASE 3
>>>>>>>>> 
>>>>>>>>> Here you have several of the same kind of documents. That has a
>>>>>>>>> more generic sub-use-case, when you have explicitly named documents
>>>>>>>>> (like "users" above), and for some you expect multiple input files.
>>>>>>>>> 
>>>>>>>>> freemarker-cli
>>>>>>>>> -t access-report.ftl
>>>>>>>>> --document-name=main somewhere/foo-access-log.csv
>>>>>>>>> somewhere/bar-access-log.csv
>>>>>>>>> --document-name=users somewhere/foo-users.csv
>>>>>>>>> somewhere/bar-users.csv
>>>>>>>>> --document-name=groups somewhere/global-groups.csv
>>>>>>>>> 
>>>>>>>>> The template must be written with this use case in mind, as now it
>>>>>>>>> has to #list some of the documents. (I think in practice you hardly
>>>>>>>>> ever want to get a document by hard coded index. Either you don't
>>>>>>>>> know how many documents you have, so you can't use hard coded
>>>>>>>>> indexes, or you do, and each index has a specific meaning, but then
>>>>>>>>> you should name the documents instead, as using indexes is error
>>>>>>>>> prone, and hard to read.)
>>>>>>>>> Accessing that list of documents in the template could maybe be
>>>>>>>>> done like this:
>>>>>>>>> - For the "main" documents: `DocumentList`
>>>>>>>>> - For explicitly named documents, like "users":
>>>>>>>>>   `NamedDocumentLists.users`
>>>>>>>>> 
>>>>>>>>> SUMMING UP
>>>>>>>>> 
>>>>>>>>> To unify all 3 use cases into a coherent concept:
>>>>>>>>> - `NamedDocumentLists.<name>` is the most generic form, and while
>>>>>>>>>   you can achieve everything with it, using it requires your
>>>>>>>>>   template to handle the most generic case too. So, I think it
>>>>>>>>>   would be rarely used.
>>>>>>>>> - `DocumentList` is just a shorthand for `NamedDocumentLists.main`.
>>>>>>>>>   It's used if you only have one kind of documents (single format
>>>>>>>>>   and schema), but potentially multiple of them.
>>>>>>>>> - `NamedDocuments.<name>` expresses that you expect exactly 1
>>>>>>>>>   document of the given name.
>>>>>>>>> - `Document` is just a shorthand for `NamedDocuments.main`. This is
>>>>>>>>>   for the most natural/frequent use case.
>>>>>>>>> 
>>>>>>>>> That's 4 possible ways of accessing your documents, which is a
>>>>>>>>> trade-off for the sake of these:
>>>>>>>>> - Catching CLI (or Maven, etc.) input where the template output
>>>>>>>>>   likely will be wrong. That's only possible if the user can
>>>>>>>>>   communicate their intent in the template.
>>>>>>>>> - Users don't need to deal with concepts that are irrelevant in
>>>>>>>>>   their concrete use case. Just start with the trivial, `Document`,
>>>>>>>>>   and later, if the need arises, generalize to named documents,
>>>>>>>>>   document lists, or both.
>>>>>>>>> 
>>>>>>>>> What do you guys think?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards,
>>>>>> Daniel Dekany
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards,
>>>>> Daniel Dekany
>>>> 
>>>> 
>>> 
>>> --
>>> Best regards,
>>> Daniel Dekany
>> 
>> 
> 
> -- 
> Best regards,
> Daniel Dekany


