drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neeraja Rentachintala <nrentachint...@maprtech.com>
Subject Re: select from table with options
Date Thu, 22 Oct 2015 00:02:15 GMT
can you elaborate on what you mean by .drill is a different use case.

In my mind, .drill has 2 use cases - a way to specify hints to Drill on how
read  certain datasets (and potentially optimize the queries on the
datasets) and a way to save the definitions of objects created via Drill
for reuse/access from BI tools. Both these (i.e existing or external tables
vs Drill created or internal tables) currently are not differentiated in
Drill, hence I believe can use the same model in terms of metadata
handling.

I would be interested in knowing your thoughts.
-Neeraja

On Wed, Oct 21, 2015 at 4:55 PM, Julien Le Dem <julien@dremio.com> wrote:

> I think of .drill files as a different use case but there is potentially
> some overlap.
> Some things to keep in mind:
>  - The person analyzing the data has often read-only access.
>  - having to write a config file on one end of the system and then query on
> the other end is not analyst friendly
>
> We should definitely keep .drill in mind while design this.
> Although I'm thinking we should probably discuss .drill on a separate
> thread.
>
>
>
> On Wed, Oct 21, 2015 at 3:55 PM, Neeraja Rentachintala <
> nrentachintala@maprtech.com> wrote:
>
> > Another alternative to do this to specify a metadata file (.drill files)
> > that came up in some of the earlier discussions to solve similar use
> cases.
> > Rather than centrally defining configurations in storage plugin (which is
> > what Drill does today), .drill files allow more granularity , potentially
> > at folder or individual file which will override the central
> configuration.
> >
> > I think the benefit of the metadata file is it can be used for other
> > purposes (such as to stats etc). Another benefit is that if you are
> using a
> > BI/query tool to trigger Drill SQL queries, this will work seamlessly
> > rather than having to rewrite the query for custom syntax.
> >
> > I would like to know what others think of this approach.
> >
> > -Neeraja
> >
> >
> >
> > On Wed, Oct 21, 2015 at 3:43 PM, Julien Le Dem <julien@dremio.com>
> wrote:
> >
> > > I like the approach of using views to add missing metadata to an
> existing
> > > raw dataset. The raw datasets stay the same and the view becomes the
> > > interface to the data. Nothing is mutated and we know how things are
> > > derived from one another.
> > >
> > > TLDR: I'm trying to summarize the options bellow and add a few
> thoughts:
> > > (please comment whether you think upside/downside elements are valid)
> > > Let's refer to those options with the name in *bold*
> > >
> > > - *urls style params* (my initial strawman): *select * from
> > > dfs.`default`.`/path/to/file/something.psv?type=text&delimiter=|`;*
> > > Extensibility is ensured through the storage plugin interpretation of
> the
> > > path.
> > > I agree with decomposing the syntax of format vs the data path/url. so
> > this
> > > would conflict with HTTP query parameters.
> > > I will not pursue this one but I think it was a great conversation
> > starter!
> > >
> > > - *table functions*: *select *
> > > from delimitedFile(dfs.`default`.`/path/to/file/something.psv`, '|')*
> > > It sounds like these would have to be defined in the corresponding
> > > StoragePlugin as they need intimate knowledge of the underlying
> storage.
> > > Extensibility would come through the storage plugin defining those?
> > > +1 on named parameter vs just positional.
> > > The possible downside is a syntax that could be a little foreign to the
> > > data analyst.
> > >
> > > - *fake-o parameters*. *select *
> > > from dfs.`default`.`/path/to/file/something.psv` where
> > magicFieldDelimiter
> > > = '|';*
> > > I would be inclined to avoid using filters for something that changes
> > what
> > > the data looks like.
> > > This could be unintuitive to users. (at least it feels that way to me)
> > > Extensibility would come through the storage plugin defining predicate
> > push
> > > down rules?
> > >
> > > - *WITH USING OPTIONS*:
> > > In general I would feel more natural to me to put those options as part
> > of
> > > a select statement. the select statement can always be used in a WITH
> > > clause.
> > > Extensibility would come through the storage plugin receiving those
> > > options? As the with statement applies to a full select statement with
> > > potentially joins, how would we know where to aply those options?
> > > We have two sub-options:
> > >   - *(type = 'text' AND lineDelimiter = '\n')*: this seems similar to
> > > fake-o
> > > parameters (above), same comment than above. *AND* seems out of place,
> > for
> > > example OR would be forbidden.
> > >   - *{ type: "text", linedDelimiter = "\n"}*: The advantage of this is
> > that
> > > you can re-use the same syntax in the configuration file. This is a
> plus
> > > for consistency. Users would figure out what works and would just have
> to
> > > put it in the central configuration once done.
> > >
> > > - *EXTEND WITH OPTIONS*: *SELECT FROM emp EXTEND [optional columns
> list]
> > > OPTIONS (type 'text', lineDelimiter);*
> > > What would be the column list in that case? Would it be awkward to use
> > > EXTEND without a column list?
> > > Extensibility would come through the storage plugin receiving those
> > extend
> > > options.
> > > It sounds like they could be simply SELECT OPTIONS?
> > >
> > > - *Specific syntax*: *select * FROM mydb.mytable*
> > >
> > >
> > >
> > >
> > > *    TREAT AS TEXT      USING        LINE DELIMITER '\n'        AND
> FIELD
> > > DELIMITER ','        AND TREAT FIRST ROW AS HEADERS*
> > > Extensibility would come through the storage plugin defining a sub
> > grammar
> > > for those options.
> > > Possibly this is harder to implement for the storage plugin
> implementor.
> > > Upside is the user has a SQL like syntax to specify this (although,
> I've
> > > never been fond of parts where SQL is trying to be plain English)
> > >
> > >
> > > *Appendix*: :P
> > > For what it's worth, here is how Pig does it: *LOAD 'data' [USING
> > function]
> > > [AS schema];*
> > > - Just *LOAD '/path/to/my/file'* will use the default Loader (tab
> > separated
> > > values)
> > > - adding a custom Loader* LOAD 'data' USING MyCustomLoader('param1',
> > > 'param2'); *is how you implement custom formats or remote locations
> > (HBase,
> > > Web service, ...)
> > > So you can use the default loader with a different separator (*USING
> > > PigStorage('\t')*) in parameter or write your own.
> > > - The AS clause lets you set the schema. You could have a csv file
> > without
> > > header and define the names and types of each column that way.
> > > Doc: https://pig.apache.org/docs/r0.15.0/basic.html#load
> > >
> > >
> > >
> > > On Wed, Oct 21, 2015 at 8:57 AM, Julian Hyde <jhyde@apache.org> wrote:
> > >
> > > > Whatever API is used to scan files from SQL, there will need to be a
> > > > corresponding way to accomplish the same thing in a user interface.
> > > > Probably a form with various fields, some of them with drop-boxes
> etc.
> > > >
> > > > And ideally a facility that samples a few hundred rows to deduce the
> > > > probable field names and types and which fields are unique.
> > > >
> > > > I think that the UI is the true "user friendly" interface. A usage
> > > > pattern might be for people to define a data source using in the UI,
> > > > save it as a view, then switch to the command line to write queries
> on
> > > > that view.
> > > >
> > > > There are other use cases similar to reading flies. For example you
> > > > would like to read data from an HTTP URL. You might want to specify
> > > > similar parameters for formats, compression, parsing, and parameters
> > > > in the file URI that describe a family of partitioned files. A URL
> > > > might allow push-down of filters, projects and sorts. But still you
> > > > would want to specify formats, compression and parsing the same way
> as
> > > > reading files.
> > > >
> > > > To me, this argues for decomposing the file scan syntax into pieces
> > > > that can be re-used if you get data from places other than files.
> > > >
> > > > Julian
> > > >
> > > >
> > > > On Wed, Oct 21, 2015 at 6:15 AM, Jacques Nadeau <jacques@dremio.com>
> > > > wrote:
> > > > >> This fourth is also least extensible and thus most
> disenfranchising
> > > for
> > > > >> those outside the inner group.
> > > > >>
> > > > >> Table functions (hopefully) would be something that others could
> > > > > implement.
> > > > >
> > > > > This is a brainstorm about a user apis...
> > > > >
> > > > > It isn't appropriate to shoot down ideas immediately in a
> brainstorm.
> > > It
> > > > > has a chilling effect on other people presenting new ideas.
> > > > >
> > > > > User apis should be defined first with an end-user in mind.
> > Consistency
> > > > in
> > > > > different contexts is also very important (note my expansion of the
> > > > > discussion to ASCRIBE METADATA.)
> > > > >
> > > > > Your statement about extensibility has no technical backing. If you
> > > have
> > > > a
> > > > > concern like this, ask the proposer if they think that this could
> be
> > > done
> > > > > in an extensible way. In this case I see a number of ways that this
> > > could
> > > > > be done. Note the mention of the grammar switch in my initial
> > proposal
> > > or
> > > > > go review my outstanding patch for integrating JSON literals. In
> many
> > > > ways,
> > > > > I think this approach could be considered the most extensible and
> > > > > expressive for a Drill extension developer.
> > > > >
> > > > > I hope others will continue to brainstorm on this thread.
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>
>
>
> --
> Julien
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message