drill-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: select from table with options
Date Wed, 21 Oct 2015 22:43:26 GMT
I like the approach of using views to add missing metadata to an existing
raw dataset. The raw datasets stay the same and the view becomes the
interface to the data. Nothing is mutated and we know how things are
derived from one another.
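To make that concrete, here is a rough sketch of what such a view can look like
in Drill today (the writable dfs.tmp workspace and the column names/types are
just placeholders I made up):

    create view dfs.tmp.`something_v` as
    select cast(columns[0] as int) as id,   -- raw text columns come back as the columns[] array
           columns[1]              as name
    from dfs.`default`.`/path/to/file/something.psv`;

The raw file is untouched; queries go against the view, which carries the names
and types.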

TLDR: I'm trying to summarize the options below and add a few thoughts
(please comment on whether you think the upside/downside points are valid).
Let's refer to the options by the name in *bold*.

- *urls style params* (my initial strawman): *select * from
dfs.`default`.`/path/to/file/something.psv?type=text&delimiter=|`;*
Extensibility is ensured through the storage plugin's interpretation of the
path.
I agree with decomposing the format syntax from the data path/URL; besides,
query parameters in the path would conflict with HTTP query parameters.
I will not pursue this one, but I think it was a great conversation starter!

- *table functions*: *select *
from delimitedFile(dfs.`default`.`/path/to/file/something.psv`, '|')*
It sounds like these would have to be defined in the corresponding
StoragePlugin, as they need intimate knowledge of the underlying storage.
Extensibility would come through the storage plugin defining those?
+1 on named parameters vs. just positional.
A possible downside is a syntax that could feel a little foreign to a data
analyst.
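With named parameters this might look something like the sketch below (nothing
here is implemented today; the table() wrapper and the parameter names are just
an assumption for illustration):

    select *
    from table(dfs.`default`.`/path/to/file/something.psv`(
        type => 'text', fieldDelimiter => '|'));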

- *fake-o parameters*. *select *
from dfs.`default`.`/path/to/file/something.psv` where magicFieldDelimiter
= '|';*
I would be inclined to avoid using filters for something that changes what
the data looks like; it could be unintuitive to users (at least it feels that
way to me).
Extensibility would come through the storage plugin defining predicate
push-down rules?
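Part of what feels unintuitive to me is that a "magic" option and a real
predicate would sit in the same WHERE clause (hypothetical example; the magic
field name comes from the strawman above, the second predicate is made up):

    select *
    from dfs.`default`.`/path/to/file/something.psv`
    where magicFieldDelimiter = '|'   -- changes how the file is parsed
      and columns[0] = 'foo';         -- actually filters rows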

- *WITH USING OPTIONS*:
In general it would feel more natural to me to put those options inside the
select statement itself; a select statement can always be used in a WITH
clause.
Extensibility would come through the storage plugin receiving those options?
Since the WITH clause applies to a full select statement, potentially with
joins, how would we know where to apply those options? (see the sketch right
after this option)
We have two sub-options:
  - *(type = 'text' AND lineDelimiter = '\n')*: this seems similar to fake-o
parameters (above), same comment as above. *AND* seems out of place here; for
example, OR would be forbidden.
  - *{ type: "text", lineDelimiter: "\n" }*: the advantage of this one is that
you can re-use the same syntax as in the configuration file. This is a plus
for consistency: users would figure out what works and then just put it in the
central configuration once done.
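For example, if the options attached to each table reference in the FROM
clause (hypothetical syntax and file paths, just to illustrate the placement
question), the join case would stay unambiguous:

    select *
    from dfs.`default`.`/path/to/a.psv` { type: "text", fieldDelimiter: "|" } a
    join dfs.`default`.`/path/to/b.csv` { type: "text", fieldDelimiter: "," } b
      on a.columns[0] = b.columns[0];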

- *EXTEND WITH OPTIONS*: *SELECT FROM emp EXTEND [optional columns list]
OPTIONS (type 'text', lineDelimiter);*
What would the column list be in that case? Would it be awkward to use
EXTEND without a column list? (see the sketch right below)
Extensibility would come through the storage plugin receiving those extend
options.
It sounds like they could simply be SELECT OPTIONS?
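With a column list it might read something like this (hypothetical sketch,
none of it implemented; the column names and types are made up):

    SELECT * FROM emp
      EXTEND (emp_id INT, emp_name VARCHAR)
      OPTIONS (type 'text', lineDelimiter '\n');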

- *Specific syntax*:
*select * FROM mydb.mytable
    TREAT AS TEXT
    USING
        LINE DELIMITER '\n'
        AND FIELD DELIMITER ','
        AND TREAT FIRST ROW AS HEADERS*
Extensibility would come through the storage plugin defining a sub-grammar
for those options.
Possibly this is harder to implement for the storage plugin implementor.
The upside is that the user gets a SQL-like syntax to specify this (although
I've never been fond of the parts where SQL tries to be plain English).


*Appendix*: :P
For what it's worth, here is how Pig does it: *LOAD 'data' [USING function]
[AS schema];*
- Just *LOAD '/path/to/my/file'* will use the default loader (tab-separated
values).
- Adding a custom loader, *LOAD 'data' USING MyCustomLoader('param1',
'param2');*, is how you implement custom formats or remote locations (HBase, a
web service, ...).
So you can use the default loader and just pass it a different separator
(*USING PigStorage('\t')*), or write your own.
- The AS clause lets you set the schema: for a csv file without a header you
can define the name and type of each column that way.
Doc: https://pig.apache.org/docs/r0.15.0/basic.html#load



On Wed, Oct 21, 2015 at 8:57 AM, Julian Hyde <jhyde@apache.org> wrote:

> Whatever API is used to scan files from SQL, there will need to be a
> corresponding way to accomplish the same thing in a user interface.
> Probably a form with various fields, some of them with drop-down boxes, etc.
>
> And ideally a facility that samples a few hundred rows to deduce the
> probable field names and types and which fields are unique.
>
> I think that the UI is the true "user friendly" interface. A usage
> pattern might be for people to define a data source in the UI, save it
> as a view, then switch to the command line to write queries on that
> view.
>
> There are other use cases similar to reading files. For example you
> would like to read data from an HTTP URL. You might want to specify
> similar parameters for formats, compression, parsing, and parameters
> in the file URI that describe a family of partitioned files. A URL
> might allow push-down of filters, projects and sorts. But still you
> would want to specify formats, compression and parsing the same way as
> reading files.
>
> To me, this argues for decomposing the file scan syntax into pieces
> that can be re-used if you get data from places other than files.
>
> Julian
>
>
> On Wed, Oct 21, 2015 at 6:15 AM, Jacques Nadeau <jacques@dremio.com>
> wrote:
> >> This fourth is also least extensible and thus most disenfranchising for
> >> those outside the inner group.
> >>
> >> Table functions (hopefully) would be something that others could
> >> implement.
> >
> > This is a brainstorm about user APIs...
> >
> > It isn't appropriate to shoot down ideas immediately in a brainstorm. It
> > has a chilling effect on other people presenting new ideas.
> >
> > User APIs should be defined first with an end-user in mind. Consistency
> > in different contexts is also very important (note my expansion of the
> > discussion to ASCRIBE METADATA.)
> >
> > Your statement about extensibility has no technical backing. If you have
> > a concern like this, ask the proposer if they think that this could be done
> > in an extensible way. In this case I see a number of ways that this could
> > be done. Note the mention of the grammar switch in my initial proposal or
> > go review my outstanding patch for integrating JSON literals. In many
> > ways, I think this approach could be considered the most extensible and
> > expressive for a Drill extension developer.
> >
> > I hope others will continue to brainstorm on this thread.
>



-- 
Julien
