drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alfredo Serafini <ser...@gmail.com>
Subject Re: user Digest 25 Oct 2017 17:19:29 -0000 Issue 1282
Date Thu, 26 Oct 2017 10:06:08 GMT
Hi Paul, all

thanks for sharing informations and ideas

But, given that Drill does support LIMIT 0, perhaps some enterprising soul
> might want to build the DESCRIBE support on top of it. You’d need some SQL
> syntax (might already be in Calcite), then an operator that converts the
> internal batch schema to a table of Java objects, like is done for system
> options, etc.
>

yes, I agree this would be very useful!


> But, Drill’s claim to fame is schema-on-read, so Drill has no way to
> specify a schema at present. This would be another good weekend project.
> The storage plugin would look for a schema file at plan time, load it, and
> pass the schema information to the reader which would do the required
> conversions on read. The result would be that CSV files behave more like
> Parquet files (which do specify column type as well as name.)


inferring basic types from some row of data is possible in CSV, using some
simple heuristics: I have a small class to do that and having an idea of
where to plug it, it could be used as a starting point. A more solid
approach could be extended later, covering much more cases, as well as
elasticsearch does for example (when in schema-less mode) even for JSON.

Can anybody share a reference to the involved classes where tose naive type
inferences can be added? is it at calcite level maybe?

thanks very much for all the references,
Alfredo




> ---------- Messaggio inoltrato ----------
> From: Paul Rogers <progers@mapr.com>
> To: "user@drill.apache.org" <user@drill.apache.org>
> Cc:
> Bcc:
> Date: Wed, 25 Oct 2017 05:13:54 +0000
> Subject: Re: describe query support? (catalog metadata, etc)
> Hi All,
>
> Drill supports two forms of CSV files: those with headers and those
> without.
>
> When a file has headers (and Drill is configured to read them), you can
> get the schema using the metadata API in xDBC or the native Drill client.
> Do a LIMIT 0 and you’ll get back just the schema of Varchar columns
> inferred from the CSV headers.
>
> On the other hand, if you read the CSV file without headers, then you
> already know the schema: a single `columns` column that is a Varchar array
> — but I suppose that is not terribly useful...
>
> What Drill does not do is return a table-like result with the metadata;
> like what DESCRIBE would do. Instead, you have to use the xDBC metadata API
> to retrieve the information.
>
> But, given that Drill does support LIMIT 0, perhaps some enterprising soul
> might want to build the DESCRIBE support on top of it. You’d need some SQL
> syntax (might already be in Calcite), then an operator that converts the
> internal batch schema to a table of Java objects, like is done for system
> options, etc.
>
> The next issue is that if your CSV file looks like this:
>
> fred,47,1960-05-06,14.56
>
> It does not help much to find out that these are all Varchar columns.
> Would be nice to have some light-weight way of defining the type (not a
> cast; the cast loads the data as Varchar, then does a conversion, which can
> be slow.)
>
> But, Drill’s claim to fame is schema-on-read, so Drill has no way to
> specify a schema at present. This would be another good weekend project.
> The storage plugin would look for a schema file at plan time, load it, and
> pass the schema information to the reader which would do the required
> conversions on read. The result would be that CSV files behave more like
> Parquet files (which do specify column type as well as name.)
>
> - Paul
>
> > On Oct 24, 2017, at 6:21 AM, Charles Givre <cgivre@gmail.com> wrote:
> >
> > @Divya,
> > I’m aware of that, but there is no way as it stands to discover the
> schema of a file, CSV or otherwise.  I’m not talking about attempting to
> infer data types, just the simple number of columns, name of columns and
> what Drill thinks they are.  I know that you can do this via a series of
> queries, but for files, there is no equivalent of DESCRIBE.
> > That’s what I was getting at and I think that would be really useful to
> have.  Also it wouldn’t really alter the character of Drill.
> > — C
> >
> >
> >> On Oct 24, 2017, at 02:02, Divya Gehlot <divya.htconex@gmail.com>
> wrote:
> >>
> >> @Charles : when you read csv files can casting and get the data type as
> per
> >> your requirements
> >>
> >> Hope I answer what you asked for :)
> >>
> >> HTH
> >> Thanks,
> >> Divya
> >>
> >> On 20 October 2017 at 01:42, Charles Givre <cgivre@gmail.com> wrote:
> >>
> >>> Hi Alfredo,
> >>> When I was trying to get Drill to work with various BI tools, all I
> really
> >>> needed was a list of columns.  Data types would be a big bonus, but
> Drill
> >>> interprets CSV data as plain text anyway.  It would be really useful
> for
> >>> other file types where Drill does infer data types.
> >>>
> >>>
> >>> — C
> >>>
> >>>
> >>>> On Oct 19, 2017, at 6:13 AM, Alfredo Serafini <seralf@gmail.com>
> wrote:
> >>>>
> >>>> Hi thanks for the replies!
> >>>>
> >>>> @Chun yes using Views is an approach I considered, and I like it also
> >>>> methodologically, in order to have some change to "prepare" the data
> >>>> just a bit. I'm testing drill as a sort of data facade for tools which
> >>>> handles mappings to other context, so this could be helpful for me.
> >>>>
> >>>> Anyway I have some concerns regardings metadata/catalog support for
> >>>> views too: it seems that every view is saved on disk as a JSON file,
> >>>> then experimenting the same issues. Are you suggesting saving views
to
> >>>> some kind of relational database storage for staging purposes? Is that
> >>>> possible?
> >>>>
> >>>> Sorry for all the questions :-)
> >>>>
> >>>>
> >>>> @Charles yes Metabase (or Tableau, Superset, and so on...) is another
> >>>> use case in which it would be great to connect them to explore data
> >>>> with the capabilities of drill, and even for an initial exploration
of
> >>>> data since sometimes reducing the initial analysis phase time could
> >>>> help with development.
> >>>>
> >>>> For CSV it would be possible IMHO to guess types in a very basic way,
> >>>> at least using basic types and map columns to a text/String when a
> >>>> type can't be inferreed. It could be a starting point, and probably
> >>>> the more confortable case where to start for the (partial) support of
> >>>> catalog informations (JSON would be more complex, just to say). If
> >>>> there are standard interfaces that can be extended/implemented for
> >>>> filling them with those informations I'd like to do some
> >>>> experimentation on that, if it's not too complex to follow, and if
> >>>> someone can point me to a good place where to start for doing some
> >>>> experiments of a possible implementation, for the CSV case.
> >>>>
> >>>> Thanks for the comments, I appreciate them
> >>>>
> >>>> Alfredo
> >>>>
> >>>>
> >>>>
> >>>> I’d like to second Alfredo’s request.  I’ve been trying to get
Drill
> >>>> to work with some
> >>>>> open source visualization tools such as SqlPad and Metabase and
the
> >>> issue I keep running into
> >>>>> is that Drill doesn’t have a convenient way to describe how it
> >>> interprets flat files.  This
> >>>>> is really frustrating for me since this is my main use of Drill!
> >>>>> I wish the SELECT * FROM <data> LIMIT 0 worked in the RESTFul
> >>> interface.  In any event,
> >>>>> would be very useful to have some way to get Drill to describe how
it
> >>> will interpret a flat
> >>>>> file.
> >>>>> — C
> >>>>
> >>>>
> >>>>
> >>>>> On Oct 18, 2017, at 15:20, Chun Chang <cchang@mapr.com> wrote:
> >>>>>
> >>>>> There were discussions on the need of building a catalog for drill.
> But
> >>> I don't think
> >>>> that's the focus right now. And I am not sure the community will ever
> >>>> decide to go in that
> >>>> direction. For now, you best bet is to create views on top of your
> >>>> JSON/CSV data.
> >>>>>
> >>>>> ________________________________
> >>>>> From: Alfredo Serafini <seralf@gmail.com>
> >>>>> Sent: Wednesday, October 18, 2017 8:31:15 AM
> >>>>> To: user@drill.apache.org
> >>>>> Subject: describe query support? (catalog metadata, etc)
> >>>>>
> >>>>> Hi I'm experimenting using Drill as a data virtualization component
> via
> >>>>> JDBC and it generally works great for my needs.
> >>>>>
> >>>>> However some of the components connected via JDBC needs basic
> >>>>> metadata/catalog informations, and they seems to be missing for
JSON
> /
> >>> CSV
> >>>>> sources.
> >>>>>
> >>>>> For example the simple query
> >>>>>
> >>>>> DESCRIBE cp.`employee.json`;
> >>>>>
> >>>>> returns no results.
> >>>>>
> >>>>> Another possible example case could be when reading from an sqlite
> >>> source
> >>>>> containing the same data on an `employees` table
> >>>>> DESCRIBE `emploees`
> >>>>>
> >>>>> and still get no information: while this command is not directly
> >>> supported
> >>>>> in SQLite, an equivalent one could be for instance:
> >>>>> PRAGMA table_info(`employees`);
> >>>>>
> >>>>> but trying to execute it in Drill is not possible, as it is beyond
> the
> >>>>> supported standard SQL dialect.
> >>>>>
> >>>>> Moreover using a query like:
> >>>>> SELECT *
> >>>>> FROM INFORMATION_SCHEMA.COLUMNS
> >>>>> WHERE (TABLE_NAME='employees_view');
> >>>>>
> >>>>> on a view from the same data, seems to return the informations,
so I
> >>>>> suppose there should be a way to pass those informations to an
> >>>>> internal *DatabaseMetaData
> >>>>> <https://docs.oracle.com/javase/8/docs/api/java/sql/
> >>> DatabaseMetaData.html>*
> >>>>> implementation.
> >>>>> I wonder if there is such a component designed to manage all the
> catalog
> >>>>> informations for different sources?
> >>>>>
> >>>>> In this case it could adopt different strategies for retrieving
> >>> metadata,
> >>>>> depending on the case: for sqlite a different command / dialect
> could be
> >>>>> used, for CSV types could be guessed using simple heuristics, and
so
> on.
> >>>>> Probably cases like JSON would be much more complex, anyway.
> >>>>> Once the metadata have been retrieved for a source, I suppose the
> >>> standard
> >>>>> SQL dialect should work as expected.
> >>>>>
> >>>>>
> >>>>> Are there any plans to add catalog metadata support for various
> sources?
> >>>>> Does anybody have some workaround? for example using views or similar
> >>>>> approaches?
> >>>>>
> >>>>>
> >>>>> thanks in advance, sorry if the message is too long :-)
> >>>>> Alfredo
> >>>
> >>>
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message