drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jakub Kaczanowski" <jak...@adam.com.au>
Subject Re: Successful (and not so successful) Production use cases for drill?
Date Thu, 27 Aug 2020 07:02:40 GMT
Hi Paul

Thanks for the further info. This level of detail and warts and all
description is really hard to find for so many tools.

The "provided schema" feature looks interesting, and great that the
header/column names are used to map the data - and the implementation (fail
on data type conversion error) is what I think most people would expect

I guess this goes back to my original question; how are companies using
drill, and who in the company is using it?

Can drill stand alone and be user facing, used directly by analysts, or has
it found more success as a component of a larger pipeline?


It sounds like it's a good fit as a virtual layer over the top of various
data source types for quick and dirty adhoc queries, and as a singular data
source for an ETL process which models the data into a star schema and
pushes it into something that's faster for data viz. Does that sound right?

But yes, a global metadata store that can be used for data
lineage/provenance would be very useful to someone like me!


On 2020/08/25 07:32:06, Paul Rogers <p...@gmail.com> wrote: 
> Hi,> 
> 
> To answer the specific CSV questions:> 
> 
> Drill works with CSV files with or without headers. Without headers,
Drill> 
> use a positional array, which is fragile when columns change. With
headers,> 
> CSV uses column names, and so column order does not matter. Missing
columns> 
> are assumed to be VarChar, and so just show up as blank. Out of the box,> 
> however, Drill makes no attempt to guess column types (everything is> 
> VarChar).> 
> 
> There is a newer "provided schema" feature which can apply a schema to a> 
> CSV file. (Not all formats support the schema, but CSV is one that does.)>

> So, if you know some or all column types, you can spell those out using a>

> CREATE SCHEMA statement (or manually create the file) to handle types.> 
> 
> Drill supports views, with views represented as text (actually, I think> 
> JSON) files. So, no metadata; just extra text added to a SQL query. Any> 
> dependency or lineage information would need to be added on top. (Might
be> 
> a good feature to add to Drill.)> 
> 
> Although Drill has always said it can handle messy data, when you get
right> 
> down to it, Drill is no better (or worse) than any other tool: if you
give> 
> it garbage data (inconsistent column names, inconsistent types, etc.)> 
> you'll get garbage results. There is no magic.> 
> 
> CSV is pretty forgiving by itself. However, if you add types, (e.g. CAST
a> 
> column to an INT), and the data is instead "one", then the query will
fail.> 
> Drill does not attempt to discard the one record and continue. (That is> 
> called "data loss" and is a P1 error, especially if we were to discard
the> 
> one record that really mattered.)> 
> 
> Drill supports partitions (directories). Only queries that touch the file>

> will fail. If your data is in subdirectories, and you only query the> 
> "2020-02" directory, then bad data in the "2020-03" directory won't cause>

> the query to fail.> 
> 
> Interesting idea about doing ETL on each query. (That is, apply data> 
> clean-up rules to each record as read.) The problems are 1) it will be
slow> 
> (and we all know that the only three things people care about are 1)> 
> performance, 2) performance, and 3) performance), and 2) if the rules
miss> 
> anything, the user will just get a cryptic query failure with no good
data> 
> to help figure out which file and record caused the problem. (You can add>

> the into to a log message, but that is hardly user friendly.)> 
> 
> A hybrid solution is to load your data using Spark (or, at small scale,> 
> Python) to clean up the wacky formats. Standardize into a single file> 
> format. That is, classic ETL. For performance, that file wants to be> 
> Parquet, not CSV. (That is, if you have to do clean-up, use that> 
> opportunity to optimize for query performance on subsequent reads.) Then,>

> you can craft views on top of your clean data.> 
> 
> Drill provides both ODBC and JDBC drivers. Just about any tool that> 
> supports those interfaces should work with Drill.> 
> 
> Thanks,> 
> 
> - Paul> 
> 
> 
> On Mon, Aug 24, 2020 at 9:51 AM <hello@augerdata.com.au> wrote:> 
> 
> > Ha yes, big data is not quick data, especially if we’re talking about
the> 
> > psychological sub 5-8sec time limit to render a viz (so getting the
data> 
> > AND> 
> > making it pretty). Any longer and the experience is poor. Most MPP> 
> > platforms> 
> > take longer to compile the query plan like you say!> 
> >> 
> > As a developer/analyst/integrator I want to easily and cheaply ingest
flat> 
> > files (csvs) and relational data sources into a single relational layer>

> > that> 
> > supports complex SQL so I can write complex transformation views over
the> 
> > top with the option to materialise data for performance> 
> >> 
> > So there are 5 components;> 
> > Easily get disparate data source types (flat file/rdbms) into a single> 
> > 'logical' layer> 
> > Have a thick sql client gui app (not web based, but an installed .exe)
that> 
> > has support for the drill and some of the metadata operations so I can> 
> > quickly query the data, develop views, investigate technical issues,
bulk> 
> > generate DDL etc> 
> > OK speed - not real time data viz rendering speed, but waiting 30 sec
for a> 
> > count (*) from local1000row.csv is crap> 
> > Materialising data for improved performance - does drill have a
preferred> 
> > internal 'data format'? Or push data to a RDBMS via one of the plugins.
Is> 
> > this slow for bulk data volumes or does it leverage native bulk insert> 
> > syntax rather than piping individual single-row "insert into"
statements?> 
> > Bulk reading data out of drill via odbc/jdbc (eg to import into a in> 
> > memory/columnstore db for better data viz query responses)> 
> >> 
> >> 
> > I’m more focused on the traditional financial and business operational
data> 
> > rather than massive web logs or raw streaming data> 
> > Sources would be single files and tables in an rdbms up to 10M rows. A> 
> > total> 
> > “database” of under 1B rows.> 
> > REST APIs aren’t a priority> 
> >> 
> > How forgiving is the csv parser?> 
> > CSVs that slightly change month to month are my biggest manual time
sink.> 
> > Changes that to a human are trivial/inconsequential but fatal to a> 
> >> 
> > Eg can I have a "table" schema configured for a directory that has a
number> 
> > of csvs> 
> > Orders202001.csv has columns |OrderID|ProductCode|Quantity|Date|> 
> > Orders202002.csv has columns |OrderID|ProductCode|Date|Quantity|> 
> >> 
> > A single schema definition over the top of these two files should work> 
> > correctly for both - the ordinal position in the individual csvs is> 
> > dynamically mapped to the schema based on the csv header column> 
> >> 
> > The same applies to extra columns and missing columns> 
> >> 
> > Orders202003.csv has columns> 
> > |OrderID|ProductCode|SupplierCode|Quantity|Date|> 
> > Orders202004.csv has columns |ProductCode|Quantity|Date|> 
> >> 
> > Extra columns are ignored and missing columns show as null Throw a> 
> > 'warning'> 
> > or an option to set a column as mandatory in which case throw an error> 
> >> 
> > How does drill handle the situation where there are multiple csvs in a> 
> > directory and one fails but the rest are ok. Is the whole table offline?
Do> 
> > all selects fail or does it show what it knows and throws a warning?> 
> >> 
> > I've written a c# csv handler like above and use for ETLing into a> 
> > relational dbs when required. It saves so much time.> 
> >> 
> >> 
> > Is there a 3rd party SQL query tool that plays nicely with drill?> 
> >> 
> >> 
> > I do a lot of funky SQL with views on views and CTEs etc etc. How
accurate> 
> > is the dependency metadata? Would I be able to generate object level> 
> > (view/table) data lineage/dependency data?> 
> >> 
> > As an aside – I’ve seen some of the threads om the mailing list about> 
> > writing a generic rest plugin. I’ve previously used the CDATA -> 
> > https://www.cdata.com/drivers/rest/odbc/ (worth a download of the trial
to> 
> > check out for ideas imho) especially mapping output data and uri params>

> > http://cdn.cdata.com/help/DWF/odbc/pg_customschemacolumns.htm> 
> >> 
> > On 2020/08/21 04:55:54, Paul Rogers <p...@gmail.com> wrote:> 
> > > Hi, welcome to Drill.>> 
> > >> 
> > > In my (albeit limited) experience, Drill has a particular sweet spot:>

> > data>> 
> > > large enough to justify a distributed system, but not so large as to>>

> > > overtax the limited support Drill has for huge deployments.> 
> > Self-describing>> 
> > > data is good, but not data that is dirty or with inconsistent format.>

> > Drill>> 
> > > is good to grab data from other systems, but only if those systems
have>> 
> > > some way to "push" operations via a system-specific query API (and> 
> > someone>> 
> > > has written a Drill plugin.)>> 
> > >> 
> > > Drill tries to be really good with Parquet: but that is not a
"source">> 
> > > format; you'll need to ETL data into Parquet. Some have used Drill
for> 
> > the>> 
> > > ETL, but that only works if the source data is clean.>> 
> > >> 
> > > One of the biggest myths around big data is that you can get> 
> > interactive>> 
> > > response times on large data sets. You are entirely at the mercy of
I/O>> 
> > > performance. You can get more, but it will cost you. (In the "old
days"> 
> > by>> 
> > > having a very large number of disk spindles; today by having many
nodes>> 
> > > pull from S3.)>> 
> > >> 
> > > As your data size increases, you'll want to partition data (which is
as>> 
> > > close to indexing as Drill and similar tools get.) But, as the number>

> > of>> 
> > > partitions (or, for Parquet, row groups) increases, Drill will spend> 
> > more>> 
> >> 
> > > time figuring out which partitions & row groups to scan than it
spends>> 
> > > scanning the resulting files. The Hive Metastore tries to solve this,>

> > but>> 
> >> 
> > > has become a huge mess with its own problems.>> 
> > >> 
> > > From what I've seen, Drill works best somewhere in the middle: larger>

> > than>> 
> > > a set of files on your laptop, smaller than 10's of K of Parquet
files.>> 
> > >> 
> > > Might be easier to discuss *your* specific use case rather than
explain> 
> > the>> 
> > > universe of places where Drill has been used.>> 
> > >> 
> > > To be honest, I guess my first choice would be to run in the cloud> 
> > using>> 
> > > tools available from Amazon, DataBricks or Snowflake if you have a>>

> > > reasonably "normal" use case and just want to get up and running> 
> > quickly.>> 
> >> 
> > > If the use case turns out to be viable, you can find ways to reduce> 
> > costs>> 
> >> 
> > > by replacing "name brand" components with open source. But, if you> 
> > "failed>> 
> > > fast", you did so without spending much time at all on plumbing.>> 
> > >> 
> > > Thanks,>> 
> > >> 
> > > - Paul>> 
> > >> 
> > >> 
> > > On Thu, Aug 20, 2020 at 9:02 PM <he...@augerdata.com.au> wrote:>>

> > >> 
> > > > Hi all,>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > Can some of the users that have deployed drill in production,
whether>> 
> > > > small/medium and enterprise firms, share the use cases and> 
> > experiences?>> 
> >> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > What problems was drill meant to solve?>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > Was it successful?>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > What was/is drill mostly used for at your corporation?>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > What was tried but wasn't taken up by users?>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > Has it found a niche, or a core group of heavy users? What are
their> 
> > roles?>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > I've been working in reporting, data warehousing, business> 
> > intelligence,>> 
> > > > data engineering(?) (the name of the field seems to rebrand every 5
or> 
> > so>> 
> > > > years - or the lifecycle of 2 failed enterprise data projects - but>

> > that's>> 
> > > > a>> 
> > > > theory for another time) for a bit over 15 years now and for the
last 5> 
> > or>> 
> > > > so have been trying to understand why 70-80% of projects never> 
> > achieve>> 
> > > > their>> 
> > > > aims. It doesn't seem to matter if they're run by really smart
(and>> 
> > > > expensive!) people using best in class tools and processes. Their>

> > failure>> 
> > > > rate might be closer to the 70%, but that's still pretty terrible>>

> > > >>> 
> > > >>> 
> > > >>> 
> > > > I have a couple theories as to why and have tested them over the
last 5> 
> > or>> 
> > > > so years>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > One part is reducing the gap between project inception and
production>> 
> > > > quality data output. Going live quickly creates enthusiasm + a> 
> > feedback>> 
> >> 
> > > > loop>> 
> > > > to iterate the models which in turn creates a sense of engagement>>

> > > >>> 
> > > >>> 
> > > >>> 
> > > > Getting rid of a thick ETL process that takes months or more of dev>

> > and>> 
> >> 
> > > > refactoring before hitting production is one component. Using ~70%
of> 
> > the>> 
> > > > project resources on the plumbing - leaving very little for the> 
> > complex>> 
> >> 
> > > > data>> 
> > > > model iterations - just creates a tech demo not a commercially
useful>> 
> > > > solution.  I don't think this is a technology problem, and applies>

> > whether>> 
> > > > using traditional on prem etl tools or the current data engineering>

> > scripts>> 
> > > > and cron jobs but in the cloud>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > The least unsuccessful data engineering approach I've seen is the
ELT>> 
> > > > logical data mart pattern; landing the source data as close to a
1:1> 
> > format>> 
> > > > as possible into a relational-like data store and leveraging MPP
dbs> 
> > via>> 
> > > > views and CTASes to create a conformed star schema. Then using the>

> > star>> 
> >> 
> > > > schemas as building blocks create the complex (and actually useful)>

> > models.>> 
> > > > Something like this can be up in a few weeks and still cover the> 
> > majority>> 
> > > > of>> 
> > > > user facing features a full data pipeline/ETL would have (snapshots
+>> 
> > > > transactional facts, inferred members, type 1 dims only - almost> 
> > everyone>> 
> > > > double joins a type 2 dim to get the current record anyway). While>

> > they>> 
> >> 
> > > > aren't always (or even usually) 100% successes they at least have>

> > something>> 
> > > > useful or just fail quickly which is useful in itself>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > The first part of this - getting all the data into a single spot,>

> > still>> 
> >> 
> > > > sucks and is probably more fiddly than 10 years ago because it's
all> 
> > flat>> 
> > > > files and apis now vs on premise db->db transfers>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > > This is where I *think* drill might help me, but just want to check
if> 
> > this>> 
> > > > is how it's actually being used by others. It would be nice if it>

> > could>> 
> >> 
> > > > replace the MPP altogether..>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > > >>> 
> > >> 
> >> 
> >> 
> 


Mime
View raw message