drill-user mailing list archives

From Paul Rogers <par0...@gmail.com>
Subject Re: Successful (and not so successful) Production use cases for drill?
Date Fri, 21 Aug 2020 04:55:54 GMT
Hi, welcome to Drill.

In my (albeit limited) experience, Drill has a particular sweet spot: data
large enough to justify a distributed system, but not so large as to
overtax the limited support Drill has for huge deployments. Self-describing
data is good, but not data that is dirty or with inconsistent format. Drill
is good to grab data from other systems, but only if those systems have
some way to "push" operations via a system-specific query API (and someone
has written a Drill plugin.)

Drill tries to be really good with Parquet, but that is not a "source"
format; you'll need to ETL data into Parquet. Some have used Drill for the
ETL, but that only works if the source data is clean.

One of the biggest myths around big data is that you can get interactive
response times on large data sets. You are entirely at the mercy of I/O
performance. You can get more throughput, but it will cost you. (In the "old
days" by having a very large number of disk spindles; today by having many
nodes pull from S3.)
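
To make the I/O point concrete, here is a rough back-of-envelope sketch. It
assumes a full scan whose time is bounded by aggregate read throughput and
ignores planning, CPU and network contention. The function name and the
numbers in the usage note are hypothetical, not measurements from any real
Drill cluster.

```python
def scan_seconds(data_gb: float, nodes: int, mb_per_sec_per_node: float) -> float:
    """Lower bound on the time to read data_gb once across the cluster.

    Assumes perfectly parallel reads and counts 1 GB as 1000 MB.
    All inputs are illustrative assumptions, not Drill settings.
    """
    if nodes <= 0 or mb_per_sec_per_node <= 0:
        raise ValueError("nodes and mb_per_sec_per_node must be positive")
    total_mb = data_gb * 1000
    return total_mb / (nodes * mb_per_sec_per_node)
```

Under these assumptions, 1 TB on 10 nodes reading 100 MB/s each needs at
least scan_seconds(1000, 10, 100) = 1000 seconds. Getting that down to 100
seconds takes ten times the nodes, which is the "it will cost you" part.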

As your data size increases, you'll want to partition data (which is as
close to indexing as Drill and similar tools get.) But, as the number of
partitions (or, for Parquet, row groups) increases, Drill will spend more
time figuring out which partitions & row groups to scan than it spends
scanning the resulting files. The Hive Metastore tries to solve this, but
has become a huge mess with its own problems.
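
As a minimal sketch of what partition pruning amounts to, the snippet below
filters a list of file paths by their directory components. The key=value
directory layout and both function names are assumptions made for
illustration; this is not Drill's planner or its API. Note that the pruning
step itself visits every path, which is why planning time grows with the
number of partitions.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Extract key=value components from a path (hypothetical layout)."""
    values: Dict[str, str] = {}
    for part in path.split("/"):
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the files whose partition values match every wanted key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]
```

For example, pruning a list of paths with {"year": "2020", "month": "08"}
keeps only the files under year=2020/month=08 and skips the rest without
opening them.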

From what I've seen, Drill works best somewhere in the middle: larger than
a set of files on your laptop, smaller than tens of thousands of Parquet
files.

Might be easier to discuss *your* specific use case rather than explain the
universe of places where Drill has been used.

To be honest, I guess my first choice would be to run in the cloud using
tools available from Amazon, Databricks or Snowflake if you have a
reasonably "normal" use case and just want to get up and running quickly.
If the use case turns out to be viable, you can find ways to reduce costs
by replacing "name brand" components with open source. But, if you "failed
fast", you did so without spending much time at all on plumbing.

Thanks,

- Paul


On Thu, Aug 20, 2020 at 9:02 PM <hello@augerdata.com.au> wrote:

> Hi all,
>
>
>
> Can some of the users that have deployed drill in production, whether at
> small, medium or enterprise firms, share their use cases and experiences?
>
>
>
> What problems was drill meant to solve?
>
>
>
> Was it successful?
>
>
>
> What was/is drill mostly used for at your corporation?
>
>
>
> What was tried but wasn't taken up by users?
>
>
>
> Has it found a niche, or a core group of heavy users? What are their roles?
>
>
>
>
>
> I've been working in reporting, data warehousing, business intelligence,
> data engineering(?) (the name of the field seems to rebrand every 5 or so
> years - or the lifecycle of 2 failed enterprise data projects - but that's
> a theory for another time) for a bit over 15 years now and for the last 5
> or so have been trying to understand why 70-80% of projects never achieve
> their aims. It doesn't seem to matter if they're run by really smart (and
> expensive!) people using best in class tools and processes. Their failure
> rate might be closer to 70%, but that's still pretty terrible.
>
>
>
> I have a couple theories as to why and have tested them over the last 5 or
> so years
>
>
>
> One part is reducing the gap between project inception and production
> quality data output. Going live quickly creates enthusiasm + a feedback
> loop to iterate the models which in turn creates a sense of engagement
>
>
>
> Getting rid of a thick ETL process that takes months or more of dev and
> refactoring before hitting production is one component. Using ~70% of the
> project resources on the plumbing - leaving very little for the complex
> data model iterations - just creates a tech demo, not a commercially
> useful solution. I don't think this is a technology problem, and it
> applies whether using traditional on-prem ETL tools or the current data
> engineering scripts and cron jobs but in the cloud.
>
>
>
> The least unsuccessful data engineering approach I've seen is the ELT
> logical data mart pattern; landing the source data as close to a 1:1 format
> as possible into a relational-like data store and leveraging MPP dbs via
> views and CTASes to create a conformed star schema. Then using the star
> schemas as building blocks create the complex (and actually useful) models.
> Something like this can be up in a few weeks and still cover the majority
> of user facing features a full data pipeline/ETL would have (snapshots +
> transactional facts, inferred members, type 1 dims only - almost everyone
> double joins a type 2 dim to get the current record anyway). While they
> aren't always (or even usually) 100% successes they at least have something
> useful or just fail quickly which is useful in itself
>
>
>
> The first part of this - getting all the data into a single spot - still
> sucks and is probably more fiddly than 10 years ago because it's all flat
> files and apis now vs on premise db->db transfers
>
>
>
> This is where I *think* drill might help me, but just want to check if this
> is how it's actually being used by others. It would be nice if it could
> replace the MPP altogether..
>
>
>
>
