drill-user mailing list archives

From Jiang Wu <jiang...@mulesoft.com.INVALID>
Subject Re: Non-Relational Data
Date Fri, 17 Jan 2020 19:24:37 GMT
Hi Paul.  Thank you for all the analysis.  Let me digest the information
and I will come back with my views.

— Jiang

On Fri, Jan 17, 2020 at 11:02 AM Paul Rogers <par0328@yahoo.com.invalid> wrote:

> Hi All,
> One more follow up. Perhaps there are two cases here: 1) "normal" columns
> which have evolved over time from one type to another, and 2) data which is
> inherently variable and non-relational.
> One can make the case that the power/convenience of SQL comes from having
> a known set of types for "normal" columns, such as the product version
> example we discussed. The "data lake guys" use ETL (usually into Parquet)
> and HMS schemas to enforce and define the types. Clearly works for the data
> lake use case. Impala & Hive were successful for many years using this
> approach. Even Amazon does the same with Lambdas, Glue, etc.
> For Drill, we're talking data integration outside of a data lake. One can
> argue that known types are still useful. So, perhaps we need a light-weight
> way, perhaps part of the storage plugin configuration, to resolve
> ambiguities. That is, tell Drill, "infer the type of most columns, because
> they are obvious. But, for product version, treat it as a VARCHAR."
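> Today the closest built-in control is an all-or-nothing session option; the
> per-column hint proposed above does not exist yet. A rough illustration of
> the gap, using Drill's real all_text_mode switch (the "typeHints" key in the
> comment is invented here, purely for illustration):
>
>     -- Existing blunt instrument: read EVERY JSON scalar as VARCHAR
>     ALTER SESSION SET `store.json.all_text_mode` = true;
>
>     -- The proposal: keep inference for most columns, pin only the
>     -- ambiguous one, e.g. "typeHints": { "product_version": "VARCHAR" }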
> Then, there is the separate problem of types which are inherently
> non-relational: arbitrary JSON structures, say. Here, it is unclear what it
> means to use SQL to process data which contains a Customer in one row, an
> Order in the next, and a Blog Post in the third. The key is to understand
> real use cases to see what is done in actual practice since the naive
> conclusion would be that SQL is not a good tool for such data.
> For the users out there, can you explain how you'd like this to work for
> your application?
> Thanks,
> - Paul
>     On Thursday, January 16, 2020, 07:00:20 PM PST, Paul Rogers <
> par0328@yahoo.com> wrote:
>  As a follow-up, to avoid us rehashing the same solutions we've discussed
> for some time, here are just a few:
> * Provide a very light-weight schema "hint" only for those columns that
> need the hint. In the example, only provide schema for the ambiguous
> version column. If a column tends to have lots of nulls, tell Drill what
> type it will be when data actually appears. Arina has a good start on this
> approach. (A sketch appears just below this list.)
> * Populate the above automagically as queries run. (Ted has long advocated
> this one.)
> * Populate the schema by converting existing specs, such as a Swagger spec.
> * Make the existing UNION and LIST (repeated UNION) types work fully
> (whatever that means; we'd need type conversion rules for all type
> combinations.)
> * Add true dynamic types. Drill half-way supports vectors that hold Java
> objects. Make this a first-class feature. Auto-convert conflicting fields.
> Also needs type conversion rules for all combinations. (Might be more handy
> for special cases, too heavy-weight for the typical schema-evolution case.
> Might be an easy way to handle complex structures such as images, complex
> JSON structures, etc.)
> * (Insert your ideas here.)
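> To make the first idea concrete, here is a sketch of what a minimal hint
> might look like, loosely modeled on the provided-schema work mentioned
> above (the exact syntax and the table path are illustrative, not final):
>
>     -- Pin only the ambiguous column; every other column stays inferred
>     CREATE OR REPLACE SCHEMA (product_version VARCHAR)
>     FOR TABLE dfs.`/data/events`;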
> Point is, we're not short on ideas. What we need to know is what folks
> want; then we can pick the idea that works best.
> Thanks,
> - Paul
>     On Thursday, January 16, 2020, 6:48:22 PM PST, Paul Rogers
> <par0328@yahoo.com.invalid> wrote:
>  Hi Jiang,
> Thanks for taking the time to explain your use case. In my experience, the
> scenario you describe is not unusual: the desire to integrate application
> data behind a common SQL interface. IMHO, this is where Drill differs from
> the "data lake guys" such as Impala and Presto. Would be helpful if you can
> help us get the solution right.
>  You have hit upon one of the key challenges in making this model work:
> the question of how to use SQL to work with a column that has a varying
> data type. Your product version use case is a classic example.
> In the non-relational world, types are messy. That's why we have Python:
> we can write code that forces the values to one type or another, we can do
> "if" statements based on types, or we can use "duck typing" to do dynamic
> method dispatch.
> But, you want to use the convenience (and speed) of SQL. We want to hide
> these details somehow. So we have to be clever.
> Hive (then Presto, Impala and even Drill, when used with HMS) solved the
> problem by requiring a schema stored in HMS. Simple idea, but HMS has
> become heavyweight. With HMS, we can declare that our version field is a
> VARCHAR and conversions can be done at read time (schema-on-read.) Problem
> solved. HMS works best for files. It does not work for the app integration
> use case: reading data from ad-hoc files, accessing APIs and so on.
> Schemas really help with distributed systems. Suppose I say SELECT a + b;
> SQL can work out that a is an INT, b is a BIGINT, and the result should be a
> BIGINT. Work can be distributed across many independent nodes and all will
> make the same type decisions. Data can then be merged and types will agree.
> Basic stuff. In fact, types are part of the very definition of the
> relational theory on which SQL is based.
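> In code form (a trivial sketch; the table and column names are made up):
>
>     SELECT a + b FROM t;
>     -- With a known schema (a INT, b BIGINT), every Drillbit independently
>     -- promotes a to BIGINT and produces a BIGINT sum, so merging is safe.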
> But, if data types vary (no schema, as in Drill without HMS), things get
> messy. Drill can distribute filters. Suppose I say WHERE version = 10.
> Drillbit 1 reads the INT-valued version fields, Drillbit 2 reads the
> VARCHAR valued fields. How do we make sure that both nodes make the same
> decisions?
> Later, when data is merged, should there be a common type? What if I say
> ORDER BY version. What is the intended result? Sort the INT values before
> (or after) VARCHAR? Convert VARCHAR to INT (or vice versa)?
> Another factor is speed. Python is slow because it does dynamic type
> handling on every operation. Java and C are fast because they are
> statically typed. Similarly, Impala is fast because of static types. Drill
> tries to be fast by having a fixed vector type for each column. But, Drill
> also tries to be flexible, which is when things start to get "interesting."
> The crude-but-effective solution, without a schema, is to require users to
> explicitly include the CAST and other statements in every query. It seems
> this was standard Hadoop practice before HMS came along. But, since you
> want to use SQL, you presumably want to shield users from this complexity.
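> For example, with the mixed-type version column, every query must carry its
> own conversions (the file path here is hypothetical):
>
>     SELECT name, CAST(version AS VARCHAR) AS version_str
>     FROM dfs.`/data/products.json`
>     WHERE CAST(version AS VARCHAR) = '10'
>     ORDER BY version_str;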
> See? You have nailed the key challenge we must solve to make the app
> integration idea work.
> We can make up lots of things we *could* do. They pretty much fall into
> three buckets:
> * Give the user or app developer a way to declare the type so all nodes do
> the same thing. Declare the type either per-query (with a CAST) or
> per-table (using some kind of schema or view; a view-based sketch follows
> this list.)
> * Discover or impose a uniform set of conversion rules that work in all
> cases. (Always convert to VARCHAR, say.)
> * Embrace dynamic types. Add dynamic types to Drill to make it more
> Python-like, with "duck typing" to decide, say, that "+" means different
> things for different type combinations. Accept the resulting performance
> hit.
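> A sketch of the per-table flavor of the first bucket: an ordinary Drill
> view that hides the CASTs once, so users never see them (the table paths
> are hypothetical):
>
>     CREATE VIEW dfs.tmp.`products_v` AS
>     SELECT name, CAST(version AS VARCHAR) AS version
>     FROM dfs.`/data/products.json`;
>
>     -- Users then query the view with no per-query ceremony:
>     SELECT * FROM dfs.tmp.`products_v` WHERE version = '10';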
> Before we brainstorm specific alternatives, it would be very helpful to
> understand what you actually *want* to do. What was the intent when the
> type changed? How do you want users to work with such data in a SQL
> context? How would your ideal user experience look?
> Thanks,
> - Paul
>     On Thursday, January 16, 2020, 4:47:01 PM PST, Jiang Wu
> <jiang.wu@mulesoft.com.invalid> wrote:
>  Moving the topic on non-relational data to this dedicated thread.  First a
> bit of context based on our use case:
> * We want to do ad-hoc analysis of data coming from diverse sources like
> APIs, document stores, and relational stores.
> * Data are not limited to relational structures, e.g. an API returning
> complex object collections.
> * Data may change its structure over time, e.g. due to implementation
> upgrades.
> * We want to use high level declarative query languages such as SQL.
> Various techniques exist to tackle non-relational data analysis, such as
> mapping the data to a relational schema, or running custom code in a
> distributed compute cluster (map-reduce, Spark jobs, etc.) on blob data.
> Each has its drawbacks: schema mapping adds data latency and
> structure-transformation effort, while computing on blob data adds query
> latency and cost.
> We built a columnar data store for non-relational data without pre-defined
> schema.  For querying this data, technologies like Drill made it almost
> possible to directly work with non-relational data using array and map data
> types.  However, we feel more can be done to truly make non-relational data
> a first class citizen:
> 1) functions on array and map -- e.g. sizeOf(person.addresses) where
> person.addresses is an array.  Using FLATTEN is not the same as working
> with complex objects directly (see the sketch after this list).
> 2) heterogeneous types -- better handling of heterogeneous data types within
> the same column, e.g. product.version started as numbers, but some are
> strings.  Treating every value as a String is a workaround.
> 3) better storage plugin support for complex types -- we had to regenerate
> objects from our columnar vectors to hand to Drill, rather than feeding the
> vectors directly.
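> To illustrate item 1: Drill does ship a few such functions today, e.g.
> REPEATED_COUNT for array sizes, but coverage is thin next to working with
> the objects directly (the file path here is hypothetical):
>
>     -- What exists: one specialized function per operation
>     SELECT REPEATED_COUNT(t.person.addresses) AS address_count
>     FROM dfs.`/data/people.json` t;
>
>     -- What we'd like: general complex-value functions, e.g.
>     -- sizeOf(person.addresses), plus map/filter over arrays.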
> I don't think any of these are easy to do.  Much research and thinking will
> be needed for a cohesive solution.
> -- Jiang
