drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Non-Relational Data
Date Fri, 17 Jan 2020 03:00:20 GMT
As a follow-up, and to avoid rehashing the same solutions we've discussed for some time, here
are just a few:

* Provide a very lightweight schema "hint" only for those columns that need it. In
the example, provide a schema only for the ambiguous version column. If a column tends to have
lots of nulls, tell Drill what type it will be when data actually appears. Arina has a good
start on this approach. (A sketch of what this might look like appears after this list.)


* Populate the above automagically as queries run. (Ted has long advocated this one.)

* Populate the schema by converting existing specs, such as a Swagger spec.

* Make the existing UNION and LIST (repeated UNION) types work fully (whatever that means;
we'd need type conversion rules for all type combinations).

* Add true dynamic types. Drill halfway supports vectors that hold Java objects. Make this
a first-class feature. Auto-convert conflicting fields. Also needs type conversion rules for
all combinations. (Might be handy for special cases, but too heavyweight for the typical
schema-evolution case. Might be an easy way to handle complex structures such as images, complex
JSON structures, etc.)

* (Insert your ideas here.)
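
To make the first idea concrete, here is a rough sketch using the provided-schema
syntax from Arina's work. The table and column names are invented, and the details
may differ from what finally ships:

    -- Hint only the ambiguous column; other columns are still
    -- discovered from the data itself.
    CREATE OR REPLACE SCHEMA (`version` VARCHAR)
    FOR TABLE dfs.sales.`products`;

With the hint in place, every reader on every node agrees that version is VARCHAR,
even for files where it appears as a number or is entirely null.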


Point is, we're not short on ideas. What we need to know is what folks want; then we can pick
the idea that works best.


Thanks,
- Paul

 

    On Thursday, January 16, 2020, 6:48:22 PM PST, Paul Rogers <par0328@yahoo.com.invalid>
wrote:  
 
 Hi Jiang,

Thanks for taking the time to explain your use case. In my experience, the scenario you describe
is not unusual: the desire to integrate application data behind a common SQL interface. IMHO,
this is where Drill differs from the "data lake guys" such as Impala and Presto. It would be
helpful if you could help us get the solution right.


 You have hit upon one of the key challenges in making this model work: the question of how
to use SQL to work with a column that has a varying data type. Your product version use case
is a classic example.


In the non-relational world, types are messy. That's why we have Python: we can write code
that forces the values to one type or another, we can do "if" statements based on types, or
we can use "duck typing" to do dynamic method dispatch.


But, you want the convenience (and speed) of SQL. We want to hide these details somehow.
So we have to be clever.

Hive (then Presto, Impala, and even Drill, when used with HMS) solved the problem by requiring
a schema stored in HMS. Simple idea, but HMS has become heavyweight. With HMS, we can declare
that our version field is a VARCHAR, and conversions can be done at read time (schema-on-read).
Problem solved. But HMS works best for managed sets of files. It does not fit the app integration
use case: reading ad-hoc files, accessing APIs, and so on.

Schemas really help with distributed systems. Suppose I say SELECT a + b; SQL can work out
that a is an INT, b is a BIGINT, and the result should be a BIGINT. Work can be distributed
across many independent nodes and all will make the same type decisions. Data can then be
merged and types will agree. Basic stuff. In fact, types are part of the very definition of
the relational theory on which SQL is based.
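
For instance, given a hypothetical table t whose columns a (INT) and b (BIGINT)
come from a typed source such as Parquet:

    -- Both columns have declared types, so every node resolves
    -- a + b to BIGINT in exactly the same way.
    SELECT a + b AS total
    FROM dfs.example.`t`;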


But, if data types vary (no schema, as in Drill without HMS), things get messy. Drill can
distribute filters. Suppose I say WHERE version = 10. Drillbit 1 reads the INT-valued version
fields, Drillbit 2 reads the VARCHAR-valued fields. How do we make sure that both nodes make
the same decisions?

Later, when data is merged, should there be a common type? What if I say ORDER BY version?
What is the intended result? Sort the INT values before (or after) the VARCHAR values? Convert
VARCHAR to INT (or vice versa)?
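
To make the ambiguity concrete (the table name and data are invented):

    -- Suppose version holds 9 and 10 (INT) in some files, and
    -- '10.1.2' and '9.0' (VARCHAR) in others.
    SELECT * FROM dfs.example.`products`
    WHERE version = 10
    ORDER BY version;

    -- Does the filter match the string '10'? Does the sort compare
    -- numerically or lexically? Without a declared type, each node
    -- may answer differently.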

Another factor is speed. Python is slow because it does dynamic type handling on every operation.
Java and C are fast because they are statically typed. Similarly, Impala is fast because of
static types. Drill tries to be fast by having a fixed vector type for each column. But, Drill
also tries to be flexible, which is when things start to get "interesting."

The crude-but-effective solution, without a schema, is to require users to explicitly include
CAST and other conversions in every query. It seems this was standard Hadoop practice before
HMS came along. But, since you want to use SQL, you presumably want to shield users from this
complexity.
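
For the version example, that per-query workaround looks something like this sketch
(the names are invented, and it assumes the reader can deliver the column as text at
all, e.g. via JSON's all_text_mode option):

    -- Impose the type at the point of use, in every query.
    SELECT name, CAST(version AS VARCHAR) AS version
    FROM dfs.example.`products`
    WHERE CAST(version AS VARCHAR) = '10'
    ORDER BY CAST(version AS VARCHAR);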

See? You have nailed the key challenge we must solve to make the app integration idea work.


We can make up lots of things we *could* do. They pretty much fall into three buckets:

* Give the user or app developer a way to declare the type so all nodes do the same thing.
Declare the type either per-query (with a CAST) or per-table (using some kind of schema or
view; sketched after this list).

* Discover or impose a uniform set of conversion rules that work in all cases. (Always convert
to VARCHAR, say.)

* Embrace dynamic types. Add dynamic types to Drill to make it more Python-like, with "duck
typing" to decide, say, that "+" means different things for different type combinations. Accept
the resulting performance hit.
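
The first bucket, for example, might be as simple as an ordinary view that hides
the conversion (a sketch; the names are invented):

    -- Declare the type once, per table; users query the view and
    -- never see the CAST.
    CREATE VIEW dfs.tmp.`products_v` AS
    SELECT name, CAST(version AS VARCHAR) AS version
    FROM dfs.example.`products`;

    SELECT * FROM dfs.tmp.`products_v` WHERE version = '10';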


Before we brainstorm specific alternatives, it would be very helpful to understand what you
actually *want* to do. What was the intent when the type changed? How do you want users
to work with such data in a SQL context? What would your ideal user experience look like?


Thanks,
- Paul

 

    On Thursday, January 16, 2020, 4:47:01 PM PST, Jiang Wu <jiang.wu@mulesoft.com.invalid>
wrote:  
 
Moving the topic of non-relational data to this dedicated thread.  First,
a bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. APIs returning complex
object collections.
* Data may change structure over time, e.g. due to implementation
upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as
mapping to a relational schema, or running custom code in a distributed compute
cluster (map-reduce, Spark jobs, etc.) on blob data.  Each has drawbacks:
data latency and the effort of structure transformation for the former;
query latency and compute cost over blob data for the latter.

We built a columnar data store for non-relational data without a pre-defined
schema.  For querying this data, technologies like Drill make it almost
possible to work directly with non-relational data using array and map data
types.  However, we feel more can be done to truly make non-relational data
a first-class citizen:

1) Functions on array and map -- e.g. sizeOf(person.addresses), where
person.addresses is an array.  Using FLATTEN is not the same as working
with complex objects directly (see the sketch after this list).
2) Heterogeneous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some are
strings.  Treating every value as a String is a workaround.
3) Better storage plugin support for complex types -- we had to re-generate
objects from our columnar vectors to hand to Drill, rather than
feeding vectors directly.
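
For item 1, what we would like to write vs. what exists today (sizeOf is
hypothetical; REPEATED_COUNT is, I believe, the closest existing Drill
function, and the file name is made up):

    -- Desired: treat the array as a value.
    SELECT sizeOf(t.person.addresses) FROM dfs.example.`people.json` t;

    -- Closest today: count array elements without flattening.
    SELECT REPEATED_COUNT(t.person.addresses) AS address_count
    FROM dfs.example.`people.json` t;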

I don't think any of these are easy to do.  Much research and thinking will
be needed for a cohesive solution.

-- Jiang
    