Moving the topic on non-relational data to this dedicated thread. First a
bit of context based on our use case:
* We want to do ad-hoc analysis of data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. APIs returning complex
object collections.
* Data may change structure over time, e.g. due to implementation
upgrades.
* We want to use high level declarative query languages such as SQL.
Various techniques exist to tackle non-relational data analysis, such as
mapping to a relational schema or running custom code in a distributed compute
cluster (map-reduce, Spark jobs, etc.) on blob data. These have their
drawbacks: the former adds data latency and structure-transformation effort,
and the latter adds query latency and compute cost on blob data.
We built a columnar data store for non-relational data without pre-defined
schema. For querying this data, technologies like Drill come close to
letting us work directly with non-relational data using array and map data
types. However, we feel more can be done to truly make non-relational data
a first-class citizen:
1) functions on arrays and maps -- e.g. sizeOf(person.addresses) where
person.addresses is an array. Using FLATTEN is not the same as working
with complex objects directly.
2) heterogeneous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some values
are now strings. Treating every value as a String is only a workaround.
3) better storage plugin support for complex types -- we had to regenerate
objects from our columnar vectors to hand to Drill, rather than
feeding the vectors directly.
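To make points 1) and 2) a bit more concrete, here is a rough sketch in plain
Python (not Drill, and not our store's API). The sample records and the helper
names size_of and flatten are made up for illustration; only the field names
person.addresses and product.version come from the examples above.

```python
from collections import Counter

# Hypothetical sample records, invented for this sketch.
people = [
    {"name": "Ann", "addresses": [{"city": "Oslo"}, {"city": "Bergen"}]},
    {"name": "Bob", "addresses": []},
]

def size_of(value):
    """Array-level function: length of a list, or None if the value is missing."""
    return None if value is None else len(value)

def flatten(rows, field):
    """FLATTEN-style expansion: emit one output row per array element."""
    for row in rows:
        for element in row.get(field) or []:
            yield {**row, field: element}

# 1) Working on the array directly keeps one result per person.
direct = {p["name"]: size_of(p.get("addresses")) for p in people}

# Flattening and counting afterwards needs a re-grouping step, and in this
# sketch a person with an empty array produces no rows at all.
via_flatten = Counter(r["name"] for r in flatten(people, "addresses"))

# 2) A column whose values started as numbers and later included strings.
versions = [1, 2, "2.1-beta", 3.0]

# Keeping the original types lets a query inspect or filter by type,
# instead of casting every value to a string up front.
type_counts = Counter(type(v).__name__ for v in versions)
numeric_only = [v for v in versions if isinstance(v, (int, float))]

print(direct)
print(dict(via_flatten))
print(dict(type_counts))
print(numeric_only)
```

In the sketch, the direct form reports a size for every person, while the
flatten-then-count form loses the person with no addresses unless extra work
is done. The second half shows the kind of per-value type information that
is lost once everything is treated as a String.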
I don't think any of these are easy to do. Much research and thinking will
be needed for a cohesive solution.
-- Jiang