asterixdb-dev mailing list archives

From Chen Li <>
Subject Re: Asterix Schema Provider Framework
Date Wed, 30 Dec 2015 19:05:20 GMT
Sounds very interesting.  A basic question about "inference."  Is the
inferred schema unique?  In other words, is it possible to get two
schemas from the same instance, especially considering open types and
closed types?


On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet <> wrote:
> Dear Devs,
> First of all, Happy Holidays :)
> I want to share with you my latest work on AsterixDB, Asterix Schema
> Provider Framework.
> The design document will be shared soon once I fully integrate it with the
> new Asterix Messaging Framework.
> Summary:
> The main aim of the Schema Provider Framework is to help the user to
> understand the schema of the query result.
> Motivation:
> I'm currently working on building an AsterixDB-Spark connector. Spark works
> well with JSON; however, it has to scan the whole result to infer the
> schema. To spare Spark this extra pass, Asterix can infer the schema
> while materializing the result.
> Additionally, Asterix users can get the schema information in a
> Thrift/ADM-like format, which can help them build the classes required to
> deserialize the result in their code.
> Brief description of how it works:
> Once the user asks for the schema to be inferred, the schema builder will
> follow the result printer (APrinterVisitor) to build up the information
> about the record, list and field types. Then it will compute the final
> schema (union) of the resulting output in a single pass.
> User-model:
> For a tentative version of the user model, please check the doc:
> Also see the attached images for screenshots of the web-gui interface
> including the resulting schema.
> Future "Ambitious" Applications:
> One low-hanging-fruit application is to extend the Asterix open/closed
> types to include yet another type called "inferred".
> Inferred types will ask Asterix to build the schema information on
> ingestion. Inferred types can be very helpful, at least when you have a
> schema that looks like one of our datasets (see attached wosType.adm), where
> you can have multiple fields with similar names and different "schemas" or
> nested types.
> An inferred type is a hybrid type (closed and open) that can have the
> flexibility of the open type with performance and a storage footprint
> close to those of the closed type.
> Inferred types are probably a good fit for read-intensive applications. For
> write-intensive ones, where every CPU cycle counts, they can introduce some
> unnecessary overhead. But there is probably a clever solution using
> adaptive sampling techniques.
> I'll investigate this further and share my thoughts later on :-))
> Have a wonderful holiday and happy weekend!
> --
> Regards,
> Wail Alkowaileet
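
Note on the "how it works" paragraph quoted above: the idea of building a
union schema while walking the result once can be illustrated with a small
sketch. This is not the framework's code and does not use APrinterVisitor;
it assumes the result is a stream of JSON-like Python values, and every name
in it (new_schema, observe, infer_schema, the "kinds"/"fields"/"optional"/
"items" layout, and the primitive type names) is hypothetical.

```python
def new_schema():
    # One schema node: the set of kinds seen at this position, plus details
    # for the record and list alternatives (hypothetical layout).
    return {"kinds": set(), "fields": {}, "optional": set(), "items": None}


def primitive_name(value):
    # Map a Python scalar to an illustrative type name (assumed names).
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def observe(schema, value):
    # Fold one value into the schema node, recursing into records and lists.
    if isinstance(value, dict):
        seen_record_before = "record" in schema["kinds"]
        schema["kinds"].add("record")
        if seen_record_before:
            # A field missing on either side is not present in every record.
            schema["optional"] |= schema["fields"].keys() ^ value.keys()
        for name, child in value.items():
            observe(schema["fields"].setdefault(name, new_schema()), child)
    elif isinstance(value, list):
        schema["kinds"].add("list")
        if schema["items"] is None:
            schema["items"] = new_schema()
        for item in value:
            observe(schema["items"], item)
    else:
        schema["kinds"].add(primitive_name(value))
    return schema


def infer_schema(values):
    # Single pass: each value is visited once and never revisited.
    schema = new_schema()
    for value in values:
        observe(schema, value)
    return schema


rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": "x", "name": "n"},
]
schema = infer_schema(rows)
print(sorted(schema["fields"]["id"]["kinds"]))
print(sorted(schema["optional"]))
```

In this sketch each node only accumulates sets, so the same collection of
records yields the same schema regardless of the order in which they are
visited. Whether the actual framework has that property, in particular for
open versus closed types, is not stated in this thread.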
