drill-user mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: [DISCUSSION] current project state
Date Tue, 14 Aug 2018 03:37:20 GMT
I’d like to weigh in here as well. As a long-time user of Drill, I really would like to see
more people using it and I think there are a few key aspects that could really help on that
front. 

The first is the Arrow integration.  I’m not enough of a software engineer to understand
all the internal details here, but as I understand it, the promise of Arrow is that many tools
will share a common memory model and that it will be possible to transfer data from one tool
to the other without having to serialize/deserialize the data.  In the data science community
many of the major platforms (Python-pandas, R, and Spark) are moving to Arrow or have already adopted it.
 
Drill’s strength is the ease with which it can query many different data sources, and if Drill
were to adopt Arrow, I suspect that many people would adopt it as a part of a machine learning
pipeline.  Just recently, I attempted to do some data manipulation using Spark, and couldn’t
help but notice how difficult it was in contrast with Drill. I’m sure adopting Arrow is a very complex
task, but I do think that it could be worth it in the end. 
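
To make the shared-memory idea concrete, here is a minimal sketch of the difference between handing data over by serializing it and handing it over as a view onto the same buffer. It uses only the Python standard library (array, pickle, memoryview) as a stand-in. It is not Arrow and not Drill code, and the variable names are made up for illustration.

```python
from array import array
import pickle

# Producer "tool": one column of 64-bit floats in a single contiguous buffer.
column = array("d", [1.5, 2.5, 3.5, 4.5])

# Serialize/deserialize path: the consumer ends up with an independent copy.
copied = pickle.loads(pickle.dumps(column))

# Shared-memory path: the consumer gets a view onto the same bytes, no copy.
shared = memoryview(column)

# A write by the producer is visible through the shared view,
# but not in the copy that went through serialization.
column[0] = 99.0
shared_first = shared[0]
copied_first = copied[0]
```

The view costs nothing to create regardless of column size, while the pickled copy costs time and memory proportional to the data. As I understand it, Arrow aims for the first kind of handoff between tools by agreeing on the buffer layout up front.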

Secondly, I’d like to second Paul’s call to simplify the interfaces for UDFs, format plugins and,
ideally, storage plugins.  A core strength of Drill is its extensibility, and making it easier
to extend would be a great thing.  I was wondering whether it would be possible, or even a good idea,
to enable users to write UDFs in a scripting language such as Python. 

Thirdly, I would really like to see us add more functionality to Drill.  @Arina, your work to build
a storage plugin for ElasticSearch is really great and I think more capabilities like that
are really needed.  I’d like to see a generic HTTP storage plugin and a storage plugin for
Google Sheets.  If I can figure out how storage plugins work, I’ll gladly work on some of
these. 

Just my .02.
— C





> On Aug 13, 2018, at 21:21, Paul Rogers <par0328@yahoo.com.INVALID> wrote:
> 
> Hi Arina,
> 
> Another topic would be whether/how to round out Drill's data model. Drill's scalar and
nullable types are pretty solid. Great work was done recently for Decimal (though the old
types still remain.) Good support is now available for nested types to do implicit joins to
produce SQL-friendly flat records. 
> But, opportunities for improvement still remain. Date/Time has timezone issues. Union,
List and Repeated List never quite worked. There are a few types identified in the code, but
not implemented (dates with TZ, tiny ints, etc.). How should Drill bridge the gap from arrays
and maps (really, structs) on the one hand, and plain-old-relational ODBC/JDBC/BI tools on
the other?
> 
> Would be good to finalize the data types and their mapping to plain SQL: either keep
a type and make it fully work if it has holes, or drop it. Unions and Lists are the messiest.
They are incomplete, in part because they are trying to do the impossible: to predict the
future well enough that Drill can handle columns with varying or ambiguous data types (that
is, to handle schema changes.) Is there a better way to handle this issue (such as with metadata
hints)? That is, rather than fight with conflicting types at run time, simply declare the
common type in metadata so all operators and record batches agree on the type.
> 
> And, of course, there is the lingering issue of Drill vectors vs. Arrow. Arrow did great
work in metadata, but seems to have kept some of the awkward aspects of Drill's original memory
model (lack of control over batch sizes, ability to fragment memory.) Might there be a resyncing
of the two projects: Drill picks up Arrow's metadata and APIs, Arrow picks up Drill's memory
improvements, such as the size-limiting "result set loader" framework?
> 
> Big-picture issues such as this tend to get lost in the 2270 open Jira tickets. How might
the project create some "theme" tickets (or Wiki pages or whatever) to help pull the main
issues out of the wealth of detail in Jira?
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers <par0328@yahoo.com>
wrote:  
> 
> Hi Arina,
> 
> Thanks for launching this discussion. A few minor suggestions.
> 
> The developers have done a fantastic job stabilizing and improving Drill's core functionality.
Now the opportunity is to expand the use cases for Drill so that it gets wider adoption within
the community. Drill competes for mindshare with Impala, Presto, Hive, Spark and others. A
key differentiator for Drill can be the ability to extend the core and integrate Drill into
user applications. Of these tools, only Spark has a fully extensible model. Can Drill provide
some of the flexibility that has powered Spark to success?
> 
> 1. You mentioned the metastore is under active investigation. Anything yet to share?
Didn't see any activity on the JIRA ticket. Metadata is a key gap in Drill. Simply adding
a Hive-like metastore would repeat the very errors that Drill was meant to address. Maybe
we can toss around ideas for a metadata API that provides greater flexibility.
> 
> 2. Users can extend the core with custom UDFs, storage engines, formats and so on. At
present, the code to do this is rather hard to write, debug and maintain. Is there value in
streamlining those interfaces so that a wider audience can extend Drill for their specific
needs?
> 
> 3. Similarly, we've seen interest in integrating Drill with other systems, which suggests
an opportunity for improved APIs. Ability to associate options, defaults and restrictions
with users. Ability to use the REST API for larger data sets and with stateful session options.
And so on.
> 
> Such extensions are best guided by user demands: what can Drill provide for production
applications to enable simpler/faster/more complete integration?  
> 
> Thanks,
> 
> - Paul
> 
> 
> 
>    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva <arina@apache.org>
wrote:  
> 
> Hi all,
> 
> As a new PMC Chair I would like to thank users for choosing and using
> Apache Drill and contributors /  committers for making improvements and
> fixes. Recently Apache Drill 1.14 was released bundled up with many
> improvements and new features. Please feel free to try it out and share
> your experience. As always we would love to hear your success stories of
> using Apache Drill.
> 
> Also I encourage users to share any problems found in Drill, as well as any
> suggestions for future improvements. Feel free to start discussion on the
> mailing list and then file a Jira with the summary. Contributions are
> always welcome: minor, major, doc improvements or grammar fixes. Just file
> a Jira and open the PR. Do not hesitate to ping developers on the mailing
> list if a PR is not being reviewed in a timely manner.
> 
> Latest project reports show:
> The Apache Drill project has a healthy release schedule, and each release includes
> lots of features.
> The mailing lists (user / dev) are getting substantial support from the active
> developers, as are Stackoverflow and Twitter.
> New committers are added on a steady basis.
> 
> Overall, the project is growing and moving forward. There were discussions
> about Drill 2.0 last year, and currently the Drill metastore feature is under
> active investigation, which might be the breaking change for 2.0.
> 
> Please feel free to reply to this email with your comments / concerns /
> ideas about current project state.
> 
> Kind regards,
> Arina

