drill-user mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: About integration of drill and arrow
Date Fri, 10 Jan 2020 20:30:42 GMT
Hi Igor,
Thanks for your thoughts.  I'm a little swamped today, but I'll send a response over the weekend.
Would you and your team in Ukraine be interested in a virtual get-together
to discuss further?  I'm on Eastern time in the US, so it's a little more convenient
than California time.
Thanks,
-- C






> On Jan 10, 2020, at 6:47 AM, Igor Guzenko <ihor.huzenko.igs@gmail.com> wrote:
> 
> ---------- Forwarded message ---------
> From: Igor Guzenko <ihor.huzenko.igs@gmail.com>
> Date: Fri, Jan 10, 2020 at 1:46 PM
> Subject: Re: About integration of drill and arrow
> To: dev <dev@drill.apache.org>
> 
> 
> Hello Drill Developers and Drill Users,
> 
> This discussion started as a migration to Arrow but uncovered questions about
> the strategic plans for moving towards Apache Drill 2.0.
> Below are my personal thoughts on what we, as developers, should do to
> offer Drill users a better experience:
> 
> 1. High-performance bulk insertion into as many data sources as possible.
> There is a whole bunch of different tools for data pipelining...
> But why should people who know SQL spend time learning something new
> simply to move data between tools?
> 
> 2. Improve the efficiency of memory management (EVF, resource management,
> improved cost planning using the metastore, etc.). Since we're dealing with
> big data alongside other tools installed on data nodes, we should use
> memory very economically and effectively.
> 
> 3. Make integration with all other tools and formats as stable as possible.
> The high number of bugs in this area tells us that we have a lot to improve.
> Every user is happy when they get a tool and it simply works as expected.
> Also, analyze user requirements and provide integration with the newest,
> most popular tools. Querying a wide variety of data sources was, and still
> is, one of Drill's biggest selling points.
> 
> 4. Make the code highly extensible and extremely friendly to contributions.
> No one wants to spend years learning before making a contribution. This is
> why I want to see a lot of modules that are highly cohesive and define
> clear APIs for interaction with each other. This is also about paying off old
> technical debt: the fat JDBC client, the copy of the web server in Drill on
> YARN, mixing everything in the exec module, etc.
> 
> 5. Focus on performance improvements of every component, from query
> planning to execution.
> 
> These are my thoughts from a developer's perspective. Since I'm just a
> developer from Ukraine and far, far away from Drill users, I believe that
> Charles Givre is the one who can build a strong Drill user community and
> collect their requirements for us.
> 
> 
> Regarding Volodymyr's suggestion about adapting Arrow and Drill
> vectors to work together (the same step is required to implement an Arrow
> client, as suggested by Paul):
> I'm totally against the idea because it brings a huge amount of unnecessary
> complexity just to gain small insights into the integration. First, it
> goes against the whole idea of Arrow, which is to provide a unified
> columnar memory layout shared between different tools without any data
> conversions. But this step requires exactly such conversions: at the very
> least, our nullability vector and their validity bitmaps are not the same,
> and our Dict vector and their meaning of Dict may also require data
> conversion.
> Another source of waste is the difference in metadata contracts; who knows
> whether it's even possible to combine them. Another problem, as I already
> mentioned, is the huge complexity of the work.
> To do the work I would have to overcome all the underlying pitfalls of both
> projects; in addition, I would have to cover all the untestable code with a
> comprehensive set of tests to show that back-and-forth conversion is done
> correctly for every single unit of data in both vectors. The idea of
> adapters and clients is about 4 years old or more, and no one has done
> practical work to implement it. I think I have explained why.
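
[Editor's illustration, not part of the original message.] To make the
nullability mismatch concrete, here is a minimal sketch. It assumes a
Drill-style nullable vector keeps one "is set" byte per value, while an
Arrow-style validity bitmap packs one bit per value, least-significant bit
first. The function names are hypothetical and come from neither project;
the only point is that each direction needs a full pass over the data.

```python
def bytes_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags (1 = value present) into an
    LSB-first bitmap with one bit per value."""
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def validity_bitmap_to_bytes(bitmap: bytes, count: int) -> bytes:
    """Unpack the first `count` bits of an LSB-first bitmap into
    one-byte-per-value flags."""
    return bytes((bitmap[i // 8] >> (i % 8)) & 1 for i in range(count))
```

The value count has to travel separately from the bitmap, because the last
byte may contain padding bits.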
> 
> What I really like in Volodymyr's and Paul's suggestions is that we can
> extract a clear API from the existing EVF implementation and in practice
> provide Arrow or any other implementation for it. Who knows, maybe with new,
> improved garbage collectors, using direct memory is not necessary at all? It
> is quite clear that we need a middle layer between operators and memory, and
> we need extensive benchmarks over that layer and experiments to show which
> underlying memory is best for Drill.
> 
> As for client tool compatibility, the only solution I can see is to
> provide new clients for Drill 2.0. Although I agree that this is a
> tremendous amount of work, there is no other way to make major steps into
> the future. Without it, we would sit back and watch while Drill slowly
> dies and gives way to its competitors.
> 
> NOTE: I want to encourage everyone to join the discussion and share their
> vision of what should be included in Drill 2.0 and which strategic goals we
> want to achieve in the future.
> 
> Kind regards,
> Igor
> 
> 
> On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
> 
>> Hi Volodymyr,
>> 
>> All good points. The Arrow/Drill conversion is a good option, especially
>> for readers and clients. Between operators, such conversion is likely to
>> introduce performance hits. As you know, the main feature that
>> differentiates one query engine from another is performance, so adding
>> conversions is unlikely to help Drill in the performance battle.
>> 
>> Flatten should actually be pretty simple with EVF. Creating repeated
>> values is much like filling in implicit columns: set a value, then "copy
>> down" n times.
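
[Editor's illustration, not part of the original message.] A minimal sketch
of the "copy down" idea described above, using plain Python lists in place
of vectors. The function name and the list-based representation are
assumptions for illustration only; this is not Drill's Flatten operator or
the EVF API.

```python
def flatten(parent_values, repeated_values):
    """Pair each element of a repeated column with its parent row's value."""
    out_parent, out_element = [], []
    for parent, elements in zip(parent_values, repeated_values):
        # "Copy down": repeat the parent value once per array element.
        out_parent.extend([parent] * len(elements))
        out_element.extend(elements)
    return out_parent, out_element
```

In this sketch a row whose array is empty contributes no output rows.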
>> 
>> Still, you raise good issues. Operators that fit your description are
>> things like exchanges: these operators want to work at the low level of
>> buffers. Not much the column readers/writers can do to help. And, as you
>> point out, commercial components will be a challenge as Apache Drill does
>> not maintain that code.
>> 
>> Your larger point is valid: no matter how we approach it, moving to Arrow
>> is a large project that will break compatibility.
>> 
>> We've discussed a simple first step: Support an Arrow client to see if
>> there is any interest. Support Arrow readers to see if that gives us any
>> benefit. These are the visible tip of the iceberg. If we see advantages, we
>> can then think about changing the internals: the vast bulk of the iceberg,
>> which is below water and unseen.
>> 
>> I think I disagree that we'd want to swap code that works directly with
>> ValueVectors to code that works directly with ArrowVectors. Doing so locks
>> us into someone else's memory format and forces Drill to change every time
>> Arrow changes. Given the small size of the Drill team and the frantic pace
>> of Arrow change, this was the team's concern early on, and I'm still not
>> convinced this is a good strategy.
>> 
>> So, my original point remains: what is the benefit of all this cost? Is
>> there another path that gives us greater benefit for the same or lesser
>> cost? In short, what is our goal? Where do we want Drill to go?
>> 
>> In fact, another radical suggestion is to embrace the wonderful work done
>> on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for
>> local files (Drill's unique strength), and embrace Presto's great support
>> for data types, connectors, UDFs, clients and so on.
>> 
>> As a team, we should ask the fundamental question: What benefits can Drill
>> offer that are not already offered by, say, Presto or commercial Drill
>> derivatives? If we can answer that question, we'll have a better idea
>> about whether investment in Arrow will get us there.
>> 
>> Or, are we better off to just leave well enough alone as we have done for
>> several years?
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi <
>> volodymyr@apache.org> wrote:
>> 
>> Hi all,
>> 
>> Glad to see that this discussion became active again!
>> 
>> I have some comments regarding the steps for moving from Drill Vectors to
>> Arrow Vectors.
>> 
>> No doubt that using EVF for all operators and readers instead of value
>> vectors will simplify things a lot.
>> But considering the target goal - integration with Arrow, it may be the
>> main show-stopper for it.
>> There may be some operators that would be hard to adapt to EVF; for
>> example, I think the Flatten operator will be among them, since its
>> implementation is deeply connected with value vectors.
>> Also, it requires moving all storage and format plugins to EVF, which
>> may also be problematic; for example, some plugins like MaprDB have specific
>> features that should be considered when moving to EVF.
>> Some other plugins are so obsolete that I'm not sure they still work
>> or that anyone still uses them, so besides moving them to EVF, they would
>> need to be resurrected to verify that they weren't broken more than before.
>> 
>> This is a huge piece of work, and only after that, we will proceed with the
>> next step - integrating Arrow to EVF and then handling new Arrow-related
>> issues for all the operators and readers at the same time.
>> 
>> I propose to update these steps a little bit.
>> 1. I agree that at first, we should extract EVF-related classes into a
>> separate module.
>> 2. But as the next step, I propose to extract EVF API which doesn't depend
>> on the vector implementation (Drill vectors, or Arrow ones).
>> 3. After that, introduce module with Arrow which also implements this EVF
>> API.
>> 4. Introduce transformers that will be able to convert from Drill vectors
>> into Arrow vectors and vice versa. These transformers may be implemented to
>> work using EVF abstractions instead of operating with specific vector
>> implementations.
>> 
>> 5.1. At this point, we can introduce Arrow connectors to fetch the data in
>> Arrow format or return it in such a format using transformers from step 4.
>> 
>> 5.2. Also, at this point, we may start rewriting operators to EVF and
>> switching the EVF implementation from the one based on Drill Vectors to the
>> one which uses Arrow Vectors, or switching implementations for
>> existing EVF-based format plugins and fixing newly discovered issues in Arrow.
>> Since at this point we will have operators which use the Arrow format and
>> operators which use the Drill Vectors format, we should insert the
>> transformers introduced in step 4 between every pair of adjacent operators
>> whose batches are in different formats.
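
[Editor's illustration, not part of the original message.] A rough sketch of
steps 2 to 4: a column API that does not depend on the storage layout, two
interchangeable backends, and a transformer written only against the API.
All class and function names here are hypothetical stand-ins; none of them
is the real EVF, Drill vector, or Arrow vector interface.

```python
from abc import ABC, abstractmethod


class IntColumn(ABC):
    """Hypothetical layout-independent API for a nullable int column."""

    @abstractmethod
    def append(self, value): ...

    @abstractmethod
    def get(self, index): ...

    @abstractmethod
    def __len__(self): ...


class ByteFlagColumn(IntColumn):
    """Backend A: one 'is set' byte per value."""

    def __init__(self):
        self._values = []
        self._is_set = bytearray()

    def append(self, value):
        self._values.append(0 if value is None else value)
        self._is_set.append(0 if value is None else 1)

    def get(self, index):
        return self._values[index] if self._is_set[index] else None

    def __len__(self):
        return len(self._values)


class BitmapColumn(IntColumn):
    """Backend B: validity packed one bit per value, LSB first."""

    def __init__(self):
        self._values = []
        self._validity = bytearray()

    def append(self, value):
        index = len(self._values)
        if index % 8 == 0:
            self._validity.append(0)  # start a new bitmap byte
        if value is not None:
            self._validity[index // 8] |= 1 << (index % 8)
        self._values.append(0 if value is None else value)

    def get(self, index):
        valid = (self._validity[index // 8] >> (index % 8)) & 1
        return self._values[index] if valid else None

    def __len__(self):
        return len(self._values)


def transform(source, target):
    """Copy a column between backends using only the shared API."""
    for i in range(len(source)):
        target.append(source.get(i))
    return target
```

Because `transform` touches only `append`, `get`, and `len`, the same
function works in both directions. It also copies value by value, which is
the kind of per-batch cost the surrounding messages are weighing.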
>> 
>> I know that such an approach requires some additional work, like
>> introducing the transformers from step 4, and may cause some performance
>> degradation when format transformation is complex for some
>> types and while we still have sequences of operators with different formats.
>> 
>> But with this approach, transitioning to Arrow wouldn't be blocked until
>> everything is moved to EVF; it would be possible to transition
>> step by step, and Drill would still be able to switch between formats if
>> required.
>> 
>> Kind regards,
>> Volodymyr Vysotskyi
>> 
>> 
>> On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <ihor.huzenko.igs@gmail.com>
>> wrote:
>> 
>>> Hi Paul,
>>> 
>>> Though I have very limited knowledge about Arrow at the moment, I can
>>> highlight a few advantages of trying it:
>>> 1. It allows fixing all the long-standing nullability issues and providing
>>> better integration for storage plugins like Hive.
>>>               https://jira.apache.org/jira/browse/DRILL-1344
>>>               https://jira.apache.org/jira/browse/DRILL-3831
>>>               https://jira.apache.org/jira/browse/DRILL-4824
>>>               https://jira.apache.org/jira/browse/DRILL-7255
>>>               https://jira.apache.org/jira/browse/DRILL-7366
>>> 2. Some work was done by the community to implement optimized Arrow readers
>>> for Parquet and other formats and tools. We could try to adopt them and
>>> check whether we can benefit from them.
>>> 3. Since Arrow is under active development, we could try its newest
>>> features, like Flight, which promises improved data transfers over the
>>> network.
>>> 
>>> Thanks,
>>> Igor
>>> On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <par0328@yahoo.com.invalid>
>>> wrote:
>>> 
>>>> Hi Igor,
>>>> 
>>>> Before diving into design issues, it may be worthwhile to think about the
>>>> premise: should Drill adopt Arrow as its internal memory layout? This is
>>>> the question that the team has wrestled with since Arrow was launched.
>>>> Arrow has three parts. Let's think about each.
>>>> 
>>>> First is a direct memory layout. The approach you suggest will let us work
>>>> with the Arrow memory format. Use EVF to access vectors, then the
>>>> underlying vectors can be swapped from Drill to Arrow. But, what is the
>>>> advantage of using Arrow? The Arrow layout isn't better than Drill's; it is
>>>> just different. Adopting the Arrow memory layout by itself provides little
>>>> benefit, but big cost. This is one reason the team has been so reluctant to
>>>> adopt Arrow.
>>>> 
>>>> The only advantage of using the Arrow memory layout is if Drill could
>>>> benefit from code written for Arrow. The second part of Arrow is a set of
>>>> modules to manipulate vectors. Gandiva is the most prominent example.
>>>> However, there are major challenges. Most SQL operations are defined to
>>>> work on rows; some clever thinking will be needed to convert those
>>>> operations into a series of column operations. (Drill's codegen is NOT
>>>> columnar: it works row-by-row.) So, if we want to benefit from Gandiva, we
>>>> must completely rethink how we process batches.
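
[Editor's illustration, not part of the original message.] A small sketch of
the difference between row-by-row and columnar evaluation, for a predicate
like `price * qty > threshold`. The column names and functions are invented
for the example; this is neither Drill's generated code nor Gandiva's API.

```python
def filter_rows(rows, threshold):
    """Row-at-a-time: evaluate the whole predicate for one row, then move on."""
    return [row for row in rows if row["price"] * row["qty"] > threshold]


def filter_columns(price, qty, threshold):
    """Column-at-a-time: each step consumes whole columns and yields a new one."""
    total = [p * q for p, q in zip(price, qty)]      # step 1: multiply columns
    keep = [t > threshold for t in total]            # step 2: compare to constant
    selected = [i for i, k in enumerate(keep) if k]  # step 3: selection vector
    return selected
```

The columnar version splits one expression into a series of passes with
intermediate columns, which is the restructuring the paragraph above
refers to.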
>>>> 
>>>> Is it worth doing all that work? The primary benefit would be performance.
>>>> But, it is not clear that our current implementation is the bottleneck. The
>>>> current implementation is row-based, code generated in Java. It would be
>>>> great for someone to do some benchmarks to show the benefit from adopting
>>>> Gandiva to see if the potential gain justifies the likely large development
>>>> cost.
>>>> 
>>>> The third advantage of using Arrow is to allow exchange of vectors between
>>>> Drill and Arrow-based clients or readers. As it turns out, this is not the
>>>> big win it seems. As we've discussed, we could easily create an Arrow-based
>>>> client for Drill -- there will be an RPC between the client and Drill and
>>>> we can use that to do format conversion.
>>>> 
>>>> For readers, Drill will want control over batch sizes; Drill cannot
>>>> blindly accept whatever size vectors a reader chooses to produce. (More on
>>>> that later.) Incoming data will be subject to projection and selection, so
>>>> it will quickly move out of the incoming Arrow vectors into vectors which
>>>> Drill creates.
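
[Editor's illustration, not part of the original message.] A minimal sketch
of why reader output ends up copied into engine-owned batches: incoming
batches of arbitrary size are re-cut to a size the engine chooses. The
function name is hypothetical, and the limit is a row count purely for
simplicity; a real engine might bound bytes instead.

```python
def rebatch(incoming_batches, max_rows):
    """Yield batches of at most `max_rows` rows, regardless of input sizes."""
    pending = []
    for batch in incoming_batches:
        for row in batch:
            pending.append(row)  # copy into the engine-owned batch
            if len(pending) == max_rows:
                yield pending
                pending = []
    if pending:
        yield pending  # final, possibly short, batch
```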
>>>> 
>>>> Arrow gets (or got) a lot of press. However, our job is to focus on what's
>>>> best for Drill. There actually might be a memory layout for Drill that is
>>>> better than Arrow (and better than our current vectors). A couple of us did
>>>> a prototype some time ago that seemed to show promise. So, it is not clear
>>>> that adopting Arrow is necessarily a huge win: maybe it is, maybe not. We
>>>> need to figure it out.
>>>> 
>>>> What IS clearly a huge win is the idea you outlined: creating a layer
>>>> between memory layout and the rest of Drill so that we can try out
>>>> different memory layouts to see what works best.
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>   On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <
>>>> ihor.huzenko.igs@gmail.com> wrote:
>>>> 
>>>> Hello Paul,
>>>> 
>>>> I totally agree that integrating Arrow by simply replacing Vectors usage
>>>> everywhere will cause a disaster.
>>>> After the first look at the new *E*nhanced*V*ector*F*ramework and based on
>>>> your suggestions I think I have an idea to share.
>>>> In my opinion, the integration can be done in two major stages:
>>>> 
>>>> *1. Preparation Stage*
>>>>     1.1 Extract all EVF and related components to a separate module, so
>>>>         that the new module depends only on the Vectors module.
>>>>     1.2 Rewrite all operators step by step to use the higher-level
>>>>         EVF module, and remove the Vectors module from the dependencies
>>>>         of exec and other modules.
>>>>     1.3 Ensure that the only module which depends on Vectors is the new
>>>>         EVF one.
>>>> *2. Integration Stage*
>>>>     2.1 Add a dependency on the Arrow Vectors module to the EVF module.
>>>>     2.2 Replace all usages of Drill Vectors & Protobuf Meta with Arrow
>>>>         Vectors & Flatbuffers Meta in the EVF module.
>>>>     2.3 Finalize the integration by removing the Drill Vectors module
>>>>         completely.
>>>> 
>>>> 
>>>> *NOTE:* I think that in any case we won't preserve backward compatibility
>>>> for drivers and custom UDFs.
>>>> The proposed changes are a major step forward, to be included in the
>>>> Drill 2.0 version.
>>>> 
>>>> 
>>>> Below is the very first list of packages that in future may be transformed
>>>> into the EVF module:
>>>> *Module:* exec/Vectors
>>>> *Packages:*
>>>> org.apache.drill.exec.record.metadata - (An enhanced set of classes to
>>>> describe a Drill schema.)
>>>> org.apache.drill.exec.record.metadata.schema.parser
>>>> 
>>>> org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
>>>> each kind of Drill vector.)
>>>> org.apache.drill.exec.vector.accessor.convert
>>>> org.apache.drill.exec.vector.accessor.impl
>>>> org.apache.drill.exec.vector.accessor.reader
>>>> org.apache.drill.exec.vector.accessor.writer
>>>> org.apache.drill.exec.vector.accessor.writer.dummy
>>>> 
>>>> *Module:* exec/Java Execution Engine
>>>> *Packages:*
>>>> org.apache.drill.exec.physical.rowSet - (Record batches management)
>>>> org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
>>>> mgmt)
>>>> org.apache.drill.exec.physical.impl.scan - (Row set based scan)
>>>> 
>>>> Thanks,
>>>> Igor Guzenko
>>>> 
>>>> 
>>> 
>> 

