drill-user mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: Looking for advice on integrating with a custom data source
Date Wed, 15 Jan 2020 14:19:13 GMT
Andy, 
Glad to hear you got it working!!   Can you share what data source you are working with? 
Is it completely custom to your organization?  If not, would you consider submitting this
as a pull request?
Best,
-- C



> On Jan 15, 2020, at 9:07 AM, Andy Grove <andygrove73@gmail.com> wrote:
> 
> And boom! With just 3 extra lines of code to adjust the CBO to make the row
> count inversely proportional to the number of predicates, my little PoC
> works :-)
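The heuristic Andy describes can be sketched like this (the class and the numbers are illustrative, not his actual code): report a row-count estimate that shrinks as more predicates are pushed down, so the cost-based optimizer sees the filtered scan as cheaper.

```java
// Illustrative sketch only: the estimate falls as more predicates are pushed,
// which is what convinces the CBO to keep the pushed-down version of the scan.
public class PushdownCostSketch {

    /** Estimated rows, inversely proportional to the number of pushed predicates. */
    static double estimatedRows(double baseRows, int pushedPredicates) {
        return baseRows / (pushedPredicates + 1);
    }

    public static void main(String[] args) {
        System.out.println(estimatedRows(900_000, 0)); // no pushdown: 900000.0
        System.out.println(estimatedRows(900_000, 2)); // two predicates: 300000.0
    }
}
```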
> 
> Now that I've achieved the instant gratification (relatively speaking!) of
> making something work, I think it's time to step back and start doing this
> the right way with the PR you mentioned.
> 
> I would not have been able to get this working at all without all the
> fantastic support!
> 
> Thanks,
> 
> Andy.
> 
> 
> 
> On Tue, Jan 14, 2020 at 11:43 PM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
> 
>> Hi Andy,
>> 
>> Congratulations on making such fast progress!
>> 
>> The code to do filter pushdowns is rather complex and, it seems, most
>> plugins copy/paste the same wad of code (with the same bugs). PR 1914
>> provides a layer that converts the messy Drill logical plan into a nice,
>> simple set of predicates. You can then pick and choose which to push down,
>> allowing the framework to do the rest.
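The pick-and-choose step Paul describes can be sketched as follows. This is a hedged illustration: the names are hypothetical, not PR 1914's actual API. The idea is simply that, given the flat predicate list the framework hands you, you keep what the backend can evaluate and leave the rest for Drill's own Filter operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of partitioning predicates into "pushed" vs "retained".
public class PredicatePick {

    static List<String> pushable(List<String> predicates, Predicate<String> backendSupports) {
        List<String> pushed = new ArrayList<>();
        for (String p : predicates) {
            if (backendSupports.test(p)) {
                pushed.add(p); // the backend evaluates this one
            }
            // anything not pushed stays in the plan for Drill to evaluate
        }
        return pushed;
    }

    public static void main(String[] args) {
        List<String> preds = List.of("ts > 100", "name LIKE '%x%'", "id = 7");
        // Suppose the backend only understands simple comparisons:
        System.out.println(pushable(preds, p -> !p.contains("LIKE")));
        // [ts > 100, id = 7]
    }
}
```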
>> 
>> Note that most of the plugins do push-down as part of physical planning.
>> While this works in most cases, it WILL NOT work if you are doing push-down
>> in order to shard the scan. For example, in order to divide a time range up
>> into pieces for a time series scan. The PR thus does push-down in the
>> logical phase so that we can "do the right thing."
>> 
>> When you say that getNewWithChildren() is for an earlier instance, it is
>> very likely because Calcite gave up on your filter-push-down version
>> because there was no cost reduction.
>> 
>> 
>> The Wiki page mentioned earlier explains all the copies a bit. Basically,
>> Drill creates many copies of your GroupScan as it proceeds. First a "blank"
>> one, then another with projected columns, then another full copy as Calcite
>> explores planning options, and so on.
>> 
>> One key trick is that if you implement filter push down, you MUST return a
>> lower cost estimate after the push-down than before. Else, Calcite decides
>> that it is not worth the hassle of doing the push-down if the costs remain
>> the same. See the Wiki for details. This is what getScanStats() does:
>> report stats that must get lower as you improve the scan.
>> 
>> That is, one cost at the start, a lower cost after projection push down
>> (reflecting the fact that we presumably now read less data per row) and a
>> lower cost again after filter-push down (because we read fewer rows.) There
>> is a "Dummy" storage plugin in PR 1914 that illustrates all of this.
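The cost staircase Paul describes can be made concrete with a toy model (this is not Drill's real ScanStats class; the numbers are invented). The only point is that the cost reported at each planning stage must be strictly lower than the one before, or Calcite discards the pushed-down alternative.

```java
// Hedged illustration: cost = rows * bytes-per-row, falling at each stage.
public class ScanCostStages {

    static double cost(double rowCount, double bytesPerRow) {
        return rowCount * bytesPerRow;
    }

    public static void main(String[] args) {
        double initial      = cost(1_000_000, 100); // blank scan, all columns
        double afterProject = cost(1_000_000, 40);  // projection pushdown: narrower rows
        double afterFilter  = cost(200_000, 40);    // filter pushdown: fewer rows
        // Calcite only keeps the pushed-down plan if each step is cheaper:
        System.out.println(initial > afterProject && afterProject > afterFilter); // true
    }
}
```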
>> 
>> Don't worry about getDigest(); it is just Calcite trying to get a label to
>> use for its internal objects. You will need to implement getString(), using
>> Drill's "EXPLAIN PLAN" format, so your scan can appear in the text plan
>> output. EXPLAIN PLAN output is:
>> 
>> ClassName [field1=x, field2=y]
>> 
>> There is a little builder in PR 1914 to do this for you.
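A hand-rolled version of that builder might look like the sketch below (PR 1914 ships the real one; this only demonstrates the `ClassName [field1=x, field2=y]` shape that EXPLAIN PLAN output expects).

```java
import java.util.StringJoiner;

// Minimal sketch of building an EXPLAIN PLAN label: "ClassName [f1=x, f2=y]".
public class PlanStringSketch {

    static String planString(String className, String... fields) {
        StringJoiner joined = new StringJoiner(", ", className + " [", "]");
        for (String f : fields) {
            joined.add(f);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // "MyDbGroupScan" is a hypothetical plugin class name.
        System.out.println(planString("MyDbGroupScan", "columns=[a, b]", "filters=[x > 10]"));
        // MyDbGroupScan [columns=[a, b], filters=[x > 10]]
    }
}
```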
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove <
>> andygrove73@gmail.com> wrote:
>> 
>> With some extra debugging I can see that the getNewWithChildren call is
>> made to an earlier instance of GroupScan and not the instance created by
>> the filter push-down rule. I'm wondering if this is some kind of
>> hashCode/equals/toString/getDigest issue?
>> 
>> On Tue, Jan 14, 2020 at 7:52 PM Andy Grove <andygrove73@gmail.com> wrote:
>> 
>>> I'm now working on predicate push down ... I have a filter rule that is
>>> correctly extracting the predicates that the backend database supports and
>>> I am creating a new GroupScan containing these predicates, using the Kafka
>>> plugin as a reference. I see the GroupScan constructor being called after
>>> this, with the predicates populated. So far so good ... but then I see
>>> calls to getDigest, getScanStats, and getNewWithChildren, and then I see
>>> calls to the GroupScan constructor with the predicates missing.
>>> 
>>> Any pointers on what I might be missing? Is there more magic I need to
>>> know?
>>> 
>>> Thanks!
>>> 
>>> On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers <par0328@yahoo.com.invalid>
>>> wrote:
>>> 
>>>> Hi Andy,
>>>> 
>>>> Congrats! You are making good progress. Yes, the BatchCreator is a bit of
>>>> magic: Drill looks for a subclass that has your SubScan subclass as the
>>>> second parameter. Looks like you figured that out.
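The lookup Paul describes can be modeled with a toy reflection sketch (the class names are hypothetical, not Drill's real ones): the framework matches a creator to a scan by the type of the second parameter of its `getBatch(...)` method, which is why simply adding the class is enough to wire it up.

```java
import java.lang.reflect.Method;

// Toy model of matching a BatchCreator to a SubScan by parameter type.
public class CreatorLookupSketch {

    static class MyDbSubScan {}

    static class MyDbScanBatchCreator {
        // The second parameter's type is what the framework matches on.
        public String getBatch(Object context, MyDbSubScan subScan) {
            return "reader for " + subScan.getClass().getSimpleName();
        }
    }

    /** Does this creator declare a getBatch whose second parameter is the SubScan type? */
    static boolean handles(Class<?> creator, Class<?> subScanType) {
        for (Method m : creator.getDeclaredMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (m.getName().equals("getBatch") && params.length == 2
                    && params[1].equals(subScanType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(handles(MyDbScanBatchCreator.class, MyDbSubScan.class)); // true
    }
}
```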
>>>> 
>>>> Thanks,
>>>> - Paul
>>>> 
>>>> 
>>>> 
>>>>   On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <
>>>> andygrove73@gmail.com> wrote:
>>>> 
>>>> Actually I managed to get past that error with an educated guess that if
>>>> I created a BatchCreator class, it would automagically be picked up
>>>> somehow. I'm now at the point where my RecordReader is being invoked!
>>>> 
>>>> On Sun, Jan 12, 2020 at 2:03 PM Andy Grove <andygrove73@gmail.com>
>> wrote:
>>>> 
>>>>> Between reading the tutorial and copying and pasting code from the Kudu
>>>>> storage plugin, I've been making reasonable progress with this but am a
>>>>> bit confused by one error I'm now hitting:
>>>>> ExecutionSetupException: Failure finding OperatorCreator constructor for
>>>>> config com.mydb.MyDbSubScan
>>>>> Prior to this, Drill had called getSpecificScan and then called a few of
>>>>> the methods on my subscan object. I wasn't sure what to return for
>>>>> getOperatorType so just returned the Kudu subscan operator type and I'm
>>>>> wondering if the issue is related to that somehow?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> On Sat, Jan 11, 2020 at 10:13 PM Andy Grove <andygrove73@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Thank you both for the those responses. This is very helpful. I have
>>>>>> ordered a copy of the book too. I'm using Drill 1.17.0.
>>>>>> 
>>>>>> I'll take a look at the JDBC storage plugin code and see if it would be
>>>>>> feasible to add the logic I need there. In parallel, I've started
>>>>>> implementing a new storage plugin. I'll be working on this more tomorrow
>>>>>> and I'm sure I'll be back with more questions soon.
>>>>>> 
>>>>>> Thanks again for your help!
>>>>>> 
>>>>>> Andy.
>>>>>> 
>>>>>> On Sat, Jan 11, 2020 at 6:03 PM Charles Givre <cgivre@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> Hi Andy,
>>>>>>> Thanks for your interest in Drill.  I'm glad to see that Paul wrote you
>>>>>>> back as well.  I was going to say I thought the JDBC storage plugin did
>>>>>>> in fact push down columns and filters to the source system.
>>>>>>> 
>>>>>>> Also, what version of Drill are you using?
>>>>>>> 
>>>>>>> Writing a storage plugin for Drill is not trivial and I'd definitely
>>>>>>> recommend using the code from Paul's PR as that greatly simplifies
>>>> things.
>>>>>>> Here is a tutorial as well:
>>>>>>> https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
>>>>>>> 
>>>>>>> If you need additional help, please let us know.
>>>>>>> -- C
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 11, 2020, at 5:57 PM, Andy Grove <andygrove73@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'd like to use Apache Drill with a custom data source that supports a
>>>>>>> subset of SQL.
>>>>>>> 
>>>>>>> My goal is to have Drill push selection and predicates down to my data
>>>>>>> source but the rest of the query processing should take place in Drill.
>>>>>>> 
>>>>>>> I started out by writing a JDBC driver for the data source and
>>>>>>> registering that with Drill using the JDBC storage plugin but it seems
>>>>>>> to just pass the whole query through to my data source, so that approach
>>>>>>> isn't going to work unless I'm missing something?
>>>>>>> 
>>>>>>> Is there any way to configure the JDBC storage plugin to only push
>>>>>>> certain parts of the query to the data source?
>>>>>>> 
>>>>>>> If this isn't a good approach, do I need to write a custom storage
>>>>>>> plugin? Can these be added on the classpath or would that require me
>>>>>>> maintaining a fork of the project?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I appreciate any pointers anyone can give me.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Andy.
>>>>>>> 

