drill-user mailing list archives

From Alexander Zarei <alexanderz.si...@gmail.com>
Subject Re: Recursive CTE Support in Drill
Date Thu, 27 Aug 2015 23:32:15 GMT
I just wanted to thank you Ted, Jacques and Jason for suggesting and helping
craft the query, and also for providing the TPC-H Parquet data. I was able to
get a huge flow of data out of a Drillbit without having to read much from
storage by using the cross join! Basically, I read the first few hundred rows
and cross join three copies of them to get 48 million rows.

There is an initial delay of around 20 seconds while the cross join gets going,
but then the data floods out.
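
For reference, a minimal sketch of the kind of query this describes follows; the
table and column names are placeholders rather than the exact TPC-H table I
used, and the LIMIT values are only illustrative:

  SELECT t1.c1, t2.c1, t3.c1
  FROM  (SELECT column1 AS c1, 1 AS join_key FROM some_table LIMIT 365) t1,
        (SELECT column1 AS c1, 1 AS join_key FROM some_table LIMIT 365) t2,
        (SELECT column1 AS c1, 1 AS join_key FROM some_table LIMIT 365) t3
  WHERE t1.join_key = t2.join_key AND t1.join_key = t3.join_key
  -- 365 * 365 * 365 is roughly 48 million output rows from a few hundred stored rows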

Thanks,
Alex


On Mon, Jul 20, 2015 at 2:51 PM, Alexander Zarei <alexanderz.simba@gmail.com> wrote:

> Thanks for elaborating further, Ted, Jacques and Jason!
>
> @Ted that is a very cool idea. I tried the cross join but found that an
> explicit cross join is not supported in Drill yet; we have DRILL-786 open for
> it. The new method looks very promising. It seems to be an implicit cross
> join, doesn't it? I just tried it out and it worked like a charm. I will go
> on with this method.
>
> @Jacques, yes, as Jason said, we discussed this before, and I have talked to
> my colleagues about helping me modify the ODBC driver so it can send a plan.
> Also, thanks for the query. I tried it out for two tables and it worked fine,
> but extending it to three tables gives me a syntax error:
>
> select * from
> ((select column1, 1 as join_keyb from
>  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
>   Join
>  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
>   on t1.join_key=t2.join_key) t12
> Join
> (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
>  on t12.join_keyb=t3.join_key)
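>
> One possible rewrite of the nested form, qualifying the ambiguous column1 and
> dropping the outermost parentheses, might look like the following; this is an
> untested sketch:
>
>  select t12.column1 from
>   (select t1.column1, 1 as join_keyb from
>    (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1
>    join
>    (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2
>    on t1.join_key = t2.join_key) t12
>   join
>   (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
>   on t12.join_keyb = t3.join_key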
>
>
>
> The other syntax, joining the three tables with a comma list, was easier for
> me to use, so I could test with:
>
>
>  select t1.column1 from
>  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t1,
>  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t2,
>  (SELECT column1, 1 as join_key FROM `hive43`.`default`.`double_table`) t3
>  where
>  t1.join_key=t2.join_key and t1.join_key=t3.join_key
>
>
> Thank you very much for your time Ted, Jacques and Jason!
>
> Thanks,
> Alex
>
> On Fri, Jul 17, 2015 at 2:09 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
>
>> Jacques,
>>
>> Alexander has brought up this problem previously in one of the hangouts and
>> said that submitting a physical plan was not possible through ODBC. If he is
>> able to modify the driver code to make it possible to submit one, that would
>> be an option, as I believe the C++ client is capable of submitting plans. The
>> issue I seem to recall him mentioning is that the ODBC driver was running
>> some sanity checks on the SQL query to try to prevent submitting complete
>> garbage queries to a server. I think he had concerns that a JSON-formatted
>> physical plan would fail these checks and that he would have to disable them,
>> along with trying to allow submitting two types of queries from ODBC.
>>
>> On Fri, Jul 17, 2015 at 8:52 AM, Jacques Nadeau <jacques@dremio.com> wrote:
>>
>> > Removing cross posting
>> >
>> > Alexander,
>> >
>> > There is currently no way for Drill to generate a large amount of data
>> > using SQL.  However, you can generate large generic data by using the
>> > MockStoragePlugin if you submit a plan.  You can find an example plan using
>> > this at [1].
>> >
>> > I heard someone might be working on extending the MockStoragePlugin to
>> > support SQL, which would provide the outcome you requested.
>> >
>> > [1]
>> > https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/mock-scan.json
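>> >
>> > Roughly, such a plan pairs a "mock-scan" operator, whose entries describe
>> > how many records of which types to generate, with a "screen" operator. The
>> > sketch below is only an approximation from memory of the linked file, so
>> > treat the exact field names as illustrative and use [1] as the authoritative
>> > example:
>> >
>> >   {
>> >     head: { type: "APACHE_DRILL_PHYSICAL", version: 1, generator: { type: "manual" } },
>> >     graph: [
>> >       { @id: 1, pop: "mock-scan",
>> >         entries: [ { records: 1000000, types: [
>> >           { name: "blue",  type: "INT",    mode: "REQUIRED" },
>> >           { name: "green", type: "BIGINT", mode: "REQUIRED" } ] } ] },
>> >       { @id: 2, pop: "screen", child: 1 }
>> >     ]
>> >   }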
>> >
>> > On Thu, Jul 16, 2015 at 10:16 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >
>> > >
>> > > Also, just doing a Cartesian join of three copies of 1000 records will
>> > > give you a billion records with negligible I/O.
>> > >
>> > > Sent from my iPhone
>> > >
>> > > > On Jul 16, 2015, at 15:43, Jason Altekruse <altekrusejason@gmail.com> wrote:
>> > > >
>> > > > @Alexander If you want to test the speed of the ODBC driver you can do
>> > > > that without a new storage plugin.
>> > > >
>> > > > If you get the entire dataset into memory, it will be returned from
>> > > > Drill as quickly as we can possibly send it to the client. One way to do
>> > > > this is to insert a sort; we cannot send along any of the data until the
>> > > > complete sort is done. As long as you don't read so much data that we
>> > > > will start spilling the sort to disk, all of the records will be in
>> > > > memory. To take the read and sort time out of your test, just make sure
>> > > > to record the time you first receive data from Drill, not the query
>> > > > start time.
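>> > > >
>> > > > For instance (a minimal sketch; the table and column names are just
>> > > > placeholders), a query like
>> > > >
>> > > >   SELECT * FROM some_table ORDER BY column1
>> > > >
>> > > > cannot return its first row until the sort has buffered the whole result.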
>> > > >
>> > > > There is one gotcha here. To make the BI tools more responsive, we
>> > > > implemented a feature that will send along one empty batch of records
>> > > > with the schema information populated. This schema is generated by
>> > > > applying all of the transformations that happen throughout the query.
>> > > > For example, the join operator handles this schema population by
>> > > > sending along the schema merged from the two sides of the join, and
>> > > > project will similarly add or remove columns based on the expressions
>> > > > and columns requested. You will want to make sure you record your start
>> > > > time when you receive the first batch with actual records. This can
>> > > > give you an accurate measurement of the ODBC performance, removing the
>> > > > bottleneck of the disk.
>> > > >
>> > > > On Thu, Jul 16, 2015 at 3:24 PM, Alexander Zarei <alexanderz.simba@gmail.com> wrote:
>> > > >
>> > > >> Thanks for the answers.
>> > > >>
>> > > >> @Ted my only goal is to pump a large amount of data out without having
>> > > >> to read from the hard disk. I am measuring the ODBC driver performance
>> > > >> and I need a higher data transfer rate, so any method that helps pump
>> > > >> data out of Drill faster would help. log-synth seems a good way to
>> > > >> generate data for testing; however, I'd need a RAM-only option which
>> > > >> hopefully provides a higher throughput.
>> > > >>
>> > > >> @Jacques How involved is it to write a dummy plugin that returns one
>> > > >> hardcoded row repeatedly, 12 million times?
>> > > >>
>> > > >> Thanks,
>> > > >> Alex
>> > > >>
>> > > >> On Fri, Jul 10, 2015 at 12:56 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > > >>
>> > > >>> It may be easy, but it is completely opaque about what really needs
>> > > >>> to happen.
>> > > >>>
>> > > >>> For instance,
>> > > >>>
>> > > >>> 1) how is schema exposed?
>> > > >>>
>> > > >>> 2) which classes do I really need to implement?
>> > > >>>
>> > > >>> 3) how do I express partitioning of a format?
>> > > >>>
>> > > >>> 4) how do I test it?
>> > > >>>
>> > > >>> Just a bit of documentation and comments would go a very, very long
>> > > >>> way.
>> > > >>>
>> > > >>> Even answers on the mailing list with more detail than "oh, that's
>> > > >>> easy" would help.  I would be happy to transcribe answers into the
>> > > >>> code if I could just get some.
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>> On Fri, Jul 10, 2015 at 11:04 AM, Jacques Nadeau <jacques@apache.org> wrote:
>> > > >>>
>> > > >>>> Creating an EasyFormatPlugin is pretty simple.  They were designed
>> > > >>>> to get rid of much of the scaffolding required for a standard
>> > > >>>> FormatPlugin.
>> > > >>>>
>> > > >>>> JSON
>> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json
>> > > >>>>
>> > > >>>> Text
>> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text
>> > > >>>>
>> > > >>>> AVRO
>> > > >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/avro
>> > > >>>>
>> > > >>>> In all cases, the connection code is pretty light.  A fully
>> > > >>>> schematized format like log-synth should be even simpler to
>> > > >>>> implement.
>> > > >>>>
>> > > >>>> On Fri, Jul 10, 2015 at 10:58 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > > >>>>
>> > > >>>>> I don't think we need a full-on storage plugin.  I think a data
>> > > >>>>> format should be sufficient, basically CSV on steroids.
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>>
>> > > >>>>> On Fri, Jul 10, 2015 at 10:47 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
>> > > >>>>>
>> > > >>>>>> Yeah, we still lack documentation on how to write a storage
>> > > >>>>>> plugin. One piece of advice I've been seeing a lot is to take a
>> > > >>>>>> look at the mongo-db plugin; it was basically added in one single
>> > > >>>>>> commit:
>> > > >>>>>> https://github.com/apache/drill/commit/2ca9c907bff639e08a561eac32e0acab3a0b3304
>> > > >>>>>>
>> > > >>>>>> I think this will give some general ideas on what to expect when
>> > > >>>>>> writing a storage plugin.
>> > > >>>>>>
>> > > >>>>>> On Fri, Jul 10, 2015 at 9:10 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > > >>>>>>
>> > > >>>>>>> Hakim,
>> > > >>>>>>>
>> > > >>>>>>> Not yet.  Still very much in the stage of gathering feedback.
>> > > >>>>>>>
>> > > >>>>>>> I would think it very simple.  The biggest obstacles are
>> > > >>>>>>>
>> > > >>>>>>> 1) no documentation on how to write a data format
>> > > >>>>>>>
>> > > >>>>>>> 2) I need to release a jar for log-synth to Maven Central.
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>>
>> > > >>>>>>> On Fri, Jul 10, 2015 at 8:17 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
>> > > >>>>>>>
>> > > >>>>>>>> @Ted, the log-synth storage format would be really useful. I'm
>> > > >>>>>>>> already seeing many unit tests that could benefit from this. Do
>> > > >>>>>>>> you have a github repo for your ongoing work?
>> > > >>>>>>>>
>> > > >>>>>>>> Thanks!
>> > > >>>>>>>>
>> > > >>>>>>>> On Thu, Jul 9, 2015 at 10:56 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > > >>>>>>>>
>> > > >>>>>>>>> Are you hard set on using common table expressions?
>> > > >>>>>>>>>
>> > > >>>>>>>>> I have discussed a bit off-list creating a data format that
>> > > >>>>>>>>> would allow tables to be read from a log-synth [1] schema.  That
>> > > >>>>>>>>> would let you read as much data as you might like with an
>> > > >>>>>>>>> arbitrarily complex (or simple) query.
>> > > >>>>>>>>>
>> > > >>>>>>>>> Operationally, you would create a file containing a log-synth
>> > > >>>>>>>>> schema that has the extension .synth.  Your data source would
>> > > >>>>>>>>> have to be configured to connect that extension with the
>> > > >>>>>>>>> log-synth format.  At that point, you could select as much or as
>> > > >>>>>>>>> little data as you like from the file and you would see
>> > > >>>>>>>>> generated data rather than the schema.
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>> [1] https://github.com/tdunning/log-synth
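>> > > >>>>>>>>>
>> > > >>>>>>>>> For a rough idea, a log-synth schema is a JSON list of field
>> > > >>>>>>>>> samplers along these lines (the field names here are made up,
>> > > >>>>>>>>> and the available sampler classes are documented in [1]):
>> > > >>>>>>>>>
>> > > >>>>>>>>>   [
>> > > >>>>>>>>>     {"name": "id",     "class": "id"},
>> > > >>>>>>>>>     {"name": "name",   "class": "name"},
>> > > >>>>>>>>>     {"name": "amount", "class": "int", "min": 0, "max": 1000}
>> > > >>>>>>>>>   ]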
>> > > >>>>>>>>>
>> > > >>>>>>>>> On Thu, Jul 9, 2015 at 11:31 AM, Alexander Zarei <alexanderz.simba@gmail.com> wrote:
>> > > >>>>>>>>>
>> > > >>>>>>>>>> Hi All,
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> I am trying to come up with a query which returns a given
>> > > >>>>>>>>>> number of rows without having a real table in storage.
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> I am hoping to achieve something like this:
>> > > >>>>>>>>>> http://stackoverflow.com/questions/6533524/sql-select-n-records-without-a-table
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> DECLARE @start INT = 1;
>> > > >>>>>>>>>> DECLARE @end INT = 1000000;
>> > > >>>>>>>>>> WITH numbers AS (
>> > > >>>>>>>>>>     SELECT @start AS number
>> > > >>>>>>>>>>     UNION ALL
>> > > >>>>>>>>>>     SELECT number + 1
>> > > >>>>>>>>>>     FROM numbers
>> > > >>>>>>>>>>     WHERE number < @end)
>> > > >>>>>>>>>> SELECT * FROM numbers
>> > > >>>>>>>>>> OPTION (MAXRECURSION 0);
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> I do not actually need to create different values; returning
>> > > >>>>>>>>>> identical rows would work too. I just need to bypass the "from
>> > > >>>>>>>>>> clause" in the query.
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> Thanks,
>> > > >>>>>>>>>> Alex
>> > > >>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>> --
>> > > >>>>>>>>
>> > > >>>>>>>> Abdelhakim Deneche
>> > > >>>>>>>>
>> > > >>>>>>>> Software Engineer
>> > > >>>>>>>>
>> > > >>>>>>>>  <http://www.mapr.com/>
>> > > >>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>> Now Available - Free Hadoop On-Demand Training
>> > > >>>>>>>> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>> --
>> > > >>>>>>
>> > > >>>>>> Abdelhakim Deneche
>> > > >>>>>>
>> > > >>>>>> Software Engineer
>> > > >>>>>>
>> > > >>>>>>  <http://www.mapr.com/>
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>> Now Available - Free Hadoop On-Demand Training
>> > > >>>>>> <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
>> > > >>
>> > >
>> >
>>
>
>
