drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bob Rudis <...@rud.is>
Subject Re: New Drill R "dplyr" interface
Date Mon, 26 Dec 2016 15:22:53 GMT
I guess I read more into "ANSI SQL" than I should have. While it does
handle some complex SQL generated by dplyr (which you can see at
https://github.com/hrbrmstr/sergeant if you scroll down) there are
instances when if, say you try to filter and left join two different
tables it generates valid SQL that Drill dislikes. You can see the
dplyr that fails here: http://rpubs.com/hrbrmstr/drillex

dplyr generates the following SQL:

    SELECT *
           FROM ( SELECT *
                  FROM  dfs.tmp.`/in/c.parquet`
                  WHERE ( cyl  = 6.0 ) ) xpintqtlzw
           LEFT JOIN dfs.tmp.`/in/b.json`
           USING ( car ) ) fyxhnlrrca
    LIMIT 10

The error message (if one doesn't want to go to the rpubs URL) is:

    SYSTEM ERROR: DrillRuntimeException: Join only supports implicit
casts between
    1. Numeric data
    2. Varchar, Varbinary data
    3. Date, Timestamp data Left type: INT, Right type: VARCHAR.
    Add explicit casts to avoid this error

"car" is the same in both data sets as it's just different column
slices of the seminal mtcars data set but with a common `car` column.
It fails if both files are the same file format.

Also, when attempting to make a "data frame to parquet" function which
sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
stack exceptions and hangs (in the embedded console, localhost console
and REST POST query). Said operation is the only thing that really
makes sense for such a convenience function since one could just
`jsonlite::stream_out(...)` a data frame and CTAS from the console
otherwise (an R wrapper for that makes little sense IMO). The SparkR
`write.parquet()` actually calls a Java function to do the conversion.
Note that this isn't a complaint, Drill is at heart an EDA/BI
workhorse and while it provides the data format transformation
functions such a use of VALUES is a bit extreme.

On Sun, Dec 25, 2016 at 2:40 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> On Sun, Dec 25, 2016 at 11:04 AM, Bob Rudis <bob@rud.is> wrote:
>> Drill SQL seems to not be as "standard" as it claims (some
>> operations that are valid SQL after a series of dplyr transforms  shld
>> work but do not).
> Can you elaborate?  What operations?
> Also, Drill claims that it is as compatible as possible where it implements
> some function. It doesn't claim to be complete.

View raw message