drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Paris <nipari...@gmail.com>
Subject Re: New Drill R "dplyr" interface
Date Mon, 26 Dec 2016 20:30:50 GMT
2016-12-26 16:22 GMT+01:00 Bob Rudis <bob@rud.is>:

> I guess I read more into "ANSI SQL" than I should have. While it does
> handle some complex SQL generated by dplyr (which you can see at
> https://github.com/hrbrmstr/sergeant if you scroll down) there are
> instances when if, say you try to filter and left join two different
> tables it generates valid SQL that Drill dislikes. You can see the
> dplyr that fails here: http://rpubs.com/hrbrmstr/drillex
>
> dplyr generates the following SQL:
>
>     SELECT *
>     FROM ( SELECT *
>            FROM ( SELECT *
>                   FROM  dfs.tmp.`/in/c.parquet`
>                   WHERE ( cyl  = 6.0 ) ) xpintqtlzw
>            LEFT JOIN dfs.tmp.`/in/b.json`
>            USING ( car ) ) fyxhnlrrca
>     LIMIT 10
>
> The error message (if one doesn't want to go to the rpubs URL) is:
>
>     SYSTEM ERROR: DrillRuntimeException: Join only supports implicit
> casts between
>     1. Numeric data
>     2. Varchar, Varbinary data
>     3. Date, Timestamp data Left type: INT, Right type: VARCHAR.
>     Add explicit casts to avoid this error
>
> "car" is the same in both data sets as it's just different column
> slices of the seminal mtcars data set but with a common `car` column.
> It fails if both files are the same file format.
>
> Also, when attempting to make a "data frame to parquet" function which
> sends a CTAS query with > 1000 (VALUES((1,2,3,...)) rows Drill tosses
> stack exceptions and hangs (in the embedded console, localhost console
> and REST POST query). Said operation is the only thing that really
> makes sense for such a convenience function since one could just
> `jsonlite::stream_out(...)` a data frame and CTAS from the console
> otherwise (an R wrapper for that makes little sense IMO). The SparkR
> `write.parquet()` actually calls a Java function to do the conversion.
> Note that this isn't a complaint, Drill is at heart an EDA/BI
> workhorse and while it provides the data format transformation
> functions such a use of VALUES is a bit extreme.
>

​In the meantime, is the "VALUES syntax" a good way ? I guess the planner
needs to read all the SQL before writing data. Maybe data-frame -> CSV ->
CTAS FROM csv is a better way in any cases ?
​


>
> On Sun, Dec 25, 2016 at 2:40 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > On Sun, Dec 25, 2016 at 11:04 AM, Bob Rudis <bob@rud.is> wrote:
> >
> >> Drill SQL seems to not be as "standard" as it claims (some
> >> operations that are valid SQL after a series of dplyr transforms  shld
> >> work but do not).
> >>
> >
> > Can you elaborate?  What operations?
> >
> > Also, Drill claims that it is as compatible as possible where it
> implements
> > some function. It doesn't claim to be complete.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message