spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Hillegas <rhil...@us.ibm.com>
Subject Re: column identifiers in Spark SQL
Date Tue, 22 Sep 2015 23:05:14 GMT

Thanks for that additional tip, Michael. Backticks fix the problem query in
which an identifier was transformed into a string literal. So this works
now...

  // now correctly resolves the unnormalized column id
  sqlContext.sql("""select `b` from test_data""").show

Any suggestion about how to escape an embedded double quote?

  // java.sql.SQLSyntaxErrorException: Syntax error: Encountered "\"" at
line 1, column 12.
  sqlContext.sql("""select `c"d` from test_data""").show

  // org.apache.spark.sql.AnalysisException: cannot resolve 'c\"d' given
input columns A, b, c"d; line 1 pos 7
  sqlContext.sql("""select `c\"d` from test_data""").show

Thanks,
-Rick

Michael Armbrust <michael@databricks.com> wrote on 09/22/2015 01:16:12 PM:

> From: Michael Armbrust <michael@databricks.com>
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev <dev@spark.apache.org>
> Date: 09/22/2015 01:16 PM
> Subject: Re: column identifiers in Spark SQL
>
> HiveQL uses `backticks` for quoted identifiers.
>
> On Tue, Sep 22, 2015 at 1:06 PM, Richard Hillegas <rhilleg@us.ibm.com>
wrote:
> Thanks for that tip, Michael. I think that my sqlContext was a raw
> SQLContext originally. I have rebuilt Spark like so...
>
>   sbt/sbt -Phive assembly/assembly
>
> Now I see that my sqlContext is a HiveContext. That fixes one of the
> queries. Now unnormalized column names work:
>
>   // ...unnormalized column names work now
>   sqlContext.sql("""select a from test_data""").show
>
> However, quoted identifiers are still treated as string literals:
>
>   // this still returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> And embedded quotes inside quoted identifiers are swallowed up:
>
>   // this now returns rows consisting of the string literal "cd"
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> Thanks,
> -Rick
>
> Michael Armbrust <michael@databricks.com> wrote on 09/22/2015 10:58:36
AM:
>
> > From: Michael Armbrust <michael@databricks.com>
> > To: Richard Hillegas/San Francisco/IBM@IBMUS
> > Cc: Dev <dev@spark.apache.org>
> > Date: 09/22/2015 10:59 AM
> > Subject: Re: column identifiers in Spark SQL
>
> >
> > Are you using a SQLContext or a HiveContext?  The programming guide
> > suggests the latter, as the former is really only there because some
> > applications may have conflicts with Hive dependencies.  SQLContext
> > is case sensitive by default where as the HiveContext is not.  The
> > parser in HiveContext is also a lot better.
> >
> > On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas <rhilleg@us.ibm.com
> > wrote:
> > I am puzzled by the behavior of column identifiers in Spark SQL. I
> > don't find any guidance in the "Spark SQL and DataFrame Guide" at
> > http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> > seeing odd behavior related to case-sensitivity and to delimited
> > (quoted) identifiers.
> >
> > Consider the following declaration of a table in the Derby
> > relational database, whose dialect hews closely to the SQL Standard:
> >
> >    create table app.t( a int, "b" int, "c""d" int );
> >
> > Now let's load that table into Spark like this:
> >
> >   import org.apache.spark.sql._
> >   import org.apache.spark.sql.types._
> >
> >   val df = sqlContext.read.format("jdbc").options(
> >     Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> >     "dbtable" -> "app.t")).load()
> >   df.registerTempTable("test_data")
> >
> > The following query runs fine because the column name matches the
> > normalized form in which it is stored in the metadata catalogs of
> > the relational database:
> >
> >   // normalized column names are recognized
> >   sqlContext.sql(s"""select A from test_data""").show
> >
> > But the following query fails during name resolution. This puzzles
> > me because non-delimited identifiers are case-insensitive in the
> > ANSI/ISO Standard. They are also supposed to be case-insensitive in
> > HiveQL, at least according to section 2.3.1 of the
> > QuotedIdentifier.html webpage attached to https://issues.apache.org/
> > jira/browse/HIVE-6013:
> >
> >   // ...unnormalized column names raise this error:
> > org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> > input columns A, b, c"d;
> >   sqlContext.sql("""select a from test_data""").show
> >
> > Delimited (quoted) identifiers are treated as string literals.
> > Again, non-Standard behavior:
> >
> >   // this returns rows consisting of the string literal "b"
> >   sqlContext.sql("""select "b" from test_data""").show
> >
> > Embedded quotes in delimited identifiers won't even parse:
> >
> >   // embedded quotes raise this error: java.lang.RuntimeException:
> > [1.11] failure: ``union'' expected but "d" found
> >   sqlContext.sql("""select "c""d" from test_data""").show
> >
> > This behavior is non-Standard and it strikes me as hard to describe
> > to users concisely. Would the community support an effort to bring
> > the handling of column identifiers into closer conformance with the
> > Standard? Would backward compatibility concerns even allow us to do
that?
> >
> > Thanks,
> > -Rick
Mime
View raw message