spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: I'm pretty sure this is a Dataset bug
Date Fri, 27 May 2016 17:26:36 GMT
I tried the master branch:

scala> val testMapped = test.map(t => t.copy(id = t.id + 1))
testMapped: org.apache.spark.sql.Dataset[Test] = [id: int]

scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`' given input columns: [id];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)


Suggest logging a JIRA if there is none logged.
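
For a JIRA report, the snippets quoted below can be packaged as a standalone program. This is only a sketch under assumptions not stated in the thread: it assumes a Spark 2.x build where SparkSession is available, and the names SelfJoinRepro and trySelfJoin are made up for illustration. Whether the second join fails depends on the Spark version in use.

```scala
import org.apache.spark.sql.{AnalysisException, Dataset, SparkSession}
import scala.util.{Failure, Success, Try}

// Defined at top level so that Spark can derive an encoder for it.
case class Test(id: Int)

// Hypothetical object name, chosen for this sketch only.
object SelfJoinRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x, run locally with a single thread.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("self-join-after-map-repro")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical helper: self join on id, capturing any exception
    // raised while the join is analyzed or executed.
    def trySelfJoin(ds: Dataset[Test]): Try[Array[(Test, Test)]] =
      Try(ds.as("t1").joinWith(ds.as("t2"), $"t1.id" === $"t2.id").collect())

    val test = Seq(Test(1), Test(2), Test(3)).toDS()
    val testMapped = test.map(t => t.copy(id = t.id + 1))

    // Report the outcome for the plain Dataset and for the mapped one.
    for ((name, ds) <- Seq("test" -> test, "testMapped" -> testMapped)) {
      trySelfJoin(ds) match {
        case Success(rows) =>
          println(s"$name: self join returned ${rows.length} rows")
        case Failure(e: AnalysisException) =>
          println(s"$name: AnalysisException: ${e.getMessage}")
        case Failure(e) =>
          println(s"$name: unexpected failure: $e")
      }
    }

    spark.stop()
  }
}
```

Wrapping each join in Try lets a single run report both cases, so the output can be pasted into the ticket together with the Spark version.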

On Fri, May 27, 2016 at 10:19 AM, Tim Gautier <tim.gautier@gmail.com> wrote:

> Oops, screwed up my example. This is what it should be:
>
>     case class Test(id: Int)
>     val test = Seq(
>       Test(1),
>       Test(2),
>       Test(3)
>     ).toDS
>     test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
>     val testMapped = test.map(t => t.copy(id = t.id + 1))
>     testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
>
>
> On Fri, May 27, 2016 at 11:16 AM Tim Gautier <tim.gautier@gmail.com>
> wrote:
>
>> I figured out the trigger. Turns out it wasn't because I loaded it
>> from the database, it was because the first thing I do after loading is to
>> lower case all the strings. After a Dataset has been mapped, the resulting
>> Dataset can't be self joined. Here's a test case that illustrates the issue:
>>
>>     case class Test(id: Int)
>>     val test = Seq(
>>       Test(1),
>>       Test(2),
>>       Test(3)
>>     ).toDS
>>     test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show // <-- works fine
>>     val testMapped = test.map(_.id + 1) // add 1 to each
>>     testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show // <-- error
>>
>>
>> On Fri, May 27, 2016 at 10:44 AM Tim Gautier <tim.gautier@gmail.com>
>> wrote:
>>
>>> I stand corrected. I just created a test table with a single int field
>>> to test with and the Dataset loaded from that works with no issues. I'll
>>> see if I can track down exactly what the difference might be.
>>>
>>> On Fri, May 27, 2016 at 10:29 AM Tim Gautier <tim.gautier@gmail.com>
>>> wrote:
>>>
>>>> I'm using 1.6.1.
>>>>
>>>> I'm not sure what good fake data would do since it doesn't seem to have
>>>> anything to do with the data itself. It has to do with how the Dataset was
>>>> created. Both datasets have exactly the same data in them, but the one
>>>> created from a sql query fails where the one created from a Seq works. The
>>>> case class is just a few Option[Int] and Option[String] fields, nothing
>>>> special.
>>>>
>>>> Obviously there's some sort of difference between the two datasets, but
>>>> Spark tells me they're exactly the same type with exactly the same data, so
>>>> I couldn't create a test case without actually accessing a sql database.
>>>>
>>>> On Fri, May 27, 2016 at 10:15 AM Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>>> Which release of Spark are you using?
>>>>>
>>>>> Is it possible to come up with fake data that shows what you described?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <tim.gautier@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Unfortunately I can't show exactly the data I'm using, but this is
>>>>>> what I'm seeing:
>>>>>>
>>>>>> I have a case class 'Product' that represents a table in our
>>>>>> database. I load that data via
>>>>>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>>>>>> and register it in a temp table 'product'.
>>>>>>
>>>>>> For testing, I created a new Dataset that has only 3 records in it:
>>>>>>
>>>>>> val ts = sqlContext.sql("select * from product where product_catalog_id in (1, 2, 3)").as[Product]
>>>>>>
>>>>>> I also created another one using the same case class and data, but
>>>>>> from a sequence instead.
>>>>>>
>>>>>> val ds: Dataset[Product] = Seq(
>>>>>>       Product(Some(1), ...),
>>>>>>       Product(Some(2), ...),
>>>>>>       Product(Some(3), ...)
>>>>>>     ).toDS
>>>>>>
>>>>>> The spark shell tells me these are exactly the same type at this
>>>>>> point, but they don't behave the same.
>>>>>>
>>>>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>>>>>> $"ts2.product_catalog_id")
>>>>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>>>>>> $"ds2.product_catalog_id")
>>>>>>
>>>>>> Again, spark tells me these self joins return exactly the same type,
>>>>>> but when I do a .show on them, only the one created from a Seq works. The
>>>>>> one created by reading from the database throws this error:
>>>>>>
>>>>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>>>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>>>>>> ...];
>>>>>>
>>>>>> Is this a bug? Is there any way to make the Dataset loaded from a
>>>>>> table behave like the one created from a sequence?
>>>>>>
>>>>>> Thanks,
>>>>>> Tim
>>>>>>
>>>>>
>>>>>
