spark-user mailing list archives

From Tim Gautier <tim.gaut...@gmail.com>
Subject Re: I'm pretty sure this is a Dataset bug
Date Fri, 27 May 2016 17:16:39 GMT
I figured out the trigger. It turns out it wasn't because I loaded the data
from the database; it was because the first thing I do after loading is
lowercase all the strings. After a Dataset has been mapped, the resulting
Dataset can't be self-joined. Here's a test case that illustrates the issue:

    case class Test(id: Int)
    val test = Seq(
      Test(1),
      Test(2),
      Test(3)
    ).toDS
    // Self join on the original Dataset: works fine
    test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
    // Add 1 to each id, keeping the Test type so the id column still exists
    val testMapped = test.map(t => t.copy(id = t.id + 1))
    // Self join on the mapped Dataset: error
    testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
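
One way to dig into the difference is to compare what Spark reports for the
two Datasets. The following is a minimal sketch, assuming Spark 1.6.x run as a
standalone application with a local master. The object name
SelfJoinDiagnostics, the helper build, and the app name are hypothetical, and
the map keeps the Test type so that an id column exists on both sides. It
prints the schema and the extended plan for each aliased Dataset, which is
where any difference in how t1.id resolves would be expected to show up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Dataset, SQLContext}

// Defined at the top level so that Spark can derive an encoder for it.
case class Test(id: Int)

// Hypothetical helper object, not part of Spark.
object SelfJoinDiagnostics {

  // Builds the plain Dataset and a mapped copy of it.
  def build(sqlContext: SQLContext): (Dataset[Test], Dataset[Test]) = {
    import sqlContext.implicits._
    val test = Seq(Test(1), Test(2), Test(3)).toDS()
    // Assumption: the mapped Dataset keeps the Test type.
    val testMapped = test.map(t => t.copy(id = t.id + 1))
    (test, testMapped)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("self-join-diagnostics")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val (test, testMapped) = build(sqlContext)

    // The schema as reported for each Dataset.
    test.printSchema()
    testMapped.printSchema()

    // Extended plans for the aliased Datasets used in the self join.
    test.as("t1").explain(true)
    testMapped.as("t1").explain(true)

    sc.stop()
  }
}
```

Comparing the two explain outputs side by side should make it easier to
attach concrete plan details to a bug report.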


On Fri, May 27, 2016 at 10:44 AM Tim Gautier <tim.gautier@gmail.com> wrote:

> I stand corrected. I just created a test table with a single int field,
> and the Dataset loaded from it works with no issues. I'll see if I can
> track down exactly what the difference might be.
>
> On Fri, May 27, 2016 at 10:29 AM Tim Gautier <tim.gautier@gmail.com>
> wrote:
>
>> I'm using 1.6.1.
>>
>> I'm not sure what good fake data would do since it doesn't seem to have
>> anything to do with the data itself. It has to do with how the Dataset was
>> created. Both datasets have exactly the same data in them, but the one
>> created from a sql query fails where the one created from a Seq works. The
>> case class is just a few Option[Int] and Option[String] fields, nothing
>> special.
>>
>> Obviously there's some sort of difference between the two datasets, but
>> Spark tells me they're exactly the same type with exactly the same data, so
>> I couldn't create a test case without actually accessing a sql database.
>>
>> On Fri, May 27, 2016 at 10:15 AM Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>> Which release of Spark are you using?
>>>
>>> Is it possible to come up with fake data that shows what you described?
>>>
>>> Thanks
>>>
>>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <tim.gautier@gmail.com>
>>> wrote:
>>>
>>>> Unfortunately I can't show exactly the data I'm using, but this is what
>>>> I'm seeing:
>>>>
>>>> I have a case class 'Product' that represents a table in our database.
>>>> I load that data via
>>>> sqlContext.read.format("jdbc").options(...).load.as[Product]
>>>> and register it in a temp table 'product'.
>>>>
>>>> For testing, I created a new Dataset that has only 3 records in it:
>>>>
>>>> val ts = sqlContext
>>>>   .sql("select * from product where product_catalog_id in (1, 2, 3)")
>>>>   .as[Product]
>>>>
>>>> I also created another one using the same case class and data, but from
>>>> a sequence instead.
>>>>
>>>> val ds: Dataset[Product] = Seq(
>>>>       Product(Some(1), ...),
>>>>       Product(Some(2), ...),
>>>>       Product(Some(3), ...)
>>>>     ).toDS
>>>>
>>>> The spark shell tells me these are exactly the same type at this point,
>>>> but they don't behave the same.
>>>>
>>>> ts.as("ts1").joinWith(ts.as("ts2"), $"ts1.product_catalog_id" ===
>>>> $"ts2.product_catalog_id")
>>>> ds.as("ds1").joinWith(ds.as("ds2"), $"ds1.product_catalog_id" ===
>>>> $"ds2.product_catalog_id")
>>>>
>>>> Again, spark tells me these self joins return exactly the same type,
>>>> but when I do a .show on them, only the one created from a Seq works. The
>>>> one created by reading from the database throws this error:
>>>>
>>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>>>> ...];
>>>>
>>>> Is this a bug? Is there any way to make the Dataset loaded from a table
>>>> behave like the one created from a sequence?
>>>>
>>>> Thanks,
>>>> Tim
>>>>
>>>
>>>
