spark-user mailing list archives

From Tim Gautier <tim.gaut...@gmail.com>
Subject Re: I'm pretty sure this is a Dataset bug
Date Fri, 27 May 2016 16:44:02 GMT
I stand corrected. I just created a test table with a single int field, and
the Dataset loaded from it works with no issues. I'll see if I can track
down exactly what the difference might be.
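
For reference, here's a minimal sketch of the kind of test I mean. The table
name, column name, and connection options are placeholders, and this assumes
the spark-shell (so sqlContext.implicits._ is already in scope):

case class Single(id: Option[Int])

val single = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:...",            // placeholder connection string
    "dbtable" -> "single_int_table" // placeholder: a table with one int column
  ))
  .load()
  .as[Single]

// This self-join resolves fine against the single-int-field table:
single.as("s1").joinWith(single.as("s2"), $"s1.id" === $"s2.id").show()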

On Fri, May 27, 2016 at 10:29 AM Tim Gautier <tim.gautier@gmail.com> wrote:

> I'm using 1.6.1.
>
> I'm not sure what good fake data would do, since the problem doesn't seem
> to have anything to do with the data itself. It has to do with how the
> Dataset was created. Both Datasets have exactly the same data in them, but
> the one created from a SQL query fails while the one created from a Seq
> works. The case class is just a few Option[Int] and Option[String] fields,
> nothing special.
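>
> Something along these lines (the field names here are invented for
> illustration; only product_catalog_id matches the real table):
>
> case class Product(
>   product_catalog_id: Option[Int],
>   name: Option[String],          // hypothetical field
>   description: Option[String]    // hypothetical field
> )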
>
> Obviously there's some sort of difference between the two Datasets, but
> Spark tells me they're exactly the same type with exactly the same data, so
> I couldn't create a test case without actually accessing a SQL database.
>
> On Fri, May 27, 2016 at 10:15 AM Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Which release of Spark are you using?
>>
>> Is it possible to come up with fake data that shows what you described?
>>
>> Thanks
>>
>> On Fri, May 27, 2016 at 8:24 AM, Tim Gautier <tim.gautier@gmail.com>
>> wrote:
>>
>>> Unfortunately I can't show exactly the data I'm using, but this is what
>>> I'm seeing:
>>>
>>> I have a case class 'Product' that represents a table in our database. I
>>> load that data via sqlContext.read.format("jdbc").options(...).load.as[Product]
>>> and register it as a temp table, 'product'.
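>>>
>>> Roughly like this (the connection options are placeholders here, and the
>>> DataFrame is split out so it can be registered as a temp table):
>>>
>>> val df = sqlContext.read.format("jdbc")
>>>   .options(Map(
>>>     "url" -> "jdbc:...",   // placeholder connection string
>>>     "dbtable" -> "product"
>>>   ))
>>>   .load()
>>> df.registerTempTable("product")
>>> val products = df.as[Product] // the Dataset view of the same table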
>>>
>>> For testing, I created a new Dataset that has only 3 records in it:
>>>
>>> val ts = sqlContext.sql(
>>>   "select * from product where product_catalog_id in (1, 2, 3)"
>>> ).as[Product]
>>>
>>> I also created another one using the same case class and data, but from
>>> a sequence instead.
>>>
>>> val ds: Dataset[Product] = Seq(
>>>       Product(Some(1), ...),
>>>       Product(Some(2), ...),
>>>       Product(Some(3), ...)
>>>     ).toDS
>>>
>>> The Spark shell tells me these are exactly the same type at this point,
>>> but they don't behave the same.
>>>
>>> ts.as("ts1").joinWith(ts.as("ts2"),
>>>   $"ts1.product_catalog_id" === $"ts2.product_catalog_id")
>>> ds.as("ds1").joinWith(ds.as("ds2"),
>>>   $"ds1.product_catalog_id" === $"ds2.product_catalog_id")
>>>
>>> Again, Spark tells me these self-joins return exactly the same type, but
>>> when I do a .show on them, only the one created from a Seq works. The one
>>> created by reading from the database throws this error:
>>>
>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>> 'ts1.product_catalog_id' given input columns: [..., product_catalog_id,
>>> ...];
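>>>
>>> For what it's worth, the analyzed plans of the two Datasets can be
>>> compared with the standard explain output, which might show where they
>>> diverge; something like:
>>>
>>> ts.toDF.explain(true) // prints analyzed, optimized, and physical plans
>>> ds.toDF.explain(true)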
>>>
>>> Is this a bug? Is there any way to make the Dataset loaded from a table
>>> behave like the one created from a sequence?
>>>
>>> Thanks,
>>> Tim
>>>
>>
>>
