spark-user mailing list archives

From Olivier Girardot <ssab...@gmail.com>
Subject Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Date Mon, 18 May 2015 19:39:32 GMT
The PR is open: https://github.com/apache/spark/pull/6237

On Fri, May 15, 2015 at 17:55, Olivier Girardot <ssaboum@gmail.com> wrote:

> Yes, please do, and send me the link.
> @rxin I'm having trouble building master, but the code is done...
>
>
> On Fri, May 15, 2015 at 01:27, Haopu Wang <HWang@qilinsoft.com> wrote:
>
>>  Thank you. Should I open a JIRA for this issue?
>>
>>
>>  ------------------------------
>>
>> *From:* Olivier Girardot [mailto:ssaboum@gmail.com]
>> *Sent:* Tuesday, May 12, 2015 5:12 AM
>> *To:* Reynold Xin
>> *Cc:* Haopu Wang; user
>> *Subject:* Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
>>
>>
>>
>> I'll look into it - not sure yet what I can get out of exprs :p
>>
>>
>>
>> On Mon, May 11, 2015 at 22:35, Reynold Xin <rxin@databricks.com> wrote:
>>
>> Thanks for catching this. I didn't read carefully enough.
>>
>>
>>
>> It'd make sense to have the udaf result be non-nullable, if the exprs are
>> indeed non-nullable.
>>
>>
>>
>> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssaboum@gmail.com>
>> wrote:
>>
>> Hi Haopu,
>> Actually, `key` is nullable here because it is nullable in your input's schema:
>>
>> scala> result.printSchema
>>
>> root
>> |-- key: string (nullable = true)
>> |-- SUM(value): long (nullable = true)
>>
>> scala> df.printSchema
>> root
>> |-- key: string (nullable = true)
>> |-- value: long (nullable = false)
>>
>>
>>
>> I tried it with a schema where the key is not flagged as nullable, and
>> the schema is respected. What you can argue, however, is that
>> SUM(value) should also be non-nullable, since value is not nullable.
>>
>>
>>
>> @rxin do you think it would be reasonable to flag the Sum aggregation
>> function as nullable (or not) depending on the input expression's schema?
>>
>>
>>
>> Regards,
>>
>>
>>
>> Olivier.
>>
>> On Mon, May 11, 2015 at 22:07, Reynold Xin <rxin@databricks.com> wrote:
>>
>> Not by design. Would you be interested in submitting a pull request?
>>
>>
>>
>> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <HWang@qilinsoft.com> wrote:
>>
>> I am trying to get the result schema of aggregate functions using the
>> DataFrame API.
>>
>> However, I find that the result fields for groupBy columns are always
>> nullable, even when the source field is not nullable.
>>
>> I would like to know whether this is by design. Thank you! Below is
>> simple code that shows the issue.
>>
>> ======
>>
>>   import sqlContext.implicits._
>>   import org.apache.spark.sql.functions._
>>   case class Test(key: String, value: Long)
>>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>>
>>   val result = df.groupBy("key").agg($"key", sum("value"))
>>
>>   // From the output, you can see the "key" column is nullable, why??
>>   result.printSchema
>> //    root
>> //     |-- key: string (nullable = true)
>> //     |-- SUM(value): long (nullable = true)
>>
>>
>
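
For reference, the check Olivier describes in the quoted thread (a key column that is not flagged as nullable) can be reproduced with an explicit schema. This is a minimal sketch, assuming a spark-shell session on Spark 1.3 or later where `sc` and `sqlContext` are predefined; the names `schema`, `rows`, `strictDf` and `strictResult` are illustrative and do not come from the thread.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.sum

// Explicit schema: both columns are declared non-nullable (an assumption made
// for this sketch; the thread's case class produced a nullable String key).
val schema = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("value", LongType, nullable = false)))

// Same two rows as in the original report.
val rows = sc.parallelize(Seq(Row("k1", 2L), Row("k1", 1L)))

// createDataFrame(RDD[Row], StructType) applies the schema as declared.
val strictDf = sqlContext.createDataFrame(rows, schema)
strictDf.printSchema

// Group and aggregate, then inspect how nullability is carried into the result.
val strictResult = strictDf.groupBy("key").agg(sum("value"))
strictResult.printSchema
```

Comparing the two printed schemas shows which nullability flags survive the aggregation on the Spark version in use.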

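
The rule proposed in the thread, that Sum should be nullable only when its input expression is nullable, can be illustrated with a toy model. This is a sketch only: `Expr`, `Column` and `Sum` below are hypothetical, simplified stand-ins written for this example. They are not Spark's Catalyst classes, and nothing here describes what the linked PR actually changes.

```scala
// Hypothetical expression model: every node reports whether it may yield null.
sealed trait Expr { def nullable: Boolean }

// A leaf column whose nullability comes straight from the input schema.
final case class Column(name: String, nullable: Boolean) extends Expr

// Toy Sum aggregate following the rule discussed in the thread:
// it inherits nullability from its child expression instead of always being nullable.
final case class Sum(child: Expr) extends Expr {
  def nullable: Boolean = child.nullable
}

val value = Column("value", nullable = false)           // like the thread's Long field
val optionalValue = Column("optionalValue", nullable = true)

println(Sum(value).nullable)
println(Sum(optionalValue).nullable)
```

One caveat a real implementation would have to handle: in SQL, SUM over zero rows yields NULL, so inheriting the child's flag is only safe where every group is guaranteed to contain at least one row, as is the case for groups produced by groupBy.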