The PR is open: https://github.com/apache/spark/pull/6237

On Fri, May 15, 2015 at 5:55 PM, Olivier Girardot <ssaboum@gmail.com> wrote:
Yes, please do, and send me the link.
@rxin I'm having trouble building master, but the code is done...


On Fri, May 15, 2015 at 1:27 AM, Haopu Wang <HWang@qilinsoft.com> wrote:

Thank you. Should I open a JIRA for this issue?

 


From: Olivier Girardot [mailto:ssaboum@gmail.com]
Sent: Tuesday, May 12, 2015 5:12 AM
To: Reynold Xin
Cc: Haopu Wang; user
Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

 

I'll look into it - not sure yet what I can get out of exprs :p 

 

On Mon, May 11, 2015 at 10:35 PM, Reynold Xin <rxin@databricks.com> wrote:

Thanks for catching this. I didn't read carefully enough.

 

It'd make sense to have the udaf result be non-nullable, if the exprs are indeed non-nullable.

 

On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssaboum@gmail.com> wrote:

Hi Haopu,
Actually, `key` is nullable here because it is nullable in your input's schema:

scala> result.printSchema

root
|-- key: string (nullable = true)
|-- SUM(value): long (nullable = true)

scala> df.printSchema
root
|-- key: string (nullable = true)
|-- value: long (nullable = false)

 

I tried it with a schema in which the key is not flagged as nullable, and the result schema respects that. What you can argue, however, is that SUM(value) should also be non-nullable, since value is not nullable.

 

@rxin, do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the nullability of its input expression?
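
To make the idea concrete, here is a minimal sketch in plain Scala. It is a toy model only: `Expr`, `Column`, and `Sum` below are hypothetical types written for illustration, not Spark's Catalyst classes, and the real change may look quite different.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
sealed trait Expr {
  def nullable: Boolean
}

// A column reference carrying the nullability declared in the input schema.
final case class Column(name: String, nullable: Boolean) extends Expr

// SUM whose nullability is derived from its child expression
// instead of being hard-coded to true.
final case class Sum(child: Expr) extends Expr {
  def nullable: Boolean = child.nullable
}

object NullabilitySketch {
  def main(args: Array[String]): Unit = {
    val value = Column("value", nullable = false)
    val key = Column("key", nullable = true)
    println(s"SUM(value) nullable = ${Sum(value).nullable}")
    println(s"SUM(key) nullable = ${Sum(key).nullable}")
  }
}
```

One caveat with this simple rule: in SQL, SUM over an empty input returns NULL, so an actual implementation would probably need to account for aggregations that can see zero rows.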

 

Regards, 

 

Olivier.

On Mon, May 11, 2015 at 10:07 PM, Reynold Xin <rxin@databricks.com> wrote:

Not by design. Would you be interested in submitting a pull request?

 

On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <HWang@qilinsoft.com> wrote:

I am trying to get the result schema of aggregate functions using the
DataFrame API.

However, I find that the result fields for groupBy columns are always
nullable, even when the source field is not nullable.

I want to know if this is by design. Thank you! Below is some simple code
that shows the issue.

======

  import sqlContext.implicits._
  import org.apache.spark.sql.functions._
  case class Test(key: String, value: Long)
  val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF

  val result = df.groupBy("key").agg($"key", sum("value"))

  // From the output, you can see that the "key" column is nullable. Why?
  result.printSchema
  // root
  //  |-- key: string (nullable = true)
  //  |-- SUM(value): long (nullable = true)

