spark-user mailing list archives

From Olivier Girardot <ssab...@gmail.com>
Subject Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
Date Fri, 15 May 2015 15:55:45 GMT
Yes, please do, and send me the link.
@rxin I'm having trouble building master, but the code is done...


On Fri, May 15, 2015 at 01:27, Haopu Wang <HWang@qilinsoft.com> wrote:

>  Thank you. Should I open a JIRA for this issue?
>
>
>  ------------------------------
>
> *From:* Olivier Girardot [mailto:ssaboum@gmail.com]
> *Sent:* Tuesday, May 12, 2015 5:12 AM
> *To:* Reynold Xin
> *Cc:* Haopu Wang; user
> *Subject:* Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
>
>
>
> I'll look into it - not sure yet what I can get out of exprs :p
>
>
>
> On Mon, May 11, 2015 at 22:35, Reynold Xin <rxin@databricks.com> wrote:
>
> Thanks for catching this. I didn't read carefully enough.
>
>
>
> It'd make sense to have the udaf result be non-nullable, if the exprs are
> indeed non-nullable.
>
>
>
> On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssaboum@gmail.com>
> wrote:
>
> Hi Haopu,
> Actually, `key` is nullable here because it is nullable in your input's schema:
>
> scala> result.printSchema
>
> root
> |-- key: string (nullable = true)
> |-- SUM(value): long (nullable = true)
>
> scala> df.printSchema
> root
> |-- key: string (nullable = true)
> |-- value: long (nullable = false)
>
>
>
> I tried it with a schema where the key is not flagged as nullable, and the
> schema is respected. What you can argue, however, is that SUM(value)
> should also be non-nullable, since value is not nullable.
>
>
>
> @rxin, do you think it would be reasonable to flag the Sum aggregation
> function as nullable (or not) depending on the input expression's schema?
>
>
>
> Regards,
>
>
>
> Olivier.
>
> On Mon, May 11, 2015 at 22:07, Reynold Xin <rxin@databricks.com> wrote:
>
> Not by design. Would you be interested in submitting a pull request?
>
>
>
> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <HWang@qilinsoft.com> wrote:
>
> I am trying to get the result schema of aggregate functions using the
> DataFrame API.
>
> However, I find that the result fields for groupBy columns are always
> nullable, even when the source field is not nullable.
>
> I would like to know whether this is by design. Thank you! Below is simple
> code that shows the issue.
>
> ======
>
>   import sqlContext.implicits._
>   import org.apache.spark.sql.functions._
>   case class Test(key: String, value: Long)
>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>
>   val result = df.groupBy("key").agg($"key", sum("value"))
>
>   // The output shows that the "key" column is nullable. Why?
>   result.printSchema
> //    root
> //     |-- key: string (nullable = true)
> //     |-- SUM(value): long (nullable = true)
>
>
>
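
The non-nullable key case described in the thread can be tried with an explicit schema. The snippet below is a minimal sketch, not code from the thread: it assumes a Spark 1.4-era spark-shell session in which `sc` and `sqlContext` are already defined, and the names `schema`, `rows`, `dfStrict` and `resultStrict` are made up for illustration.

```scala
// Assumes `sc` (SparkContext) and `sqlContext` (SQLContext) from spark-shell.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import sqlContext.implicits._

// Explicit schema: both columns are declared non-nullable.
val schema = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("value", LongType, nullable = false)))

// Same data as the original example, expressed as generic Rows.
val rows = sc.makeRDD(Seq(Row("k1", 2L), Row("k1", 1L)))
val dfStrict = sqlContext.createDataFrame(rows, schema)

// Same aggregation as in the original report.
val resultStrict = dfStrict.groupBy("key").agg($"key", sum("value"))

// Compare the nullable flags of the input and of the aggregated result.
dfStrict.printSchema
resultStrict.printSchema
```

According to the thread, the `key` flag declared in the input schema is carried through to the result; whether `SUM(value)` should also be reported as non-nullable is the open question discussed above.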
