spark-user mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Compute Median in Spark Dataframe
Date Tue, 02 Jun 2015 20:13:28 GMT
So for Column you need to pass in a Java function. I have some sample code
that does this, but it does terrible things to access Spark internals.

On Tuesday, June 2, 2015, Olivier Girardot <o.girardot@lateral-thoughts.com>
wrote:

> Nice to hear from you, Holden! I ended up trying exactly that (Column),
> but I may have done it wrong:
>
> In [5]: g.agg(Column("percentile(value, 0.5)"))
> Py4JError: An error occurred while calling o97.agg. Trace:
> py4j.Py4JException: Method agg([class java.lang.String, class
> scala.collection.immutable.Nil$]) does not exist
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>
> Any idea?
>
> Olivier.
> On Tue, Jun 2, 2015 at 18:02, Holden Karau <holden@pigscanfly.ca> wrote:
>
>> Not super easily: the GroupedData class uses a strToExpr function with a
>> pretty limited set of supported functions, so we can't pass in the name of
>> an arbitrary Hive UDAF (unless I'm missing something). We can instead
>> construct a Column with the expression you want and pass it in to
>> agg() that way (although you then need to call the Hive UDAF there). There
>> are classes in hiveUdfs.scala which expose Hive UDAFs as Spark SQL
>> AggregateExpressions, but they are private.
>>
>> On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot <
>> o.girardot@lateral-thoughts.com> wrote:
>>
>>> I've finally come to the same conclusion, but isn't there any way to
>>> call these Hive UDAFs from agg("percentile(key,0.5)")?
>>>
>>> On Tue, Jun 2, 2015 at 15:37, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
>>>
>>>> Like this (sqlContext should be a HiveContext instance):
>>>>
>>>> case class KeyValue(key: Int, value: String)
>>>> val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
>>>> df.registerTempTable("table")
>>>> sqlContext.sql("select percentile(key, 0.5) from table").show()
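For reference, Hive's percentile UDAF at 0.5 returns the interpolated median. A minimal pure-Python sketch of that computation (not Spark code, just the math the query above performs, assuming the usual linear-interpolation semantics: position p*(n-1) in the sorted data):

```python
def percentile(values, p):
    """Interpolated percentile: find position p*(n-1) in the sorted
    data and linearly interpolate between the two surrounding values."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return float(s[lo])
    return s[lo] + frac * (s[lo + 1] - s[lo])

# The query above runs over keys 1..50, so the median is 25.5
print(percentile(range(1, 51), 0.5))  # 25.5
```

For the 1..50 data in the example, position 0.5 * 49 = 24.5 falls between 25 and 26, giving 25.5.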
>>>>
>>>>
>>>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <
>>>> o.girardot@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>> Is there any way to compute a median on a column using Spark's
>>>>> DataFrame? I know you can use stats on an RDD, but I'd rather stay
>>>>> within a DataFrame.
>>>>> Hive seems to imply that, using ntile, one can compute percentiles,
>>>>> quartiles and therefore a median.
>>>>> Does anyone have experience with this?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
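On the ntile idea mentioned above: ntile(n) assigns each row, in sorted order, to one of n roughly equal buckets, so the boundaries between buckets approximate quantiles. A hedged pure-Python sketch of that bucketing (following standard SQL NTILE semantics, where the first len % n buckets get one extra row; not Spark or Hive code):

```python
def ntile(values, n):
    """Assign each value (in sorted order) to one of n buckets,
    mimicking SQL NTILE: the first (len % n) buckets get one extra row."""
    s = sorted(values)
    size, extra = divmod(len(s), n)
    out, i = [], 0
    for bucket in range(1, n + 1):
        count = size + (1 if bucket <= extra else 0)
        out += [(v, bucket) for v in s[i:i + count]]
        i += count
    return out

# ntile(4) over 1..8 splits the data into quartile buckets
buckets = ntile(range(1, 9), 4)
```

With buckets in hand, the median sits at the boundary between the second and third of four buckets; the percentile UDAF is more direct since it interpolates rather than just bucketing.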
>>>>>
>>>>
>>>>
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
