spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Compute Median in Spark Dataframe
Date Thu, 04 Jun 2015 23:03:40 GMT
My current example doesn't use a Hive UDAF, but you would  do something
pretty similar (it calls a new user defined UDAF, and there are wrappers to
make Spark SQL UDAFs from Hive UDAFs but they are private). So this is
doable, but since it pokes at internals it will likely break between
versions of Spark. If you want to see the WIP PR I have with Sparkling
Pandas its at
https://github.com/sparklingpandas/sparklingpandas/pull/90/files . If your
doing this in JVM and just want to know how to wrap the Hive UDAF, you can
grep/look in sql/hive/ in Spark, but I'd encourage you to see if there is
another way to accomplish what you want (since poking at the internals is
kind of dangerous).

On Thu, Jun 4, 2015 at 6:28 AM, Deenar Toraskar <deenar.toraskar@gmail.com>
wrote:

> Hi Holden, Olivier
>
>
> >>So for column you need to pass in a Java function, I have some sample
> code which does this but it does terrible things to access Spark internals.
> I also need to call a Hive UDAF in a dataframe agg function. Are there any
> examples of what Column expects?
>
> Deenar
>
> On 2 June 2015 at 21:13, Holden Karau <holden@pigscanfly.ca> wrote:
>
>> So for column you need to pass in a Java function, I have some sample
>> code which does this but it does terrible things to access Spark internals.
>>
>>
>> On Tuesday, June 2, 2015, Olivier Girardot <
>> o.girardot@lateral-thoughts.com> wrote:
>>
>>> Nice to hear from you Holden ! I ended up trying exactly that (Column) -
>>> but I may have done it wrong :
>>>
>>> In [*5*]: g.agg(Column("percentile(value, 0.5)"))
>>> Py4JError: An error occurred while calling o97.agg. Trace:
>>> py4j.Py4JException: Method agg([class java.lang.String, class
>>> scala.collection.immutable.Nil$]) does not exist
>>> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>
>>> Any idea ?
>>>
>>> Olivier.
>>> Le mar. 2 juin 2015 à 18:02, Holden Karau <holden@pigscanfly.ca> a
>>> écrit :
>>>
>>>> Not super easily, the GroupedData class uses a strToExpr function which
>>>> has a pretty limited set of functions so we cant pass in the name of an
>>>> arbitrary hive UDAF (unless I'm missing something). We can instead
>>>> construct an column with the expression you want and then pass it in to
>>>> agg() that way (although then you need to call the hive UDAF there). There
>>>> are some private classes in hiveUdfs.scala which expose hiveUdaf's as Spark
>>>> SQL AggregateExpressions, but they are private.
>>>>
>>>> On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot <
>>>> o.girardot@lateral-thoughts.com> wrote:
>>>>
>>>>> I've finally come to the same conclusion, but isn't there any way to
>>>>> call this Hive UDAFs from the agg("percentile(key,0.5)") ??
>>>>>
>>>>> Le mar. 2 juin 2015 à 15:37, Yana Kadiyska <yana.kadiyska@gmail.com>
>>>>> a écrit :
>>>>>
>>>>>> Like this...sqlContext should be a HiveContext instance
>>>>>>
>>>>>> case class KeyValue(key: Int, value: String)
>>>>>> val df=sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).toDF
>>>>>> df.registerTempTable("table")
>>>>>> sqlContext.sql("select percentile(key,0.5) from table").show()
>>>>>>
>>>>>> ​
>>>>>>
>>>>>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <
>>>>>> o.girardot@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>> Is there any way to compute a median on a column using Spark's
>>>>>>> Dataframe. I know you can use stats in a RDD but I'd rather stay
within a
>>>>>>> dataframe.
>>>>>>> Hive seems to imply that using ntile one can compute percentiles,
>>>>>>> quartiles and therefore a median.
>>>>>>> Does anyone have experience with this ?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier.
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Linked In: https://www.linkedin.com/in/holdenkarau
>>>>
>>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>> Linked In: https://www.linkedin.com/in/holdenkarau
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau

Mime
View raw message