spark-user mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Is there a reduceByKey functionality in DataFrame API?
Date Thu, 11 Aug 2016 03:42:40 GMT
Hi Luis,

You might want to consider upgrading to Spark 2.0, but in Spark 1.6.2 you
can do groupBy on a Dataset followed by a reduce on the resulting
GroupedDataset (
http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset
). Despite the different name, reduce works on a per-key basis. In Spark 2.0
you would use groupByKey on the Dataset followed by reduceGroups (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.KeyValueGroupedDataset
).
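
As a rough illustration of the Spark 2.0 route, here is a minimal sketch. It
assumes a local SparkSession and a made-up case class Record(col1, col2)
standing in for the two columns in your example; neither the class nor the
sample rows come from your code, so adapt them to your schema. It keeps, for
each key, the record with the smallest col2:

```scala
import org.apache.spark.sql.SparkSession

object ReduceGroupsSketch {
  // Hypothetical record type standing in for the (col1, col2) columns in the question.
  case class Record(col1: String, col2: Int)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in a real job the master is set by the cluster.
    val spark = SparkSession.builder()
      .appName("reduceGroups-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val ds = Seq(Record("a", 3), Record("a", 1), Record("b", 7)).toDS()

    // groupByKey returns a KeyValueGroupedDataset; reduceGroups then applies
    // the binary function to the values of each key, yielding (key, reduced value) pairs.
    val minPerKey = ds
      .groupByKey(_.col1)
      .reduceGroups((x: Record, y: Record) => if (x.col2 <= y.col2) x else y)

    minPerKey.show()
    spark.stop()
  }
}
```

The parameter types on the lambda are spelled out because reduceGroups is
overloaded (Scala function and Java ReduceFunction), and explicit types avoid
inference trouble on some Scala versions. On 1.6.2 the same shape should apply
with groupBy and reduce on the Dataset in place of groupByKey and
reduceGroups, though I have not run this sketch against that version.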

Cheers,

Holden :)

On Wed, Aug 10, 2016 at 5:15 PM, luismattor <luismattor@gmail.com> wrote:

> Hi everyone,
>
> Consider the following code:
>
> val result = df.groupBy("col1").agg(min("col2"))
>
> I know that rdd.reduceByKey(func) produces the same RDD as
> rdd.groupByKey().mapValues(value => value.reduce(func)). However,
> reduceByKey is more efficient, as it avoids shipping each value to the
> reducer doing the aggregation (it ships partial aggregations instead).
>
> I wonder whether the DataFrame API optimizes this code by doing something
> similar to what RDD.reduceByKey does.
>
> I am using Spark 1.6.2.
>
> Regards,
> Luis


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
