spark-user mailing list archives

From Ali Tajeldin EDU <alitedu1...@gmail.com>
Subject Re: Problem of RDD in calculation
Date Sat, 17 Oct 2015 01:22:49 GMT
Since DF2 only has the userID, I'm assuming you are using DF2 to filter for the desired userIDs.
You can just use the join() and groupBy() operations on DataFrame to do what you want.  For
example:

scala> val df1=app.createDF("id:String; v:Integer", "X,1;X,2;Y,3;Y,4;Z,10")
df1: org.apache.spark.sql.DataFrame = [id: string, v: int]

scala> df1.show
+---+---+
| id|  v|
+---+---+
|  X|  1|
|  X|  2|
|  Y|  3|
|  Y|  4|
|  Z| 10|
+---+---+

scala> val df2=app.createDF("id:String", "X;Y")
df2: org.apache.spark.sql.DataFrame = [id: string]

scala> df2.show
+---+
| id|
+---+
|  X|
|  Y|
+---+

scala> df1.join(df2, "id").groupBy("id").agg(avg("v") as "avg_v", min("v") as "min_v").show
+---+-----+-----+
| id|avg_v|min_v|
+---+-----+-----+
|  X|  1.5|    1|
|  Y|  3.5|    3|
+---+-----+-----+


Notes:
* The above uses the createDF method of SmvApp from the SMV package, but the rest of the code is
just standard Spark DataFrame ops.
* One advantage of doing this with the DataFrame API rather than SQL is that you can build the
aggregation expressions programmatically (e.g. imagine doing this for 100 columns instead of 2).
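To tie that last note back to the original question (avg, max, and min of both dataUsage and duration per userID), here is a minimal sketch of building the expressions in a loop. The toy data and the `u1`/`u2` IDs are made up for illustration; it assumes a modern spark-shell where `spark.implicits._` is in scope (on Spark 1.x this was `sqlContext.implicits._`):

```scala
import org.apache.spark.sql.functions.{avg, max, min, col}
import spark.implicits._  // gives .toDF on local Seqs in spark-shell

// Toy stand-ins for the asker's DataFrames (values are made up).
val df1 = Seq(("u1", 100, 5), ("u1", 300, 15), ("u2", 50, 2))
  .toDF("userID", "dataUsage", "duration")
val df2 = Seq("u1", "u2").toDF("userID")

// Build avg/max/min expressions for every metric column programmatically,
// so the same code works for 2 columns or 100.
val metrics = Seq("dataUsage", "duration")
val aggs = metrics.flatMap { c =>
  Seq(avg(col(c)).as(s"avg_$c"),
      max(col(c)).as(s"max_$c"),
      min(col(c)).as(s"min_$c"))
}

// agg takes a first Column plus varargs, hence the head/tail split.
val result = df1.join(df2, "userID").groupBy("userID")
  .agg(aggs.head, aggs.tail: _*)
result.show()
```

The result is itself a DataFrame, so it can be stored or processed further as usual.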

---
Ali


On Oct 16, 2015, at 12:10 PM, ChengBo <Cheng.Bo@huawei.com> wrote:

> Hi all,
>  
> I am new in Spark, and I have a question in dealing with RDD.
> I’ve converted the RDDs to DataFrames, so there are two DataFrames: DF1 and DF2.
> DF1 contains: userID, time, dataUsage, duration
> DF2 contains: userID
>  
> Each userID has multiple rows in DF1.
> DF2 has distinct userIDs, and I would like to compute the average, max, and min values of
> both dataUsage and duration for each userID in DF1, and store the results in a new DataFrame.
> How can I do that?
> Thanks a lot.
>  
> Best
> Frank

