spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: SQL query in scala API
Date Thu, 04 Dec 2014 04:08:52 GMT
You may do this:

    table("users").groupBy('zip)('zip, count('user), countDistinct('user))

On 12/4/14 8:47 AM, Arun Luthra wrote:

> I'm wondering how to do this kind of SQL query with PairRDDFunctions.
>
> SELECT zip, COUNT(user), COUNT(DISTINCT user)
> FROM users
> GROUP BY zip
>
> In the Spark Scala API, I can make an RDD (called "users") of
> key-value pairs where the keys are zip (as in ZIP code) and the values
> are user IDs. Then I can compute the count and distinct count like this:
>
> val count = users.mapValues(_ => 1).reduceByKey(_ + _)
> val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)
>
> Then, if I want count and countDistinct in the same table, I have to 
> join them on the key.
>
> Is there a way to do this without doing a join (and without using SQL 
> or spark SQL)?
>
> Arun
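
For the join-free, RDD-only variant asked about above, one possible sketch uses aggregateByKey from PairRDDFunctions to carry a row count and a set of seen user IDs per zip in a single pass. This is only an illustration, not something proposed in the thread: it assumes Spark 1.1 or later (where aggregateByKey is available), non-null String user IDs, and that the distinct IDs for any one zip fit in memory. The names ZipCounts, Acc, zero, addUser and merge, and the sample data, are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark 1.2 and earlier

object ZipCounts {
  // Per-zip accumulator: (rows seen so far, distinct user ids seen so far).
  type Acc = (Long, Set[String])
  val zero: Acc = (0L, Set.empty[String])

  // Fold one user id into a partition-local accumulator.
  def addUser(acc: Acc, user: String): Acc = (acc._1 + 1L, acc._2 + user)

  // Merge two accumulators that were built on different partitions.
  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 ++ b._2)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-counts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: (zip, user id) pairs.
      val users = sc.parallelize(Seq(
        ("94105", "alice"), ("94105", "alice"), ("94105", "bob"),
        ("10001", "carol")))

      // One shuffle, no join: zip -> (COUNT(user), COUNT(DISTINCT user)).
      val result = users
        .aggregateByKey(zero)(addUser, merge)
        .mapValues { case (n, ids) => (n, ids.size.toLong) }

      result.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data, zip 94105 maps to (3, 2) and zip 10001 maps to (1, 1). The trade-off is memory: each key holds its full set of distinct IDs until the final mapValues, so for zips with very many users the two-pass approach with a join may still be the safer choice.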
