spark-user mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Re: Counting distinct values for a key?
Date Mon, 20 Jul 2015 01:13:52 GMT
Hi Nikunj,

Sorry, I totally misread your question.
I think you need to call groupByKey first (to bring all values for the same key together),
then follow it with mapValues (for example, put each key's values into a set and take its
size, since you want a distinct count).
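As a rough illustration of that idea, here is a minimal sketch in plain Java collections rather than Spark, so it runs without a cluster. The class name DistinctCountPerKey and the method countDistinctPerKey are made up for this example and are not part of any Spark API. The first loop plays the role of groupByKey and the second plays the role of mapValues.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not a Spark class: it only mirrors the
// "group by key, then count the distinct values" logic locally.
final class DistinctCountPerKey {

    static Map<String, Integer> countDistinctPerKey(List<Map.Entry<String, String>> rows) {
        // Step 1 (analogous to groupByKey): collect each key's values into a Set,
        // which silently drops duplicate values.
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> row : rows) {
            grouped.computeIfAbsent(row.getKey(), k -> new HashSet<>()).add(row.getValue());
        }

        // Step 2 (analogous to mapValues): replace each Set by its size.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }
}
```

Fed the (key, value) rows from the table in the original question, this gives 3 for k1, 2 for k2 and 1 for k3. On a real pair RDD the same two steps would presumably be groupByKey followed by mapValues. Note that grouping holds every value of a key in memory at once, so removing duplicate (key, value) pairs first and then counting per key may scale better for keys with many values.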

HTH,

Jerry


> On 19 Jul, 2015, at 8:48 pm, N B <nb.nospam@gmail.com> wrote:
> 
> Hi Suyog,
> 
> That code outputs the following:
> 
> key2 val22 : 1
> key1 val1 : 2
> key2 val2 : 2
> 
> while the output I want to achieve would have been (with your example):
> 
> key1 : 2
> key2 : 2
> 
> because there are 2 distinct values for each key (regardless of their actual duplicate counts, hence the use of the DISTINCT keyword in the equivalent query).
> 
> Thanks
> Nikunj
> 
> 
>> On Sun, Jul 19, 2015 at 2:37 PM, suyog choudhari <suyogchoudhari@gmail.com> wrote:
>> public static void main(String[] args) {
>>     SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
>>     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>>
>>     List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key1", "val1"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val2"));
>>     list.add(new Tuple2<String, String>("key2", "val22"));
>>
>>     JavaPairRDD<String, Integer> rdd = jsc.parallelize(list)
>>         .mapToPair(t -> new Tuple2<String, Integer>(t._1 + " " + t._2, 1));
>>     JavaPairRDD<String, Integer> rdd2 = rdd.reduceByKey((c1, c2) -> c1 + c2);
>>
>>     List<Tuple2<String, Integer>> output = rdd2.collect();
>>     for (Tuple2<?, ?> tuple : output) {
>>         System.out.println(tuple._1() + " : " + tuple._2());
>>     }
>> }
>> 
>> 
>>> On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>>> You mean this does not work?
>>> 
>>> SELECT key, count(value) from table group by key
>>> 
>>> 
>>> 
>>>> On Sun, Jul 19, 2015 at 2:28 PM, N B <nb.nospam@gmail.com> wrote:
>>>> Hello,
>>>> 
>>>> How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream.
>>>> 
>>>> SELECT key, count(distinct(value)) from table group by key;
>>>> 
>>>> so for example, given the following dataset in the table:
>>>> 
>>>>  key | value
>>>> -----+-------
>>>>  k1  | v1
>>>>  k1  | v1
>>>>  k1  | v2
>>>>  k1  | v3
>>>>  k1  | v3
>>>>  k2  | vv1
>>>>  k2  | vv1
>>>>  k2  | vv2
>>>>  k2  | vv2
>>>>  k2  | vv2
>>>>  k3  | vvv1
>>>>  k3  | vvv1
>>>> 
>>>> the result will be:
>>>> 
>>>>  key | count
>>>> -----+-------
>>>>  k1  |     3
>>>>  k2  |     2
>>>>  k3  |     1
>>>> 
>>>> Thanks
>>>> Nikunj
> 
