spark-user mailing list archives

From N B <nb.nos...@gmail.com>
Subject Re: Counting distinct values for a key?
Date Mon, 20 Jul 2015 00:48:40 GMT
Hi Suyog,

That code outputs the following:

key2 val22 : 1
key1 val1 : 2
key2 val2 : 2

while the output I want to achieve would have been (with your example):

key1 : 1
key2 : 2

because key1 has 1 distinct value (val1) and key2 has 2 (val2 and val22),
regardless of their actual duplicate counts; hence the use of the DISTINCT
keyword in the equivalent query.
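
To make the intended semantics concrete, here is a minimal plain-Java sketch (no Spark involved) that runs on the dataset from the original question further down the thread. The class and method names are made up for illustration. It drops duplicate (key, value) pairs first and then counts what is left per key. On a JavaPairRDD the same two steps would presumably map to distinct() followed by a per-key count, e.g. mapToPair to (key, 1) and then reduceByKey, though that is not tested here on a windowed DStream.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
public class DistinctPerKey {

    // Returns, for each key, the number of distinct values seen with it.
    public static Map<String, Integer> countDistinctPerKey(
            List<Map.Entry<String, String>> rows) {
        // Step 1: drop duplicate (key, value) pairs.
        Set<Map.Entry<String, String>> distinctPairs = new HashSet<>(rows);

        // Step 2: count the remaining pairs per key (TreeMap keeps keys sorted).
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> pair : distinctPairs) {
            counts.merge(pair.getKey(), 1, Integer::sum);
        }
        return counts;
    }

    // Small convenience for building (key, value) rows.
    public static Map.Entry<String, String> row(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        // Dataset from the original question in this thread.
        List<Map.Entry<String, String>> rows = Arrays.asList(
            row("k1", "v1"), row("k1", "v1"), row("k1", "v2"),
            row("k1", "v3"), row("k1", "v3"),
            row("k2", "vv1"), row("k2", "vv1"), row("k2", "vv2"),
            row("k2", "vv2"), row("k2", "vv2"),
            row("k3", "vvv1"), row("k3", "vvv1"));

        countDistinctPerKey(rows)
            .forEach((key, count) -> System.out.println(key + " : " + count));
    }
}
```

For that dataset this gives 3 for k1, 2 for k2 and 1 for k3, which matches the result table in the original question.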

Thanks
Nikunj


On Sun, Jul 19, 2015 at 2:37 PM, suyog choudhari <suyogchoudhari@gmail.com>
wrote:

> public static void main(String[] args) {
>   SparkConf sparkConf = new SparkConf().setAppName("CountDistinct");
>   JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>
>   List<Tuple2<String, String>> list = new ArrayList<Tuple2<String, String>>();
>   list.add(new Tuple2<String, String>("key1", "val1"));
>   list.add(new Tuple2<String, String>("key1", "val1"));
>   list.add(new Tuple2<String, String>("key2", "val2"));
>   list.add(new Tuple2<String, String>("key2", "val2"));
>   list.add(new Tuple2<String, String>("key2", "val22"));
>
>   JavaPairRDD<String, Integer> rdd = jsc.parallelize(list)
>       .mapToPair(t -> new Tuple2<String, Integer>(t._1 + " " + t._2, 1));
>   JavaPairRDD<String, Integer> rdd2 = rdd.reduceByKey((c1, c2) -> c1 + c2);
>
>   List<Tuple2<String, Integer>> output = rdd2.collect();
>   for (Tuple2<?, ?> tuple : output) {
>     System.out.println(tuple._1() + " : " + tuple._2());
>   }
> }
>
> On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>
>> You mean this does not work?
>>
>> SELECT key, count(value) from table group by key
>>
>>
>>
>> On Sun, Jul 19, 2015 at 2:28 PM, N B <nb.nospam@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> How do I go about performing the equivalent of the following SQL clause
>>> in Spark Streaming? I will be using this on a Windowed DStream.
>>>
>>> SELECT key, count(distinct(value)) from table group by key;
>>>
>>> so for example, given the following dataset in the table:
>>>
>>>  key | value
>>> -----+-------
>>>  k1  | v1
>>>  k1  | v1
>>>  k1  | v2
>>>  k1  | v3
>>>  k1  | v3
>>>  k2  | vv1
>>>  k2  | vv1
>>>  k2  | vv2
>>>  k2  | vv2
>>>  k2  | vv2
>>>  k3  | vvv1
>>>  k3  | vvv1
>>>
>>> the result will be:
>>>
>>>  key | count
>>> -----+-------
>>>  k1  |     3
>>>  k2  |     2
>>>  k3  |     1
>>>
>>> Thanks
>>> Nikunj
>>>
>>>
>>
>
