spark-user mailing list archives

From Georg Heiler <georg.kf.hei...@gmail.com>
Subject Re: Collecting Multiple Aggregation query result on one Column as collectAsMap
Date Tue, 29 Aug 2017 10:22:52 GMT
What about a custom UDAF?
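
A custom UDAF keeps a small buffer, updates it for each row, merges buffers across partitions, and derives the final values at the end. A minimal sketch of that shape in plain Python follows. It is not Spark code: the names `Stats`, `update`, `merge`, `FINISHERS` and `aggregate` are made up for illustration, the "partitions" are ordinary lists, and the input is assumed to be non-empty.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass
class Stats:
    """Running state for one column (the buffer an aggregate function would keep)."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf


def update(acc: Stats, x: float) -> Stats:
    """Fold a single value into the buffer (the per-row step)."""
    return Stats(acc.count + 1, acc.total + x,
                 min(acc.minimum, x), max(acc.maximum, x))


def merge(a: Stats, b: Stats) -> Stats:
    """Combine two partial buffers (the cross-partition step)."""
    return Stats(a.count + b.count, a.total + b.total,
                 min(a.minimum, b.minimum), max(a.maximum, b.maximum))


# Hypothetical registry: operation name -> function of the final buffer.
# "range" stands in for a custom statistic that is not predefined.
FINISHERS = {
    "min": lambda s: s.minimum,
    "max": lambda s: s.maximum,
    "mean": lambda s: s.total / s.count,  # assumes at least one value
    "range": lambda s: s.maximum - s.minimum,
}


def aggregate(partitions, ops):
    """Scan the data once and return {operation name: result} for the requested ops."""
    partials = [reduce(update, part, Stats()) for part in partitions]
    final = reduce(merge, partials, Stats())
    return {op: FINISHERS[op](final) for op in ops}


result = aggregate([[1.0, 2.0], [3.0, 6.0]], ["min", "max", "mean", "range"])
```

Here `result` maps each requested operation name to its value, so the caller never has to rely on the order in which the aggregations ran, and only the requested subset of statistics is evaluated after the single pass.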
Patrick <titlibatali@gmail.com> wrote on Mon, 28 Aug 2017 at 20:54:

> OK. I see there is a describe() function that computes statistics on a
> Dataset, similar to StatCounter. However, I don't want to restrict my
> aggregations to the standard mean, stddev, etc. I want to generate some
> custom stats, or run only a subset of the predefined stats on a
> particular column.
> I was wondering whether we need to write custom code that does this in a
> single action (job); that would work for me.
>
>
>
> On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler <georg.kf.heiler@gmail.com>
> wrote:
>
>> RDD only.
>> Patrick <titlibatali@gmail.com> wrote on Mon, 28 Aug 2017 at 20:13:
>>
>>> Ah, does it work with the Dataset API, or do I need to convert it to an RDD first?
>>>
>>> On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler <
>>> georg.kf.heiler@gmail.com> wrote:
>>>
>>>> What about the RDD StatCounter?
>>>> https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html
>>>>
>>>> Patrick <titlibatali@gmail.com> wrote on Mon, 28 Aug 2017 at 16:47:
>>>>
>>>>> Hi
>>>>>
>>>>> I have two lists:
>>>>>
>>>>>
>>>>>    - List one: contains names of columns on which I want to do
>>>>>    aggregate operations.
>>>>>    - List two: contains the aggregate operations that I want to
>>>>>    perform on each column, e.g. (min, max, mean).
>>>>>
>>>>> I am trying to use the Spark 2.0 Dataset API to achieve this. Spark
>>>>> provides an agg() to which you can pass a Map<String,String> (of column
>>>>> name and respective aggregate operation) as input. However, I want to
>>>>> perform different aggregate operations on the same column of the data
>>>>> and collect the result in a Map<String,String>, where the key is the
>>>>> aggregate operation and the value is the result for that column. If I
>>>>> add different agg() operations for the same column, the key gets
>>>>> overwritten with the latest value.
>>>>>
>>>>> Also, I can't find any collectAsMap() operation that returns a map with
>>>>> the aggregated column name as key and the result as value. I get
>>>>> collectAsList(), but I don't know the order in which those agg()
>>>>> operations are run, so how do I match which list value corresponds to
>>>>> which agg operation? I am able to see the result using .show(), but how
>>>>> can I collect the result in this case?
>>>>>
>>>>> Is it possible to do different aggregations on the same column in one
>>>>> job (i.e. only one collect operation) using the agg() operation?
>>>>>
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>
>
