spark-user mailing list archives

From Ashen Weerathunga <as...@wso2.com>
Subject Re: Calculating Min and Max Values using Spark Transformations?
Date Sun, 30 Aug 2015 08:00:25 GMT
Thanks everyone for the help!


On Sat, Aug 29, 2015 at 2:55 AM, Alexey Grishchenko <programmerag@gmail.com>
wrote:

> If the data is already in an RDD, the easiest way to calculate min/max for
> each column would be the aggregate() function. It takes two functions as
> arguments: the first aggregates RDD values into your "accumulator", and
> the second merges two accumulators. This way both the min and max for
> all the columns in your RDD are calculated in a single pass over it.
> Here's an example in Python:
>
> import random
>
> def agg1(acc, row):
>     # Fold one row into the accumulator [mins, maxs]
>     if len(acc) == 0:
>         return [list(row), list(row)]
>     return [[min(a, b) for a, b in zip(acc[0], row)],
>             [max(a, b) for a, b in zip(acc[1], row)]]
>
> def agg2(a, b):
>     # Merge two accumulators; either may be empty (from an empty partition)
>     if len(a) == 0: return b
>     if len(b) == 0: return a
>     return [[min(p, q) for p, q in zip(a[0], b[0])],
>             [max(p, q) for p, q in zip(a[1], b[1])]]
>
> rdd  = sc.parallelize(range(100000), 5)
> rdd2 = rdd.map(lambda x: [random.randint(1, 100) for _ in range(15)])
> rdd2.aggregate([], agg1, agg2)
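The interplay of the two functions can be checked without Spark at all: the sketch below simulates two partitions in plain Python (seq_op/comb_op mirror agg1/agg2 above; the sample rows are made up for illustration):

```python
from functools import reduce

def seq_op(acc, row):
    # Fold one row into the accumulator [mins, maxs]
    if not acc:
        return [list(row), list(row)]
    return [[min(a, b) for a, b in zip(acc[0], row)],
            [max(a, b) for a, b in zip(acc[1], row)]]

def comb_op(a, b):
    # Merge two partition accumulators; either may be empty
    if not a:
        return b
    if not b:
        return a
    return [[min(p, q) for p, q in zip(a[0], b[0])],
            [max(p, q) for p, q in zip(a[1], b[1])]]

rows = [[3, 9], [7, 1], [5, 5]]
part1 = reduce(seq_op, rows[:2], [])   # fold one "partition"
part2 = reduce(seq_op, rows[2:], [])   # fold the other "partition"
print(comb_op(part1, part2))           # [[3, 1], [7, 9]]
```

Spark applies seq_op within each partition and comb_op across partitions, which is why both functions must tolerate the empty zero value.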
>
> What I would personally do in your case depends on what else you want to
> do with the data. If you plan to run some more business logic on top of it
> and you're more comfortable with SQL, it might be worth registering this
> DataFrame as a table and generating a SQL query against it (generate a string
> with a series of min/max calls). But to solve your specific problem I'd load
> the file with textFile(), use a map() transformation to split each string on
> commas and convert it to an array of doubles, and call aggregate() on top
> of it just like in the example above.
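Generating such a min/max query string is a one-liner. A sketch, assuming the DataFrame was registered under the hypothetical table name "dftable" and using made-up column names (in practice you'd take them from df.columns):

```python
# Hypothetical column names; in practice use df.columns
cols = ["c0", "c1", "c2"]
exprs = ", ".join("MIN({0}) AS min_{0}, MAX({0}) AS max_{0}".format(c)
                  for c in cols)
query = "SELECT " + exprs + " FROM dftable"
print(query)  # one SELECT with a MIN/MAX pair per column
```

The resulting string can then be passed to sqlContext.sql() to get all the per-column extremes in a single query.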
>
> On Fri, Aug 28, 2015 at 6:15 PM, Burak Yavuz <brkyvz@gmail.com> wrote:
>
>> Or you can just call describe() on the DataFrame. In addition to min/max,
>> you'll also get the mean and the count of non-null, non-NA elements.
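For a numeric column, describe() reports count, mean, stddev, min and max; the same numbers computed by hand in plain Python (a sketch over made-up values, not Spark code):

```python
import math

xs = [2.0, 4.0, 6.0]  # stand-in for one numeric column
count = len(xs)
mean = sum(xs) / count
# Spark's describe() reports the sample standard deviation (divides by n - 1)
stddev = math.sqrt(sum((x - mean) ** 2 for x in xs) / (count - 1))
print(count, mean, stddev, min(xs), max(xs))  # 3 4.0 2.0 2.0 6.0
```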
>>
>> Burak
>>
>> On Fri, Aug 28, 2015 at 10:09 AM, java8964 <java8964@hotmail.com> wrote:
>>
>>> Or RDD.max() and RDD.min() won't work for you?
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Subject: Re: Calculating Min and Max Values using Spark Transformations?
>>> To: ashen@wso2.com
>>> CC: user@spark.apache.org
>>> From: jfchen@us.ibm.com
>>> Date: Fri, 28 Aug 2015 09:28:43 -0700
>>>
>>>
>>> If you have already loaded the CSV data into a DataFrame, why not register
>>> it as a table and use Spark SQL to find the max/min or any other
>>> aggregates? SELECT MAX(column_name) FROM dftable_name ... seems natural.
>>>
>>>
>>>
>>>
>>>
>>>    *JESSE CHEN*
>>>    Big Data Performance | IBM Analytics
>>>
>>>    Office:  408 463 2296
>>>    Mobile: 408 828 9068
>>>    Email:   jfchen@us.ibm.com
>>>
>>>
>>>
>>>
>>> From: ashensw <ashen@wso2.com>
>>> To: user@spark.apache.org
>>> Date: 08/28/2015 05:40 AM
>>> Subject: Calculating Min and Max Values using Spark Transformations?
>>>
>>> ------------------------------
>>>
>>>
>>>
>>> Hi all,
>>>
>>> I have a dataset which consists of a large number of features (columns).
>>> It is in CSV format, so I loaded it into a Spark DataFrame. Then I
>>> converted it into a JavaRDD<Row>, then using a Spark transformation into a
>>> JavaRDD<String[]>, and then again into a JavaRDD<double[]>. So now I have
>>> a JavaRDD<double[]>. Is there any method to calculate the max and min
>>> values of each column in this JavaRDD<double[]>?
>>>
>>> Or is there any way to access the array if I store the max and min values
>>> in an array inside the Spark transformation class?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>>
>>
>
>
> --
> Best regards, Alexey Grishchenko
>
> phone: +353 (87) 262-2154
> email: ProgrammerAG@gmail.com
> web:   http://0x0fff.com
>



-- 
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: ashen@wso2.com
Mobile: +94 716042995
LinkedIn: http://lk.linkedin.com/in/ashenweerathunga
