spark-issues mailing list archives

From "Ivan Vergiliev (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Date Thu, 04 Dec 2014 12:19:12 GMT
Ivan Vergiliev created SPARK-4743:
-------------------------------------

             Summary: Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
                 Key: SPARK-4743
                 URL: https://issues.apache.org/jira/browse/SPARK-4743
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Ivan Vergiliev


aggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial (zero) value. This means the Java serializer is always used for it, which can be very expensive when there is a large number of groups. Calling combineByKey manually, so that the normal serializer is used instead of the closure one, improved performance on the dataset I'm testing with by about 30-35%.
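
For illustration, here is a minimal, self-contained sketch of the workaround; the object name, sample data, and local-mode setup are mine, not from the original report. The point is that combineByKey constructs the zero value inside createCombiner, so no serialized clone of it is needed per key:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // PairRDDFunctions implicits on Spark 1.2 and earlier

object AggregateViaCombine {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-4743-sketch").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Equivalent of pairs.aggregateByKey(0)(_ + _, _ + _), but the zero is
    // built in-process rather than cloned per key via the closure serializer.
    val sums = pairs.combineByKey(
      (v: Int) => 0 + v,              // createCombiner: fresh zero, first value folded in
      (acc: Int, v: Int) => acc + v,  // mergeValue: fold a value into the accumulator
      (a: Int, b: Int) => a + b)      // mergeCombiners: merge per-partition accumulators

    sums.collect().foreach(println)
    sc.stop()
  }
}
{code}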

I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and only a single value of type U is being serialized. That value should be handled fine by the default serializer, since it can also appear as the output of a job. Let me know if I'm missing anything.
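
For reference, the change the title proposes boils down to roughly the following (a sketch against the Spark 1.x PairRDDFunctions source, not a verbatim diff):

{code:scala}
// Sketch of the proposed change inside PairRDDFunctions.aggregateByKey
// (and analogously foldByKey); surrounding code approximated from the
// Spark 1.x source:
//
//   - val zeroBuffer = SparkEnv.get.closureSerializer.newInstance().serialize(zeroValue)
//   + val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
//
// SparkEnv.serializer is the serializer configured via spark.serializer
// (e.g. Kryo), while the closure serializer is the Java serializer, as
// noted above.
{code}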



