spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Reynold Xin (JIRA)" <>
Subject [jira] [Created] (SPARK-17949) Introduce a JVM object based aggregate operator
Date Fri, 14 Oct 2016 23:49:20 GMT
Reynold Xin created SPARK-17949:

             Summary: Introduce a JVM object based aggregate operator
                 Key: SPARK-17949
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Reynold Xin

The new Tungsten execution engine has very robust memory management and speed for simple data
types. It does, however, suffer from the following:

1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is fairly expensive
to fit into the Tungsten internal format.

2. For aggregate functions that require complex intermediate data structures, Unsafe (on raw
bytes) is not a good programming abstraction due to the lack of structs.

The idea here is to introduce an JVM object based hash aggregate operator that can support
the aforementioned use cases. This operator, however, should limit its memory usage to avoid
putting too much pressure on GC, e.g. falling back to sort-based aggregate as soon the number
of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and have observed
substantial speed-ups over existing Spark.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message