spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes
Date Thu, 19 May 2016 23:34:12 GMT


Apache Spark reassigned SPARK-15420:

    Assignee: Apache Spark

> Repartition and sort before Parquet writes
> ------------------------------------------
>                 Key: SPARK-15420
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>            Assignee: Apache Spark
> Parquet requires buffering data in memory before writing a group of rows organized by
column. This causes significant memory pressure when writing partitioned output because each
open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the {{WriterContainer}} to
avoid keeping many files open at once. But, this isn't a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted correctly.
For example, a global sort will cause two sorts to happen, even if the global sort correctly
prepares the data.
> * To prevent a large number of small output files, users must manually add a repartition
step. That step is also ignored by the sort within the writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should detect if
the incoming data is already sorted. The {{DataFrameWriter}} should also expose the ability
to repartition data before the write stage, and the query planner should expose an option
to automatically insert repartition operations.
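
The manual workaround mentioned above can be sketched as follows. This is an illustrative sketch, not code from the issue: the DataFrame {{df}}, the partition column {{dt}}, and the output path are all hypothetical.

```scala
import org.apache.spark.sql.functions.col

// Sketch of the manual workaround: repartition by the output partition
// column so rows for the same partition land in the same task, then sort
// within partitions so the Parquet writer keeps only one file open at a
// time instead of buffering a row group per open file.
// `df`, the column "dt", and the path are hypothetical placeholders.
df.repartition(col("dt"))            // shuffle rows by partition value
  .sortWithinPartitions(col("dt"))   // rows for each partition arrive contiguously
  .write
  .partitionBy("dt")
  .parquet("/tmp/output")
```

Note that this pattern still pays for the extra sort in {{WriterContainer}} on top of the explicit {{sortWithinPartitions}}, which is exactly the redundancy this issue asks the planner to detect.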

This message was sent by Atlassian JIRA

