spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes
Date Thu, 19 May 2016 23:34:12 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15420:
------------------------------------

    Assignee: Apache Spark

> Repartition and sort before Parquet writes
> ------------------------------------------
>
>                 Key: SPARK-15420
>                 URL: https://issues.apache.org/jira/browse/SPARK-15420
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>            Assignee: Apache Spark
>
> Parquet requires buffering data in memory before writing a group of rows organized by
> column. This causes significant memory pressure when writing partitioned output because each
> open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the {{WriterContainer}} to
> avoid keeping many files open at once. But this isn't a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted correctly.
>   For example, a global sort will cause two sorts to happen, even if the global sort correctly
>   prepares the data.
> * To prevent a large number of small output files, users must manually add a repartition
>   step. That step is also ignored by the sort within the writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}.
>
> The sort in {{WriterContainer}} makes sense as a safeguard, but it should detect whether
> the incoming data is already sorted. The {{DataFrameWriter}} should also expose the ability
> to repartition data before the write stage, and the query planner should expose an option
> to automatically insert repartition operations.
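
The "detect whether the incoming data is already sorted" idea can be illustrated with a small, Spark-free sketch. This is not Spark's WriterContainer code: the names is_sorted_by and write_partitioned are hypothetical, rows are plain tuples, and the "files" are in-memory lists. It only shows the control flow of skipping the sort when the input already arrives ordered by partition key, so that each output file can be written and closed before the next one is opened.

```python
from itertools import groupby
from typing import Callable, List, Tuple

Row = Tuple[str, int]  # (partition value, payload); illustrative only


def is_sorted_by(rows: List[Row], key: Callable[[Row], str]) -> bool:
    """Return True if rows are already in non-decreasing order of key(row)."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


def write_partitioned(
    rows: List[Row], key: Callable[[Row], str]
) -> Tuple[bool, List[Tuple[str, List[Row]]]]:
    """Group rows into one 'file' per partition value, sorting only if needed.

    Returns (did_sort, files), where files is a list of
    (partition value, rows) pairs in the order they would be written.
    """
    did_sort = not is_sorted_by(rows, key)
    # sorted() is stable, so rows keep their relative order within a partition.
    ordered = sorted(rows, key=key) if did_sort else rows
    # Once rows are ordered by key, each partition is one contiguous run,
    # so only a single output file needs to be open at any time.
    files = [(k, list(group)) for k, group in groupby(ordered, key=key)]
    return did_sort, files


if __name__ == "__main__":
    by_partition = lambda row: row[0]
    print(write_partitioned([("a", 1), ("a", 2), ("b", 3)], by_partition))
    print(write_partitioned([("b", 3), ("a", 1), ("a", 2)], by_partition))
```

In a real engine the check would more plausibly be made against the plan's known output ordering than by scanning the rows; the scan here is only a stand-in for that decision.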


