spark-issues mailing list archives

From "Ran Haim (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
Date Thu, 27 Oct 2016 10:35:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ]

Ran Haim edited comment on SPARK-17436 at 10/27/16 10:35 AM:
-------------------------------------------------------------

Usually you partition the data and then order it - this way the ordering is preserved.
The problem here occurs in the writer itself; the DataFrame is already partitioned and ordered
correctly.
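
For illustration, a minimal sketch of that usual pattern (the DataFrame df, the column names,
and the output path are hypothetical):

    import org.apache.spark.sql.functions.col

    // Partition by the write key and sort within each partition before
    // writing, so each task sees its rows in timestamp order.
    df.repartition(col("eventType"))
      .sortWithinPartitions(col("eventType"), col("timestamp"))
      .write
      .partitionBy("eventType")
      .parquet("/tmp/events")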

I should have some time to work on it next week or so - can I just open a pull request and
link it here?



> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes break the ordering of a sorted DataFrame.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (the limit is configurable), it starts
> inserting rows into an UnsafeKVExternalSorter, then reads all the rows back from the sorter
> and writes them to the corresponding files.
> The problem is that the sorter sorts the rows by the partition key, which can break the
> original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and instead put the rows in a map,
> keyed by partition key with an ArrayList as the value, then walk through all the keys and
> write each list in its original order - this will probably be faster as well, since no
> sorting is needed.
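> For illustration, a rough sketch of that buffering idea (simplified types; the real
> InternalRow handling, row copying, and writer plumbing are omitted):
>
>     import scala.collection.mutable
>
>     // Sketch: buffer rows per partition key in insertion order instead of
>     // handing them to a sorter. LinkedHashMap preserves key order; each
>     // buffer preserves the rows' original (already sorted) order.
>     def writeRowsBuffered[K, R](rows: Iterator[(K, R)],
>                                 write: (K, R) => Unit): Unit = {
>       val buffers = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
>       for ((key, row) <- rows) {
>         buffers.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row
>       }
>       // Walk the keys and emit each partition's rows in their original order.
>       for ((key, buffered) <- buffers; row <- buffered) {
>         write(key, row)
>       }
>     }
>
> Note that, unlike the sorter, this buffers everything in memory and cannot spill to disk,
> so a real fix would still need to bound memory use.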




