spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-27560) HashPartitioner uses Object.hashCode which is not seeded
Date Tue, 02 Jul 2019 17:12:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-27560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27560:
------------------------------------

    Assignee:     (was: Apache Spark)

> HashPartitioner uses Object.hashCode which is not seeded
> --------------------------------------------------------
>
>                 Key: SPARK-27560
>                 URL: https://issues.apache.org/jira/browse/SPARK-27560
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.0
>         Environment: Notebook is running spark v2.4.0 local[*]
> Python 3.6.6 (default, Sep  6 2018, 13:10:03)
> [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
> I imagine this would reproduce on all operating systems and most versions of spark though.
>            Reporter: Andrew McHarg
>            Priority: Minor
>
> Forgive the quality of this bug report: I am a PySpark user and not very familiar
> with the internals of Spark, but I seem to have hit a strange corner case with the
> HashPartitioner.
> This may already be known, but repartition with the HashPartitioner seems to assign
> every row to the same partition when data that was previously partitioned by the same
> column is only partially read (say, one partition of it). I suppose this is an obvious
> consequence of Object.hashCode being deterministic, but it took a while to track down.
> Steps to reproduce:
>  # Create a DataFrame with a bunch of UUIDs, say 10000
>  # repartition(100, 'uuid_column')
>  # Save to parquet
>  # Read from parquet
>  # collect()[:100], then filter using pyspark.sql.functions isin (yes, I know this is
>    bad and sampleBy should probably be used here)
>  # repartition(10, 'uuid_column')
>  # The resulting DataFrame will have all of its data in a single partition
> Jupyter notebook for the above: https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e
> I think an easy fix would be to seed the HashPartitioner, as many hash table libraries
> do to avoid denial-of-service attacks. It may also be that this is obvious behavior
> to more experienced Spark users :)
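
The effect described above follows from modular arithmetic and can be illustrated without Spark. The sketch below is an illustration under stated assumptions, not Spark's implementation: `stable_hash` is a hypothetical stand-in built on SHA-256, and `partition_of` assumes the partitioner takes a non-negative hash modulo the partition count. Because 10 divides 100, every key with the same remainder mod 100 also has the same remainder mod 10, so keys read back from one of the 100 partitions all land in a single one of the 10 new partitions. Hashing with a different seed for the second repartition spreads them out again.

```python
import hashlib
import random
import uuid
from collections import Counter


def stable_hash(key: str, seed: int = 0) -> int:
    """Deterministic, non-negative stand-in hash (hypothetical, NOT Spark's hash)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_of(key: str, num_partitions: int, seed: int = 0) -> int:
    """Assumed partitioning rule: hash modulo the number of partitions."""
    return stable_hash(key, seed) % num_partitions


# Steps 1-2: 10000 UUID keys, hash-partitioned into 100 partitions.
rng = random.Random(0)  # fixed seed so the run is repeatable
keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(10000)]

# Steps 3-5: read back only the keys that fell into one of those partitions.
first = partition_of(keys[0], 100)
subset = [k for k in keys if partition_of(k, 100) == first]

# Step 6: repartition the subset into 10 partitions with the same hash.
unseeded = Counter(partition_of(k, 10) for k in subset)

# Same step, but hashing with a different seed than the first repartition.
seeded = Counter(partition_of(k, 10, seed=1) for k in subset)

print("keys in subset:", len(subset))
print("partitions used, same hash:", sorted(unseeded))
print("partitions used, different seed:", sorted(seeded))
```

Note that a seed only helps if the second repartition uses a different one than the first; a fixed seed shared by both steps would collapse the data in the same way.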



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

