spark-issues mailing list archives

From "Ilya Ganelin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items
Date Mon, 08 Dec 2014 22:58:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238613#comment-14238613 ]

Ilya Ganelin commented on SPARK-4417:
-------------------------------------

Hi, I'd like to work on this. Can someone please assign it to me? Thank you. 

> New API: sample RDD to fixed number of items
> --------------------------------------------
>
>                 Key: SPARK-4417
>                 URL: https://issues.apache.org/jira/browse/SPARK-4417
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>            Reporter: Davies Liu
>
> Sometimes we just want a fixed number of items randomly selected from an RDD. For
> example, before sorting an RDD we need to gather a fixed number of keys from each partition.
> Currently this takes two passes over the RDD: one to get the total count, and a second to
> sample with the ratio calculated from that count. In fact, we could do this in one pass.
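
For illustration, one well-known way to draw a fixed number of items in a single pass is reservoir sampling. The sketch below is plain Python, not Spark code, and the issue does not say which algorithm is intended. The function name `reservoir_sample` and its parameters are hypothetical, chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random, reading `items` once.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

The total count is never needed up front, which is what removes the first pass. Applying the same idea across the partitions of an RDD would additionally require merging the per-partition reservoirs with weights proportional to partition size, which this sketch does not attempt.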



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
