spark-issues mailing list archives

From "Patrick Wendell (JIRA)" <>
Subject [jira] [Resolved] (SPARK-4417) New API: sample RDD to fixed number of items
Date Mon, 19 Jan 2015 09:58:34 GMT


Patrick Wendell resolved SPARK-4417.
    Resolution: Won't Fix
      Assignee: Ilya Ganelin

[~ilganeli] ended up taking a crack at this, but we decided not to include the feature based
on follow-up discussion in the PR.

> New API: sample RDD to fixed number of items
> --------------------------------------------
>                 Key: SPARK-4417
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>            Reporter: Davies Liu
>            Assignee: Ilya Ganelin
> Sometimes we just want a fixed number of items randomly selected from an RDD. For
example, before sorting an RDD we need to gather a fixed number of keys from each partition.
> To do this today, we need two passes over the RDD: one to get the total count, then one to
sample with the ratio calculated from it. In fact, we could do this in one pass.
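
The issue does not say which one-pass method was intended. One standard technique for drawing a fixed number of items without knowing the total count in advance is reservoir sampling; the sketch below is a plain-Python illustration of that idea, not the code from the PR. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, seed=42)
print(len(sample))
```

Under the same assumption, a per-partition version could be written as `rdd.mapPartitions(lambda it: reservoir_sample(it, k))`, which would gather at most `k` items from each partition without counting the RDD first.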

This message was sent by Atlassian JIRA