spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-30185) Implement Dataset.tail API
Date Mon, 30 Dec 2019 16:08:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-30185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-30185.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26809
[https://github.com/apache/spark/pull/26809]

> Implement Dataset.tail API
> --------------------------
>
>                 Key: SPARK-30185
>                 URL: https://issues.apache.org/jira/browse/SPARK-30185
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.0.0
>
>
> I would like to propose an API called DataFrame.tail.
> *Background & Motivation*
> Many other systems support a way to take data from the end, for instance, pandas[1] and
>  Python[2][3]. Scala collections APIs also have head and tail.
> On the other hand, in Spark, we only provide a way to take data from the start
>  (e.g., DataFrame.head). This has been requested multiple times in the Spark
>  user mailing list[4], Stack Overflow[5][6], JIRA[7] and other third-party projects such as
>  Koalas[8].
> It seems we're missing a non-trivial use case in Spark, and this motivated me to propose this
>  API.
> *Proposal*
> I would like to propose an API against DataFrame called tail that collects rows from the
>  end, in contrast with head.
> Namely, as below:
> {code:java}
>  scala> spark.range(10).head(5)
>  res1: Array[Long] = Array(0, 1, 2, 3, 4)
>  scala> spark.range(10).tail(5)
>  res2: Array[Long] = Array(5, 6, 7, 8, 9){code}
> Implementation details will be similar to head, but in reverse order:
>  1. Run the job against the last partition and collect rows. If this is enough, return as is.
>  2. If this is not enough, calculate the number of additional partitions to select based upon
>     'spark.sql.limit.scaleUpFactor'.
>  3. Run more jobs against more partitions (in a reversed order compared to head),
>     as many as the number calculated in step 2.
>  4. Go to step 2.
>  [1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
>  [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
>  [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
>  [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
>  [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
>  [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
>  [7] [https://issues.apache.org/jira/browse/SPARK-26433]
>  [8] [https://github.com/databricks/koalas/issues/343]
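
The retrieval loop described under the implementation details above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions, not the code from pull request 26809: the object TailSketch and its tail method are hypothetical names, each inner Seq stands in for one partition, reading a slice of partitions stands in for one job, and the batch of partitions is simply multiplied by scaleUpFactor after every job (the default of 4 is assumed here to mirror 'spark.sql.limit.scaleUpFactor').

```scala
object TailSketch {
  /** Collects the last `n` rows, scanning from the final partition backwards.
    *
    * Assumptions (not taken from the Spark source): each inner Seq models one
    * partition, reading a slice of partitions models one job, and the batch of
    * partitions grows by `scaleUpFactor` after every job.
    */
  def tail[T](partitions: IndexedSeq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
    require(scaleUpFactor > 1, "scaleUpFactor must be greater than 1")
    var collected: Seq[T] = Vector.empty // rows gathered so far, in original order
    var scanned = 0                      // partitions already read, counted from the end
    var batch = 1                        // step 1: the first job reads only the last partition

    while (collected.size < n && scanned < partitions.length) {
      val upper = partitions.length - scanned // exclusive upper partition index
      val lower = math.max(0, upper - batch)  // inclusive lower partition index
      val need = n - collected.size
      // Step 3: one "job" over partitions [lower, upper); each partition
      // returns at most `need` of its trailing rows.
      val rows = partitions.slice(lower, upper).flatMap(_.takeRight(need))
      // Prepend, because these partitions come before the ones already read.
      collected = (rows ++ collected).takeRight(n)
      scanned += upper - lower
      batch *= scaleUpFactor // step 2: select more partitions for the next job
    }
    collected
  }
}
```

With four partitions Seq(0, 1, 2), Seq(3, 4), Seq(5, 6, 7) and Seq(8, 9), calling TailSketch.tail(partitions, 5) reads the last partition first, finds only two rows, reads the remaining three partitions in a second job, and returns the rows 5 through 9, in line with the spark.range(10).tail(5) example in the proposal.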



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


