spark-issues mailing list archives

From "Reynold Xin (JIRA)" <>
Subject [jira] [Closed] (SPARK-7025) Create a Java-friendly input source API
Date Sun, 01 May 2016 22:50:12 GMT


Reynold Xin closed SPARK-7025.
    Resolution: Later

> Create a Java-friendly input source API
> ---------------------------------------
>                 Key: SPARK-7025
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
> The goal of this ticket is to create a simple input source API that we can maintain and
> support long term.
> Spark currently has two de facto input source APIs:
> 1. RDD
> 2. Hadoop MapReduce InputFormat
> Neither of the above is ideal:
> 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags.
> In addition, the RDD API depends on Scala's runtime library, which does not preserve binary
> compatibility across Scala versions. If a developer chooses Java to implement an input source,
> it would be great if that input source can be binary compatible in years to come.
> 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example,
> it forces key-value semantics, and does not support running arbitrary code on the driver side
> (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell
> developers that in order to implement an input source for Spark, they should learn the Hadoop
> MapReduce API first.
> So here's the proposal: an InputSource is described by:
> * an array of InputPartition that specifies the data partitioning
> * a RecordReader that specifies how data on each partition can be read
> This interface would be similar to Hadoop's InputFormat, except that there is no explicit
> key/value separation.
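
For illustration only, the proposal above could be sketched in plain Java roughly as follows. The ticket names InputSource, InputPartition and RecordReader but gives no method signatures, so every method name here (getPartitions, createRecordReader, next, get) is an assumption, as are the toy RangeSource, RangePartition and InputSourceDemo classes. Nothing below is an actual Spark API.

```java
import java.io.Closeable;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** One slice of the input. Serializable so it could be shipped to a worker. */
interface InputPartition extends Serializable {
    int index();
}

/** Reads records of a single type T from one partition; no key/value split. */
interface RecordReader<T> extends Closeable {
    boolean next();          // advance to the next record; false when exhausted
    T get();                 // the record most recently reached by next()
    @Override void close();  // narrowed so callers need not handle IOException
}

/** Hypothetical shape of the proposed InputSource (signatures are assumed). */
interface InputSource<T> extends Serializable {
    InputPartition[] getPartitions();
    RecordReader<T> createRecordReader(InputPartition partition);
}

/** Toy partition covering the integers in [start, end). */
final class RangePartition implements InputPartition {
    final int index;
    final int start; // inclusive
    final int end;   // exclusive

    RangePartition(int index, int start, int end) {
        this.index = index;
        this.start = start;
        this.end = end;
    }

    @Override public int index() { return index; }
}

/** Toy source: the integers [0, n) split into contiguous ranges. */
final class RangeSource implements InputSource<Integer> {
    private final int n;
    private final int numPartitions;

    RangeSource(int n, int numPartitions) {
        this.n = n;
        this.numPartitions = numPartitions;
    }

    @Override public InputPartition[] getPartitions() {
        InputPartition[] parts = new InputPartition[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            // long arithmetic avoids overflow when computing range bounds
            int start = (int) ((long) i * n / numPartitions);
            int end = (int) ((long) (i + 1) * n / numPartitions);
            parts[i] = new RangePartition(i, start, end);
        }
        return parts;
    }

    @Override public RecordReader<Integer> createRecordReader(InputPartition partition) {
        final RangePartition range = (RangePartition) partition;
        return new RecordReader<Integer>() {
            private int current = range.start - 1;

            @Override public boolean next() {
                if (current + 1 < range.end) {
                    current++;
                    return true;
                }
                return false;
            }

            @Override public Integer get() { return current; }

            @Override public void close() { /* nothing to release in this toy reader */ }
        };
    }
}

final class InputSourceDemo {
    /** Reads every partition in order, standing in for what an engine would do in parallel. */
    static <T> List<T> readAll(InputSource<T> source) {
        List<T> out = new ArrayList<>();
        for (InputPartition partition : source.getPartitions()) {
            try (RecordReader<T> reader = source.createRecordReader(partition)) {
                while (reader.next()) {
                    out.add(reader.get());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeSource(10, 3)));
    }
}
```

The sketch depends only on java.io and java.util, which matches the ticket's stated wish to avoid a dependency on Scala's runtime library. The reader yields a single record type T rather than a key/value pair, which is the one difference from Hadoop's InputFormat that the description calls out.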
