spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-19582) DataFrameReader conceptually inadequate
Date Tue, 14 Feb 2017 18:12:41 GMT


Sean Owen resolved SPARK-19582.
    Resolution: Invalid

I don't understand what this is describing. Is it a dependency conflict? If so, what? You
say DataFrameReader understands every possible source, but of course it doesn't. It is also
not designed to exclude any particular data source.

You can already supply a bunch of strings to make a Dataset of strings. 
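For example, the application can fetch the data itself and hand it to Spark as strings, with no path involved. A minimal sketch (the object name and sample records are illustrative; Spark 2.1, the affected version, has the `json(RDD[String])` overload, while `json(Dataset[String])` arrived later):

```scala
import org.apache.spark.sql.SparkSession

object StringsToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("strings-demo")
      .getOrCreate()

    // Data obtained by the application itself (e.g. fetched with the minio/AWS SDK),
    // handed to Spark as raw strings rather than a path. Sample records are made up.
    val jsonLines = Seq("""{"id":1,"name":"a"}""", """{"id":2,"name":"b"}""")

    // Spark 2.1: parse an RDD[String] of JSON records into a DataFrame.
    val rdd = spark.sparkContext.parallelize(jsonLines)
    val df = spark.read.json(rdd)
    df.show()

    spark.stop()
  }
}
```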

Is this about minio? What does 'forward' mean?

Rather than reply here, please continue with more clarity on the mailing list. This isn't
clear or specific enough for a JIRA issue.

> DataFrameReader conceptually inadequate
> ---------------------------------------
>                 Key: SPARK-19582
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
> DataFrameReader assumes it "understands" all data sources (local file system, object
> stores, JDBC, ...). This seems limiting in the long term, imposing both development costs
> to accept new sources and dependency issues for existing sources (how to coordinate the XX
> jar for internal use vs. the XX jar used by the application). Unless I have missed how this
> can be done currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the application provide
> data as a String (or char[] or ...)---not a path, but the actual data. Alternatively,
> provide interfaces or abstract classes the application could implement to let the
> application handle external data sources, without forcing all that complication into the
> Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate place to raise
> the issue.
> Finally, if I have overlooked how this can be done with the current API, a new example
> would be appreciated.
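For what it's worth, Spark does already expose an extension point along these lines: the Data Source API in `org.apache.spark.sql.sources`. A hedged sketch of a custom provider (the class names and placeholder data are hypothetical; a real implementation would fetch the object with the minio/AWS SDK inside `createRelation` or `buildScan`):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation that serves application-supplied lines as a one-column table.
class AppDataRelation(override val sqlContext: SQLContext, lines: Seq[String])
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("value", StringType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(lines.map(Row(_)))
}

// A real provider would read `parameters` (e.g. an endpoint or object key)
// and fetch the data itself instead of using this placeholder.
class ExampleSourceProvider extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new AppDataRelation(sqlContext, Seq("line1", "line2")) // placeholder data
}

// Usage (assuming the class is on the classpath):
//   spark.read.format("com.example.ExampleSourceProvider").load()
```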
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS S3. A few
> configuration/parameter values differ for minio, but one can use the AWS library in the
> application to connect to the minio server.
> When trying to use minio objects through Spark, the s3://xxx paths are intercepted by
> Spark and handed to Hadoop. So far, I have been unable to find the right combination of
> configuration values and parameters to "convince" Hadoop to forward the right information
> to work with minio. If I could read the minio object in the application and then hand the
> object contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two problems.
> - Spark seems to have taken on the responsibility of "knowing" the API details of all
> data sources. This seems iffy in the long run (and is the root of my current problem). In
> the long run, it seems unwise to assume that Spark should understand all possible path
> names, protocols, etc. Moreover, passing S3 paths to Hadoop seems a little odd (why not go
> directly to AWS, for example?). This particular confusion about S3 shows the difficulties
> that are bound to occur.
> - Second, Spark appears not to have a way to bypass path-name interpretation. At the
> least, Spark could provide a text/blob interface, letting the application supply the data
> object and avoid path interpretation inside Spark. Alternatively, Spark could accept a
> reader/stream/... to build the object, again letting the application provide the
> implementation of the object.
> As I mentioned above, I might be missing something in the API that lets us work around
> the problem. I'll keep looking, but the API as apparently structured seems too limiting.
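For the specific minio problem described above, the usual route is to point Hadoop's s3a connector at the minio endpoint rather than bypassing Hadoop. A configuration sketch (the endpoint, credentials, and bucket are placeholders; this assumes the hadoop-aws module and its AWS SDK dependency are on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("minio-demo").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration

// Standard s3a configuration keys; the values below are placeholders
// for your minio deployment.
hc.set("fs.s3a.endpoint", "http://minio.example.com:9000")
hc.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hc.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
hc.set("fs.s3a.path.style.access", "true") // minio generally needs path-style requests

// Note the s3a:// scheme rather than s3://
val df = spark.read.text("s3a://my-bucket/my-object.txt")
```

This addresses the connectivity symptom, not the broader design concern, since the read still flows through Hadoop's filesystem layer.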

This message was sent by Atlassian JIRA
