spark-dev mailing list archives

From Josh Rosen <>
Subject Re: [PySpark]: reading arbitrary Hadoop InputFormats
Date Fri, 25 Oct 2013 18:00:27 GMT
Hi Nick,

I've seen several requests for SequenceFile support in PySpark, so there's
definitely demand for this feature.

I like the idea of passing MsgPack'ed data (or some other structured
format) from Java to the Python workers.  My early prototype of custom
serializers (described at
might be useful for implementing this.  Proper custom serializer support
would handle the bookkeeping for tracking each stage's input and output
formats and supplying the appropriate deserialization functions to the
Python worker, so the Python worker would be able to directly read the
MsgPack'd data that's sent to it.
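
As a rough illustration of that bookkeeping (a sketch only, not the prototype's actual code): the worker could look up a deserialization function by the stage's declared input format and apply it to each length-prefixed record it reads. The registry name, the format names, and the 4-byte big-endian length framing are assumptions made for this example, and JSON stands in for MsgPack so that the sketch needs only the standard library.

```python
import io
import json
import pickle
import struct

# Hypothetical registry: format name -> function turning bytes into an object.
# "json" is a stand-in for MsgPack so this sketch has no third-party imports.
DESERIALIZERS = {
    "pickle": pickle.loads,
    "json": lambda payload: json.loads(payload.decode("utf-8")),
}


def write_frame(stream, payload):
    """Write one record as a 4-byte big-endian length followed by the bytes."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_frames(stream, input_format):
    """Yield deserialized records until the stream is exhausted."""
    loads = DESERIALIZERS[input_format]
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream
        (length,) = struct.unpack(">i", header)
        yield loads(stream.read(length))


# Simulate the JVM side writing two records, then the worker reading them.
buf = io.BytesIO()
for record in [{"k": 1}, {"k": 2}]:
    write_frame(buf, json.dumps(record).encode("utf-8"))
buf.seek(0)
records = list(read_frames(buf, "json"))
```

Here the format name would be supplied per stage, so the worker never has to guess how its input was encoded.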

Regarding a wrapper API, it's actually possible to initially transform data
using Scala/Java and perform the remainder of the processing in PySpark.
This involves adding the appropriate compiled classes to the Java classpath and a
bit of work in Py4J to create the Java/Scala RDD and wrap it for use by
PySpark.  I can hack together a rough example of this if anyone's
interested, but it would need some work to be developed into a
user-friendly API.

If you wanted to extend your proof-of-concept to handle the cases where
keys and values have parseable toString() values, I think you could remove
the need for a delimiter by creating a PythonRDD from the newHadoopFile
JavaPairRDD and adding a new method to writeAsPickle (
to dump its contents as a pickled pair of strings.  (Aside: most of
writeAsPickle() would probably need to be eliminated or refactored when adding
general custom serializer support).
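
To show why a pickled pair avoids the delimiter problem, here is a minimal Python-only sketch. The helper names encode_pair and decode_pair are hypothetical; in the actual change the pickling would happen on the JVM side, and only the wire format idea is illustrated here.

```python
import pickle


def encode_pair(key, value):
    """Serialize a (key, value) pair of strings as one pickled tuple."""
    return pickle.dumps((key, value), protocol=2)


def decode_pair(payload):
    """Recover the (key, value) tuple from its pickled form."""
    return pickle.loads(payload)


# A key and value that both happen to contain the would-be delimiter.
key, value = "user\t42", "a\tb"

# Delimiter approach: the join cannot be split back into exactly two fields.
delimited_fields = (key + "\t" + value).split("\t")

# Pickled-pair approach: the tuple round-trips regardless of its contents.
round_tripped = decode_pair(encode_pair(key, value))
```

With the delimiter, the split yields four fields instead of two, while the pickled tuple comes back unchanged.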

- Josh

On Thu, Oct 24, 2013 at 11:18 PM, Nick Pentreath wrote:

> Hi Spark Devs
> I was wondering what appetite there may be to add the ability for PySpark
> users to create RDDs from (somewhat) arbitrary Hadoop InputFormats.
> In my data pipeline for example, I'm currently just using Scala (partly
> because I love it but also because I am heavily reliant on quite custom
> Hadoop InputFormats for reading data). However, many users may prefer to
> use PySpark as much as possible (if not for everything). Reasons might
> include the need to use some Python library. While I don't do it yet, I can
> certainly see an attractive use case for using say scikit-learn / numpy to
> do data analysis & machine learning in Python. Added to this my cofounder
> knows Python well but not Scala so it can be very beneficial to do a lot of
> stuff in Python.
> For text-based data this is fine, but reading data in from more complex
> Hadoop formats is an issue.
> The current approach would of course be to write an ETL-style Java/Scala
> job and then process in Python. Nothing wrong with this, but I was thinking
> about ways to allow Python to access arbitrary Hadoop InputFormats.
> Here is a quick proof of concept:
> This works for simple stuff like SequenceFile with simple Writable
> key/values.
> To work with more complex files, perhaps an approach is to manipulate
> Hadoop JobConf via Python and pass that in. The one downside is of course
> that the InputFormat (well actually the Key/Value classes) must have a
> toString that makes sense so very custom stuff might not work.
> I wonder if it would be possible to take the objects that are yielded via
> the InputFormat and convert them into some representation like ProtoBuf,
> MsgPack, Avro, JSON, that can be read relatively more easily from Python?
> Another approach could be to allow a simple "wrapper API" such that one can
> write a wrapper function T => String and pass that into an
> InputFormatWrapper that takes an arbitrary InputFormat and yields Strings
> for the keys and values. Then all that is required is to compile that
> function and add it to the SPARK_CLASSPATH and away you go!
> Thoughts?
> Nick
