spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Pentreath <>
Subject [PySpark]: reading arbitrary Hadoop InputFormats
Date Fri, 25 Oct 2013 06:18:03 GMT
Hi Spark Devs

I was wondering what appetite there may be to add the ability for PySpark
users to create RDDs from (somewhat) arbitrary Hadoop InputFormats.

In my data pipeline for example, I'm currently just using Scala (partly
because I love it but also because I am heavily reliant on quite custom
Hadoop InputFormats for reading data). However, many users may prefer to
use PySpark as much as possible (if not for everything). Reasons might
include the need to use some Python library. While I don't do it yet, I can
certainly see an attractive use case for using say scikit-learn / numpy to
do data analysis & machine learning in Python. Added to this my cofounder
knows Python well but not Scala so it can be very beneficial to do a lot of
stuff in Python.

For text-based data this is fine, but reading data in from more complex
Hadoop formats is an issue.

The current approach would of course be to write an ETL-style Java/Scala
job and then process in Python. Nothing wrong with this, but I was thinking
about ways to allow Python to access arbitrary Hadoop InputFormats.

Here is a quick proof of concept:

This works for simple stuff like SequenceFile with simple Writable

To work with more complex files, perhaps an approach is to manipulate
Hadoop JobConf via Python and pass that in. The one downside is of course
that the InputFormat (well actually the Key/Value classes) must have a
toString that makes sense so very custom stuff might not work.

I wonder if it would be possible to take the objects that are yielded via
the InputFormat and convert them into some representation like ProtoBuf,
MsgPack, Avro, JSON, that can be read relatively more easily from Python?

Another approach could be to allow a simple "wrapper API" such that one can
write a wrapper function T => String and pass that into an
InputFormatWrapper that takes an arbitrary InputFormat and yields Strings
for the keys and values. Then all that is required is to compile that
function and add it to the SPARK_CLASSPATH and away you go!



  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message