spark-user mailing list archives

From Peter Aberline <peter.aberl...@gmail.com>
Subject Re: PySpark sequence file support
Date Fri, 18 Oct 2013 11:56:40 GMT

On 18 Oct 2013, at 10:10, Peter Aberline <peter.aberline@gmail.com> wrote:

> Hi
> 
> I've just noticed that reading sequence files doesn't appear to have been
> implemented yet in the PySpark API.
> 
> Would it be a difficult task for me to add this feature without being
> familiar with the code base?
> 
> Alternatively, is there a workaround for this? My data is in a single very
> large sequence file containing more than 250,000 elements, and my code is
> already in Python. I'm writing the sequence file using Pydoop, so perhaps
> there is a way to build an RDD by reading it in via Pydoop?
> 
> Thanks,
> Peter


Hi again,

I've been taking a look at the source to see how hard this would be to implement. Many Python API methods are simply wrappers that call the corresponding methods in the Scala/Java API through a Python-managed 'JavaSparkContext'.
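For example, this is roughly the pattern I see (a simplified sketch of my reading of context.py, not the actual source):

    # Sketch of the wrapper pattern: the Python SparkContext holds a Py4J
    # proxy to a JavaSparkContext and delegates calls to it, wrapping the
    # returned JavaRDDs in Python RDD objects.
    from pyspark.rdd import RDD

    class SparkContext(object):
        def __init__(self, gateway, master, jobName):
            # Py4J proxy for org.apache.spark.api.java.JavaSparkContext
            self._jvm = gateway.jvm
            self._jsc = self._jvm.JavaSparkContext(master, jobName)

        def textFile(self, name, minSplits=2):
            # The Java call returns a JavaRDD[String]; wrap it for Python use.
            return RDD(self._jsc.textFile(name, minSplits), self)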

So far, so good. I think I should be able to add a sequenceFile method to context.py that calls the corresponding method on the JavaSparkContext. However, I'm struggling with how to represent the key and value types in Python and have them automagically mapped to the Java Writable types.
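Something along these lines is what I have in mind (just a sketch: the class-name lookup via java.lang.Class.forName and the minSplits default are guesses, and it sidesteps the real problem of deserialising the Writables into Python objects):

    def sequenceFile(self, path, keyClass, valueClass, minSplits=2):
        # Resolve Hadoop Writable class names, e.g. "org.apache.hadoop.io.Text",
        # to java.lang.Class objects through the Py4J gateway.
        jkey = self._jvm.java.lang.Class.forName(keyClass)
        jvalue = self._jvm.java.lang.Class.forName(valueClass)
        # JavaSparkContext.sequenceFile(path, keyClass, valueClass, minSplits)
        # returns a pair RDD of Writables; the open question is how to convert
        # those into Python objects before wrapping the result.
        jrdd = self._jsc.sequenceFile(path, jkey, jvalue, minSplits)
        return RDD(jrdd, self)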

Of course, if I get this working, a PR will follow.
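In the meantime, my fallback for my own data is to read the pairs in Python and hand them to sc.parallelize. A rough sketch (read_pairs() is a placeholder for whatever Pydoop-based reader I end up writing, and materialising everything in the driver is obviously not ideal for a file this size):

    from pyspark import SparkContext

    def read_pairs(path):
        # Placeholder: yield (key, value) tuples read from the sequence
        # file, e.g. via Pydoop. Not a real Pydoop API call.
        raise NotImplementedError

    sc = SparkContext("local", "seqfile-workaround")
    # Pull everything through the driver, then let Spark redistribute it.
    pairs = list(read_pairs("hdfs:///path/to/my_data.seq"))
    rdd = sc.parallelize(pairs)
    print(rdd.count())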

Thanks
Peter