spark-user mailing list archives

From Peter Aberline <>
Subject Re: PySpark sequence file support
Date Tue, 22 Oct 2013 08:36:06 GMT
On 21 October 2013 21:21, Andre Schumacher <> wrote:

> Hi Peter,
> just an idea: if you wouldn't mind a preprocessing step, you could
> maybe use Pydoop to write out a sequence of files that contain your
> pickled Python objects and read these into PySpark (see
> read_from_pickle_file inside
> How large is your input in total? Does it fit on one machine? Do you
> have "complicated" nested objects?
> BTW: just out of curiosity, what do you use Pydoop for? Some
> bioinformatics related things?
> Andre
> On 10/18/2013 04:56 AM, Peter Aberline wrote:
> >
> > On 18 Oct 2013, at 10:10, Peter Aberline <>
> wrote:
> >
> >> Hi
> >>
> >> I've just noticed that the ability to read sequence files does not look
> like it's been implemented yet in the PySpark API.
> >>
> >> Would it be a difficult task for me to add this feature without being
> familiar with the code base?
> >>
> >> Alternatively, is there any workaround for this? My data is in a
> single very large sequence file containing > 250,000 elements. My code is
> already in Python. I'm writing the sequence file using Pydoop, so perhaps
> there is a way to build an RDD by reading it in via Pydoop?
> >>
> >> Thanks,
> >> Peter
> >
> >
> > Hi again,
> >
> > I've been taking a look at the source to see how hard it would be to
> implement this, and I can see that many Python API methods are simply
> wrappers that call methods in the Scala/Java API through a Python-managed
> 'JavaSparkContext'.
> >
> > So far, so good. I think I should be able to add a corresponding
> sequenceFile method that calls the corresponding method on
> the JavaSparkContext. However, I'm struggling with how to represent the key
> and value types in Python and have them automagically mapped to Java types.
> >
> > Of course, if I get this working a PR will follow.
> >
> > Thanks
> > Peter
> >
Hi Andre,

Thanks for the suggestions. If I can't get the sc.sequenceFile method
working, then I'll follow your suggestion and use Pydoop to build a local
Python collection, and then create an RDD from that.
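A minimal sketch of that workaround, assuming the sequence-file values are pickled bytes (the Pydoop reader itself is elided here; only the driver-side unpickling is shown, and `sc` is the usual SparkContext):

```python
import pickle

def unpickle_records(kv_pairs):
    # kv_pairs: iterable of (name, pickled_bytes) pairs, e.g. as read
    # from the sequence file via a Pydoop reader on the driver.
    # Returns (name, object) pairs ready to hand to sc.parallelize().
    return [(key, pickle.loads(raw)) for key, raw in kv_pairs]
```

The result would then be distributed with something like `rdd = sc.parallelize(unpickle_records(reader))` — with the caveat, as noted below, that everything passes through the master's memory first.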

The sequence files are basically just a list of pickled pandas data frames,
with the key representing the name of each data frame. It's historical
market data.

The file could fit on one machine at a pinch, but I'd need a master with a
very large amount of RAM to hold it all.

By using sc.sequenceFile I was hoping to avoid reading it all in on the
master, and to take advantage of data locality and placement information
available from HDFS, so that workers could stream just 'their' part of the
file into their memory.
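The locality idea can be illustrated with a toy split calculation (not actual Spark or HDFS code, just the arithmetic): each worker would stream its own (offset, length) slice of the big file rather than the master reading everything:

```python
def block_splits(file_size, block_size):
    # HDFS-style block splits: divide a file of file_size bytes into
    # (offset, length) slices of at most block_size bytes each, so each
    # worker can stream just 'its' part of the sequence file.
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]
```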

I've not given up yet on getting sc.sequenceFile to work; I'll take another
look at it in the next day or so.
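On the key/value type question from the quoted message, one plausible approach (purely illustrative, not PySpark's actual implementation) is a lookup table from Python types to Hadoop Writable class names, which a wrapper-side sequenceFile method could pass down to the JavaSparkContext:

```python
# Hypothetical mapping from Python key/value types to Hadoop Writable
# class names; the class names are real Hadoop classes, but the mapping
# itself is an assumption for illustration.
WRITABLE_FOR_TYPE = {
    str:   "org.apache.hadoop.io.Text",
    int:   "org.apache.hadoop.io.LongWritable",
    float: "org.apache.hadoop.io.DoubleWritable",
    bytes: "org.apache.hadoop.io.BytesWritable",
}

def writable_class_for(py_type):
    # Return the Writable class name for a Python type, or raise if
    # there is no obvious mapping (e.g. for nested objects).
    try:
        return WRITABLE_FOR_TYPE[py_type]
    except KeyError:
        raise TypeError("no Writable mapping for %r" % py_type)
```

The hard part the message points at is exactly the KeyError branch: "complicated" nested Python objects have no one-to-one Writable counterpart, which is why pickling the values first sidesteps the issue.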

