spark-user mailing list archives

From: Andre Schumacher <schum...@icsi.berkeley.edu>
Subject: Re: PySpark sequence file support
Date: Mon, 21 Oct 2013 20:21:16 GMT

Hi Peter,

Just an idea: if you wouldn't mind a preprocessing step, you could
maybe use Pydoop to write out a set of files that contain your
pickled Python objects and read those into PySpark (see
read_from_pickle_file inside serializers.py).
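
Something along these lines might work for the writer side (a rough,
untested sketch: if I'm reading serializers.py right, each record is
just a 4-byte big-endian length followed by the pickled bytes, and
write_pickle_file is only a name I made up):

    import struct
    import cPickle as pickle
    import pydoop.hdfs as hdfs

    def write_pickle_file(path, records):
        # Frame each object as <4-byte length><pickled bytes>, which
        # is what read_with_length/read_from_pickle_file expect on
        # the PySpark side.
        f = hdfs.open(path, "w")
        try:
            for obj in records:
                data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
                f.write(struct.pack(">i", len(data)))
                f.write(data)
        finally:
            f.close()

Reading those files back inside PySpark is the part I'm less sure
about; you would probably have to open them in a mapPartitions-style
function and pull the records out with the same framing.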

How large is your input in total? Does it fit on one machine? Do you
have "complicated" nested objects?

BTW: just out of curiosity, what do you use Pydoop for? Some
bioinformatics-related things?

Andre

On 10/18/2013 04:56 AM, Peter Aberline wrote:
> 
> On 18 Oct 2013, at 10:10, Peter Aberline <peter.aberline@gmail.com> wrote:
> 
>> Hi
>>
>> I've just noticed that the ability to read sequence files does not look like it's been implemented yet in the PySpark API?
>>
>> Would it be a difficult task for me to add this feature without being familiar with the code base?
>>
>> Alternatively, is there a workaround for this? My data is in a single very large sequence file containing > 250,000 elements. My code is already in Python. I'm writing the sequence file using Pydoop, so perhaps there is a way to build an RDD by reading it in via Pydoop?
>>
>> Thanks,
>> Peter
> 
> 
> Hi again,
> 
> I've been taking a look at the source to see how hard it would be to implement this, and I can see that many Python API methods are simply wrappers that call methods in the Scala/Java API through a Python-managed JavaSparkContext.
> 
> So far, so good. I think I should be able to add a corresponding sequenceFile method to context.py that calls the corresponding method on the JavaSparkContext. However, I'm struggling with how to represent the key and value types in Python and have them automagically mapped to Java types.
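> 
> Something like this is what I have in mind for context.py (a rough,
> untested sketch; the Py4J class-lookup part is a guess on my end):
> 
>     def sequenceFile(self, path, keyClass, valueClass, minSplits=None):
>         """Read a Hadoop SequenceFile with the given key/value types."""
>         minSplits = minSplits or min(self.defaultParallelism, 2)
>         # JavaSparkContext.sequenceFile takes Class objects, so resolve
>         # the fully-qualified class names through the Py4J gateway.
>         kc = self._jvm.java.lang.Class.forName(keyClass)
>         vc = self._jvm.java.lang.Class.forName(valueClass)
>         jrdd = self._jsc.sequenceFile(path, kc, vc, minSplits)
>         # Open question: the resulting RDD holds Writables, which still
>         # need converting into something the Python workers can
>         # deserialize; this is where the type mapping stumps me.
>         return RDD(jrdd, self)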
> 
> Of course, if I get this working a PR will follow.
> 
> Thanks
> Peter
> 

