spark-user mailing list archives

From Sam Stoelinga <sammiest...@gmail.com>
Subject Spark Python with SequenceFile containing numpy deserialized data in str form
Date Tue, 09 Jun 2015 03:04:44 GMT
Hi all,

I'm storing an RDD as a sequence file with the following content:
key = filename (string), value = Python str from numpy.savez (not unicode)

In order to make sure the whole numpy array gets stored, I first have to
serialize it with:

import io
import numpy as np

def serialize_numpy_array(numpy_array):
    # Write the array into an in-memory buffer and return the raw bytes
    output = io.BytesIO()
    np.savez_compressed(output, x=numpy_array)
    return output.getvalue()

>> type(output.getvalue())
str
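For the reverse direction, a minimal deserialization sketch (assuming the
same "x" keyword used in savez_compressed above) could look like:

```python
import io
import numpy as np

def deserialize_numpy_array(data):
    # np.load accepts any file-like object, so wrap the raw bytes again;
    # "x" matches the keyword used in np.savez_compressed(output, x=...)
    with np.load(io.BytesIO(data)) as archive:
        return archive["x"]
```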

The serialization returns a Python str, *not a unicode object*. After
serialization I call

my_deserialized_numpy_rdd.saveAsSequenceFile(path)

All works well and the RDD gets stored successfully. The problem starts
when I want to read the sequence file again:

>> my_deserialized_numpy_rdd = sc.sequenceFile(path)
>> first = my_deserialized_numpy_rdd.first()
>> type(first[1])
unicode

The previous str became a unicode object after we stored it to a
sequence file and read it back. Trying to convert it back with
first[1].decode("ascii") fails with UnicodeEncodeError: 'ascii' codec can't
encode characters in position 1-3: ordinal not in range(128).
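(In Python 2, calling .decode on a unicode object first *encodes* it with the
ascii codec, which is why a UnicodeEncodeError shows up here.) If the read
path decoded the raw bytes with a byte-preserving codec like latin-1 (an
assumption on my part; a UTF-8 read path may already have mangled arbitrary
binary data), then re-encoding with the same codec would be the exact
inverse. A self-contained sketch of that round trip:

```python
import io
import numpy as np

# Serialize an array the same way as above
buf = io.BytesIO()
np.savez_compressed(buf, x=np.arange(3))
payload = buf.getvalue()

# Assumption: the reader decoded the raw bytes with latin-1.
# latin-1 maps every byte 0-255 to one code point, so the round
# trip text -> bytes is lossless.
as_text = payload.decode("latin-1")
recovered = as_text.encode("latin-1")
assert recovered == payload
```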

My expectation was that I would get the data back exactly as I stored it,
i.e. as a str and not as a unicode object. Does anybody have a suggestion
for how I can read back the original data? I will try converting the str to
a bytearray before storing it to a sequence file.
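The bytearray idea could look roughly like this (the Spark calls are
commented out and assume an active SparkContext `sc` and an RDD `str_rdd`
of (filename, serialized str) pairs; whether bytearray values are written
as BytesWritable should be verified against your Spark version):

```python
import io
import numpy as np

def to_binary_value(serialized_str):
    # Wrap the serialized payload in a bytearray so that PySpark can
    # write it as a binary writable rather than Text (assumption about
    # PySpark's writable conversion; verify for your Spark version).
    return bytearray(serialized_str)

# Hypothetical Spark usage (not run here):
#   binary_rdd = str_rdd.mapValues(to_binary_value)
#   binary_rdd.saveAsSequenceFile(path)
#   restored = sc.sequenceFile(path)  # values should come back as bytearray

# Local round-trip check: bytearray preserves the bytes exactly.
buf = io.BytesIO()
np.savez_compressed(buf, x=np.arange(4))
payload = buf.getvalue()
assert bytes(to_binary_value(payload)) == payload
```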

Thanks,
Sam Stoelinga
