spark-user mailing list archives

From DB Tsai <...@netflix.com.INVALID>
Subject Re: Incomplete data when reading from S3
Date Thu, 17 Mar 2016 09:10:05 GMT
You need to use wholeTextFiles to read each whole file at once. Otherwise,
it can be split.
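
A minimal sketch of that suggestion, assuming PySpark and that `paths` is a path string `wholeTextFiles` accepts. `wholeTextFiles` yields one `(path, content)` pair per file, so the per-line JSON decoding has to be done by hand; `parse_json_lines` is a hypothetical helper name, not part of Spark.

```python
import json

def parse_json_lines(path_and_content):
    """Decode every non-empty line of one whole file as a JSON record."""
    _path, content = path_and_content  # wholeTextFiles yields (path, content)
    return [json.loads(line) for line in content.splitlines() if line.strip()]

# With a live SparkContext `sc` and the `paths` value from the original post:
# records = sc.wholeTextFiles(paths).flatMap(parse_json_lines)
```

Note that each file is loaded into memory as a single string, so this approach suits many small files better than a few very large ones.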

DB Tsai - Sent From My Phone
On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" <snuderl@gmail.com> wrote:

> Hi.
>
> We have JSON data stored in S3 (one JSON record per line). When reading the
> data from S3 using the following code, we started noticing JSON decode
> errors.
>
> sc.textFile(paths).map(json.loads)
>
>
> After a bit more investigation we noticed an incomplete line, basically
> the line was
>
>> {"key": "value", "key2":  <- notice the line abruptly ends with no json
>> close tag etc
>
>
> It is not an issue with our data and it doesn't happen very often, but it
> makes us very scared since it means spark could be dropping data.
>
> We are using Spark 1.5.1. Any ideas why this happens and possible fixes?
>
> Regards,
> Blaž Šnuderl
>
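
To measure how often truncated lines like the one quoted above occur, one option is to tag each line instead of letting `json.loads` raise. This is only a sketch under the same assumptions as the original snippet (`sc` and `paths` already defined); `try_parse` is a hypothetical helper name.

```python
import json

def try_parse(line):
    """Return ("ok", record) for valid JSON, or ("bad", raw_line) otherwise."""
    try:
        return ("ok", json.loads(line))
    except ValueError:  # JSON decode errors are ValueError (or a subclass)
        return ("bad", line)

# With a live SparkContext `sc` and the `paths` value from the original post:
# tagged = sc.textFile(paths).map(try_parse)
# bad_lines = tagged.filter(lambda t: t[0] == "bad").values().take(10)
```

Inspecting a few of the `bad` lines next to the source files can show whether the records are truncated in S3 itself or only after being read.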
