spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Handling occasional bad data ...
Date Thu, 23 Jan 2014 04:48:55 GMT
Why can't you preprocess to filter out the bad rows?  I often do this on
CSV files by testing if the raw line is "parseable" before splitting on ","
or similar.  Just validate the line before attempting to apply BigDecimal
or anything like that.
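For what it's worth, here's a rough sketch of that kind of filtering in Scala
(not from your code -- the path and column positions are made up, and it
assumes sc is the usual SparkContext from spark-shell with the BigDecimal
field in the third column):

  import scala.util.Try

  // Hypothetical input path -- point this at the real CSV directory.
  val lines = sc.textFile("hdfs:///data/incoming/*.csv")

  // Keep only rows whose fields parse cleanly; a malformed BigDecimal or a
  // short row makes the Try fail, so the row becomes None and flatMap drops it.
  val parsed = lines.flatMap { line =>
    val fields = line.split(",", -1)
    Try((fields(0), BigDecimal(fields(2).trim))).toOption
  }

You could also count or log the rows that fail to parse instead of silently
dropping them, if you want to keep an eye on how much bad data is coming in.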

Cheers,
Andrew


On Wed, Jan 22, 2014 at 9:04 PM, Manoj Samel <manojsameltech@gmail.com> wrote:

> Hi,
>
> How does Spark handle the following case?
>
> Thousands of CSV files (each about 50 MB) come from an external system.
> One RDD is defined over all of them, and it maps some of the CSV fields to
> BigDecimal etc. When building the RDD, it errors out after some time with a
> bad BigDecimal format (the error shows max retries 4).
>
> 1) It is very likely that a massive dataset will have occasional bad rows.
> It is not possible to fix this data set or pre-process it to eliminate the
> bad data. How does Spark handle this? Is it possible to say, for example,
> ignore the first N bad rows?
>
> 2) What were the "max 4 retries" in the error message? Is there any way to control them?
>
> Thanks,
>
