spark-user mailing list archives

From Manoj Samel <manojsamelt...@gmail.com>
Subject Re: Handling occasional bad data ...
Date Thu, 23 Jan 2014 17:16:38 GMT
Thanks Prashant


On Thu, Jan 23, 2014 at 5:00 AM, Prashant Sharma <scrapcodes@gmail.com> wrote:

> spark.task.maxFailures
>  http://spark.incubator.apache.org/docs/latest/configuration.html
>
>
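As a concrete illustration of the setting Prashant points to: spark.task.maxFailures
defaults to 4 and can be raised when the context is created. A minimal sketch, assuming
the SparkConf API introduced in Spark 0.9 (master URL and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.task.maxFailures defaults to 4; raising it lets a task be retried
    // more times before the whole job is aborted.
    val conf = new SparkConf()
      .setMaster("local[4]")              // placeholder master URL
      .setAppName("csv-load")             // placeholder app name
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)

Note that this only changes how many times a failed task is re-run; a row that is
deterministically malformed will fail on every retry, which is why filtering the data
(as Andrew suggests below) is usually the more robust fix.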
> On Thu, Jan 23, 2014 at 10:18 AM, Andrew Ash <andrew@andrewash.com> wrote:
>
>> Why can't you preprocess to filter out the bad rows?  I often do this on
>> CSV files by testing if the raw line is "parseable" before splitting on ","
>> or similar.  Just validate the line before attempting to apply BigDecimal
>> or anything like that.
>>
>> Cheers,
>> Andrew
>>
>>
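A minimal sketch of the pre-filtering Andrew describes, assuming a comma-separated
layout whose third field should parse as BigDecimal (the path, field positions, and
helper name are illustrative):

    import scala.util.Try

    // Accept a raw line only if it splits into enough fields and the numeric
    // column actually parses as BigDecimal; bad rows are dropped before any
    // typed conversion is attempted.
    def isParseable(line: String): Boolean = {
      val fields = line.split(",", -1)
      fields.length >= 3 && Try(BigDecimal(fields(2).trim)).isSuccess
    }

    val raw = sc.textFile("hdfs:///path/to/csv/*")   // sc as created above
    val records = raw.filter(isParseable).map { line =>
      val f = line.split(",", -1)
      (f(0), f(1), BigDecimal(f(2).trim))
    }

If a count of the dropped rows is needed, the same predicate can drive an accumulator
or a separate filter; the key point is that validation happens before the conversion
that was throwing.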
>> On Wed, Jan 22, 2014 at 9:04 PM, Manoj Samel <manojsameltech@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How does Spark handle the following case?
>>>
>>> Thousands of CSV files (each about 50 MB) come from an external system.
>>> One RDD is defined over all of them, and it parses some of the CSV fields
>>> as BigDecimal etc. When the RDD is computed, it errors out after some time
>>> with a bad BigDecimal format (the error shows max retries 4).
>>>
>>> 1) It is very likely that a massive dataset will have occasional bad rows.
>>> It is not possible to fix this dataset or pre-process it to eliminate the
>>> bad data. How does Spark handle this? Is it possible to say, for example,
>>> ignore the first N bad rows?
>>>
>>> 2) What is the "max retries 4" in the error message? Is there any way to control it?
>>>
>>> Thanks,
>>>
>>>
>>>
>>
>
>
> --
> Prashant
>
