spark-user mailing list archives

From Manoj Samel <>
Subject Handling occasional bad data ...
Date Thu, 23 Jan 2014 04:04:59 GMT

How does Spark handle the following case?

Thousands of CSV files (each about 50 MB in size) come from an external system.
One RDD is defined over all of them. The RDD maps some of the CSV fields to
BigDecimal etc. When building the RDD, it errors out after some time with a bad
BigDecimal format (the error shows max retries 4).

1) It is very likely that a massive dataset will have occasional bad rows. It
is not feasible to fix this dataset or pre-process it to eliminate the bad
data. How does Spark handle this? Is it possible to say, e.g., ignore the
first N bad rows?
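One common pattern for (1) — a sketch, not from this thread, assuming the parsing happens in a map over raw lines — is to wrap the field conversions in scala.util.Try and drop rows that fail to parse, rather than letting one malformed value fail the task:

```scala
import scala.util.Try

// Hypothetical record type for illustration; the field names are assumptions.
case class Record(id: String, amount: BigDecimal)

// Parse one CSV line, returning None for malformed rows instead of throwing.
def parseLine(line: String): Option[Record] = {
  val fields = line.split(",", -1)
  Try(Record(fields(0), BigDecimal(fields(1)))).toOption
}

val lines = Seq("a,1.50", "b,not-a-number", "c,2.25")
// With Spark this would be rdd.flatMap(parseLine); plain Scala shows the idea.
val good = lines.flatMap(parseLine)
```

In a real job, `sc.textFile(path).flatMap(parseLine)` would keep only the well-formed rows; counting the dropped rows (e.g. with an accumulator) lets you monitor how much data is being discarded.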

2) What is the "max 4 retries" in the error message? Is there any way to
control it?
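Regarding (2): the "4 retries" most likely corresponds to Spark's per-task failure limit, `spark.task.maxFailures`, which defaults to 4. A minimal sketch of raising it, assuming a standard SparkConf setup (app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the per-task failure tolerance before creating the context.
// spark.task.maxFailures = number of task failures tolerated before
// the whole job is aborted (default 4).
val conf = new SparkConf()
  .setAppName("csv-ingest")
  .set("spark.task.maxFailures", "8")
val sc = new SparkContext(conf)
```

Note that retries re-run the same task on the same data, so for a deterministically bad row this only delays the failure; skipping bad rows at parse time is the more robust fix.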

