spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: Issue with rogue data in csv file used in Spark application
Date Tue, 27 Sep 2016 21:52:48 GMT
You can read the price columns as strings, write a map to fix the bad rows,
and then convert back to your desired DataFrame.
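
A minimal sketch of that idea in plain Scala, not a definitive implementation. It assumes each line is comma-separated with the eight fields of the case class in the quoted message, in that order (the sample rows in the message show only six columns, so the real layout may differ). The names `CleanRow`, `CleanPrices`, `toPrice` and `parseLine` are hypothetical and introduced only for illustration.

```scala
import scala.util.Try

// Same shape as the case class in the quoted message, but the price fields
// are Option[Float], so a "-" placeholder becomes None instead of throwing.
case class CleanRow(Stock: String, Ticker: String, TradeDate: String,
                    Open: Option[Float], High: Option[Float],
                    Low: Option[Float], Close: Option[Float], Volume: Int)

object CleanPrices {
  // Hypothetical helper: parse one price field. Anything that does not
  // parse as a Float (such as "-") is treated as missing.
  def toPrice(s: String): Option[Float] = Try(s.trim.toFloat).toOption

  // Hypothetical helper: turn one comma-separated line into a CleanRow.
  // Returns None when the line does not have exactly eight fields or the
  // volume is not an integer (which also skips a header line).
  def parseLine(line: String): Option[CleanRow] =
    line.split(",", -1).map(_.trim) match {
      case Array(stock, ticker, date, o, h, l, c, v) =>
        Try(v.toInt).toOption.map { vol =>
          CleanRow(stock, ticker, date,
                   toPrice(o), toPrice(h), toPrice(l), toPrice(c), vol)
        }
      case _ => None
    }
}
```

In a Spark job, something like `sc.textFile(path).flatMap(CleanPrices.parseLine)` followed by `toDF()` (with the session's implicits imported) would presumably give a DataFrame with nullable price columns, so the average of Close and Open comes out as null for the rogue rows rather than raising a NumberFormatException.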
On 28 Sep 2016 06:49, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:

>
> I have historical prices for various stocks.
>
> Each CSV file holds 10 years of trades, one row per day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK, like the first row below, but rows like the second,
> where "-" appears in columns defined as Float, cause issues.
>
>   Date      Open   High   Low    Close  Volume
> 27-Sep-16   80.91  80.93  79.87  80.85  1873158
> 23-Dec-11   -      -      -      40.56  0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >= monthsago).select(
>          changeToDate("TradeDate").as("TradeDate"),
>          (('Close + 'Open) / 2).as("AverageDailyPrice"),
>          'Low.as("Day's Low"),
>          'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings, but that is not
> meaningful. Alternatively, I could clean up the CSV files before putting
> them in HDFS, but that becomes tedious and error-prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
