spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Issue with rogue data in csv file used in Spark application
Date Tue, 27 Sep 2016 22:06:38 GMT
Thanks guys

Actually these are the 7 rogue rows. The column with 0 is the Volume
column, which means there were no trades on those days


cat stock.csv | grep ",0"

SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0

So one way would be to exclude the rows where there was no volume of trade
that day when cleaning up the csv file

cat stock.csv | grep -v ",0"

and that works. Bearing in mind that putting 0s in place of "-" will skew
the price plot.
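For completeness, the same clean-up can be sketched in plain Scala (a sketch, not from the thread itself; it assumes Volume is the last field, so zero-volume rows always end in ",0" — slightly safer than the grep, which would also match ",0" anywhere in a line):

```scala
// Sample lines standing in for stock.csv; the second is a zero-volume "rogue" row.
val sample = List(
  "SAP SE,SAP, 27-Sep-16,80.91,80.93,79.87,80.85,1873158",
  "SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0"
)

// Drop rows whose trailing Volume field is 0, mirroring `grep -v ",0"`.
val cleaned = sample.filterNot(_.endsWith(",0"))
```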

BTW I am using Spark csv as well

val df1 = spark.read.option("header", true).csv(location)

This is the class and the mapping


case class columns(Stock: String, Ticker: String, TradeDate: String,
  Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)

val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
  p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
  p(6).toString.toFloat, p(7).toString.toInt))


In here I have

p(3).toString.toFloat

How can one check for rogue data in p(3)?
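One possible approach (a sketch, not an answer from the thread) is to wrap the conversion in scala.util.Try, so a rogue value like "-" yields None instead of throwing NumberFormatException:

```scala
import scala.util.Try

// Hypothetical helper: returns None for any field that does not parse as a
// Float (e.g. "-"), Some(value) otherwise.
def parseFloat(s: String): Option[Float] = Try(s.trim.toFloat).toOption

val good  = parseFloat("40.56")  // Some(40.56f)
val rogue = parseFloat("-")      // None
```

One could then either filter out rows where parseFloat(p(3).toString).isEmpty before the mapping, or substitute a sentinel/default, depending on whether skewing the price plot matters.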


Thanks





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 21:49, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

>
> I have historical prices for various stocks.
>
> Each csv file has 10 years of trades, one row per day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK, like the first one below, but the one with "-" in the
> columns defined as Float causes issues
>
>   Date     Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   -     -    -    40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >= monthsago)
>   .select(changeToDate("TradeDate").as("TradeDate"),
>     (('Close + 'Open) / 2).as("AverageDailyPrice"),
>     'Low.as("Day's Low"), 'High.as("Day's High"))
>   .orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings, but that is not
> meaningful. Alternatively, do the clean-up before putting the csv in HDFS,
> but that becomes tedious and error-prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
