spot-user mailing list archives

From Christos Mathas <mathas.c...@gmail.com>
Subject ml_ops.sh fails with NumberFormatException when reading flow_scores.csv
Date Fri, 19 Jan 2018 09:30:33 GMT
Hi,

I'm running ml_ops.sh, and since I have scored previous results, spot-ml tries to
read the data back from flow_scores.csv. It fails in stage 2, and the output
is this:


[Stage 2:> (0 + 2) / 4]18/01/19 11:13:57 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, cloudera-host-2.shield.com, executor 1):
java.lang.NumberFormatException: For input string: "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123,123,UDP,2,152,0,0,3.0071374283430035E-5,56,,,,,,,,"
     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     at java.lang.Integer.parseInt(Integer.java:492)
     at java.lang.Integer.parseInt(Integer.java:527)
     at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
     at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
     at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
     at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
     at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:89)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)

...

As you can see, the problem is that it attempts to parse the whole line as a
number; the line hasn't been split into fields. My understanding is that the file
responsible for parsing the CSV is FlowFeedback.scala
(https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala).

I saw in the code that it splits the data by "\t", so I checked
flow_scores.csv and found that it is comma(",")-separated, not
tab-separated. I tried replacing "\t" with "," but got the exact same error. I
don't know Scala, so I'm asking for your help as to how I
could fix this.
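To illustrate what I think is happening (a minimal sketch in plain Java, not the actual FlowFeedback code, and with a row shortened from the one in the stack trace): splitting a comma-separated line on "\t" returns the whole line as a single field, so the subsequent toInt/parseInt call receives the entire row and throws exactly this NumberFormatException.

```java
public class DelimiterDemo {
    public static void main(String[] args) {
        // A shortened flow_scores.csv row, similar to the one in the stack trace
        String row = "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123,123,UDP,2,152";

        // Splitting on "\t" leaves the entire line in one field...
        String[] byTab = row.split("\t");
        System.out.println(byTab.length);   // prints 1
        // ...so Integer.parseInt(byTab[0]) would throw
        // NumberFormatException: For input string: "0,2018-01-18 ..."

        // Splitting on "," yields the individual fields:
        String[] byComma = row.split(",");
        System.out.println(byComma.length);                // prints 9
        System.out.println(Integer.parseInt(byComma[0]));  // prints 0
    }
}
```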

Thank you in advance

