spot-user mailing list archives

From Christos Mathas <mathas.c...@gmail.com>
Subject Re: ml_ops.sh fails with NumberFormatException when reading flow_scores.csv
Date Tue, 23 Jan 2018 08:35:07 GMT
I accidentally replied only to Curtis and not to the list, so I'm 
replying to the list to report that the problem has been resolved:

I have an older version of Apache Spot, so the code I'm running differs 
a lot from the code on GitHub. I made changes to FlowFeedback.scala and 
was able to parse the file correctly. Thank you for your time.


On 01/23/2018 06:34 AM, Ricardo Barona wrote:
> Hi Christos,
>
> Curtis is absolutely right, what you need to pass is feedback. This is 
> the only part of the process that is tightly coupled to spot-oa. After 
> scoring with spot-ml, spot-oa shows the top N connections least likely 
> to occur; security experts then determine whether each one is actually 
> an attack or a false positive. That feedback is then saved in the 
> location mentioned by Curtis.
>
> I can share the fields and format of a feedback file if you just want 
> to “recreate” the flow.
>
> Let me know.
>
> On Mon, Jan 22, 2018 at 8:32 AM Curtis Howard <curtis@cloudera.com> wrote:
>
>     Hi Christos,
>
>     Your application seems to be using netflow /results/ rather than a
>     /feedback/ file.  As you mention, the feedback file uses a "\t"
>     delimiter, and the following schema:
>     https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala#L62
>
>     By default, ml_ops.sh looks for the feedback file at the following
>     HDFS path ($HPATH defined in /etc/spot.conf):
>     ${HPATH}/feedback/ml_feedback.csv
>     relevant code:
>     https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/ml_ops.sh#L97
>
>     In addition to this user mail list, there's also a Spot channel on
>     Slack, which you can use to ask questions:
>     http://slack.apache-spot.io/
>
>     Hope this helps
>
>     Curtis
>
>     On Fri, Jan 19, 2018 at 4:30 AM, Christos Mathas
>     <mathas.ch.m@gmail.com> wrote:
>
>         Hi,
>
>         I'm running ml_ops.sh and I have scored previous results so ml
>         tries to read the data from flow_scores.csv . It fails in
>         stage 2 and the output is this:
>
>
>         [Stage 2:> (0 + 2) / 4]18/01/19 11:13:57 WARN scheduler.TaskSetManager:
>         Lost task 2.0 in stage 2.0 (TID 5, cloudera-host-2.shield.com, executor 1):
>         java.lang.NumberFormatException: For input string:
>         "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123,123,UDP,2,152,0,0,3.0071374283430035E-5,56,,,,,,,,"
>             at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>             at java.lang.Integer.parseInt(Integer.java:492)
>             at java.lang.Integer.parseInt(Integer.java:527)
>             at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>             at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>             at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
>             at org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
>             at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>             at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>             at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>             at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>             at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>             at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>             at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>             at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>             at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>             at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>             at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>             at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>             at org.apache.spark.scheduler.Task.run(Task.scala:89)
>             at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>             at java.lang.Thread.run(Thread.java:745)
>
>         ...
>
>         As you can see, the problem is that it attempts to parse the
>         whole line as a number; the line hasn't been split. My
>         understanding is that the file responsible for parsing the csv
>         is FlowFeedback.scala
>         (https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala).
>         I saw in the code that it splits the data by "\t", so I
>         checked flow_scores.csv and found that it is comma (",")
>         separated, not "\t" separated. I tried replacing "\t" with
>         "," but I got the exact same error. I don't know Scala, so
>         I'm asking for your help as to how I could fix this.
>
>         Thank you in advance
>
>
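
For readers who land on this thread with the same error, here is a minimal sketch of the failure mode described above. It is written in Python rather than Scala and is not the code from FlowFeedback.scala. The helper names (first_field_as_int, try_parse) are made up for illustration, and parsing the first column is an assumption, since the thread does not show which column the real code converts. The sample record is the one quoted in the error message. Splitting it on "\t" finds no tab, so the whole line comes back as a single field and the integer conversion fails, much like Integer.parseInt does in the stack trace.

```python
# Sample record copied from the error message quoted in this thread.
LINE = ("0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,123,123,UDP,"
        "2,152,0,0,3.0071374283430035E-5,56,,,,,,,,")


def first_field_as_int(line, delimiter):
    """Split the line on the delimiter and parse the first field as an int.

    Hypothetical helper: the real parser may convert a different column.
    """
    fields = line.split(delimiter)
    return int(fields[0])


def try_parse(line, delimiter):
    """Return the parsed integer, or None if the conversion fails."""
    try:
        return first_field_as_int(line, delimiter)
    except ValueError:
        # Python's ValueError plays the role of NumberFormatException here.
        return None


# No tab in the line: split() yields one field holding the entire record.
tab_result = try_parse(LINE, "\t")

# Comma delimiter: the first field is the short numeric string "0".
comma_result = try_parse(LINE, ",")

print("tab delimiter   ->", tab_result)
print("comma delimiter ->", comma_result)
```

As Curtis and Ricardo point out, the supported route is to pass a tab-delimited feedback file rather than the comma-separated results file, so changing the delimiter only makes sense if the input really has to be flow_scores.csv.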

