spot-user mailing list archives

From Ricardo Barona <ricardo.a.baron...@gmail.com>
Subject Re: ml_ops.sh fails with NumberFormatException when reading flow_scores.csv
Date Tue, 23 Jan 2018 04:34:21 GMT
Hi Christos,

Curtis is absolutely right: what you need to pass is a feedback file. This is
the only part of the process that is tightly coupled to spot-oa. After scoring
with spot-ml, spot-oa shows the top N connections that are least likely to
occur, and security experts then determine whether each one is an actual attack
or a false positive. That feedback is then saved in the location mentioned by
Curtis.
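
To illustrate why the job fails when it is pointed at flow_scores.csv, here is
a minimal Python sketch. It is not Spot code: the function names are
hypothetical, and it only mimics the general pattern of splitting a line on
"\t" and parsing a field as an integer. The sample line is the one from the
error message, with the mail client's line wrapping removed.

```python
# Illustration only: mimics a split-then-parse pattern, not the actual Spot code.

# The comma-separated results line from the stack trace, joined back into one line.
SAMPLE_RESULTS_LINE = (
    "0,2018-01-18 09:35:42,193.93.167.241,10.101.30.60,"
    "123,123,UDP,2,152,0,0,3.0071374283430035E-5,56,,,,,,,,"
)


def first_field_as_int(line: str, delimiter: str = "\t") -> int:
    """Split a line on the delimiter and parse its first field as an integer."""
    fields = line.split(delimiter)
    # With no tab in the line, fields[0] is the entire line, so int() fails.
    return int(fields[0])


def looks_tab_delimited(line: str) -> bool:
    """Rough heuristic: a tab-delimited line contains at least one tab."""
    return "\t" in line


if __name__ == "__main__":
    print("tab-delimited:", looks_tab_delimited(SAMPLE_RESULTS_LINE))
    try:
        first_field_as_int(SAMPLE_RESULTS_LINE)
    except ValueError as err:
        # Same failure mode as the NumberFormatException in the trace.
        print("parse failed:", err)
```

Because the results line contains no tab, the split returns the whole line as
a single field, and the integer parse fails on it. A quick check like
looks_tab_delimited on the first line of a file is one way to tell whether it
is a results file or a feedback file before running ml_ops.sh.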

I can share the fields and format of a feedback file if you just want to
“recreate” the flow.

Let me know.

On Mon, Jan 22, 2018 at 8:32 AM Curtis Howard <curtis@cloudera.com> wrote:

> Hi Christos,
>
> Your application seems to be using netflow *results* rather than a
> *feedback* file.  As you mention, the feedback file uses a "\t"
> delimiter, and the following schema:
>
> https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala#L62
>
> By default, ml_ops.sh looks for the feedback file at the following HDFS
> path ($HPATH defined in /etc/spot.conf):
> ${HPATH}/feedback/ml_feedback.csv
> relevant code:
> https://github.com/apache/incubator-spot/blob/ab11e8c8a00b137aafff60c85cadc5edb8150020/spot-ml/ml_ops.sh#L97
>
> In addition to this user mailing list, there's also a Spot channel on Slack,
> which you can use to ask questions: http://slack.apache-spot.io/
>
> Hope this helps
>
> Curtis
>
> On Fri, Jan 19, 2018 at 4:30 AM, Christos Mathas <mathas.ch.m@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm running ml_ops.sh and I have previously scored results, so ml tries to
>> read the data from flow_scores.csv. It fails in stage 2 and the output is
>> this:
>>
>>
>> [Stage 2:>                                                          (0 +
>> 2) / 4]18/01/19 11:13:57 WARN scheduler.TaskSetManager: Lost task 2.0 in
>> stage 2.0 (TID 5, cloudera-host-2.shield.com, executor 1):
>> java.lang.NumberFormatException: For input string: "0,2018-01-18 09:35:42,
>> 193.93.167.241
>> ,10.101.30.60,123,123,UDP,2,152,0,0,3.0071374283430035E-5,56,,,,,,,,"
>>     at
>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>     at java.lang.Integer.parseInt(Integer.java:492)
>>     at java.lang.Integer.parseInt(Integer.java:527)
>>     at
>> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>>     at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>>     at
>> org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
>>     at
>> org.apache.spot.netflow.model.FlowFeedback$$anonfun$loadFeedbackDF$2.apply(FlowFeedback.scala:85)
>>     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>     at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>>     at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>>     at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>     at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>     at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
>>     at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> .
>>
>> .
>>
>> .
>>
>> As you can see, the problem is that it attempts to parse the whole line as
>> a number; the line has not been split. My understanding is that the file
>> responsible for parsing the csv is FlowFeedback.scala (
>> https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowFeedback.scala).
>> I saw in the code that it splits the data by "\t", so I checked
>> flow_scores.csv and found that it is comma (",") separated and not "\t".
>> I tried replacing "\t" with "," but I got the exact same error. I don't
>> know Scala programming, so I'm asking for your help as to how I could fix
>> this.
>>
>> Thank you in advance
>>
>>
>
