spot-user mailing list archives

From Christos Mathas <mathas.c...@gmail.com>
Subject Re: Spot-ml parameters configuration
Date Wed, 31 Jan 2018 09:33:01 GMT
Hi,

I tried increasing the number of max iterations in steps. I did it with 
50, 70, 100, 150 and 200. It worked for all numbers except for 200, 
where I got this:

[Stage 37393:==================================================>(198 + 2) / 200]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError

and the whole output of ml_ops.sh:

https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0

I should also note that the last time it failed with the iterations set 
at 200, the error was about a task failing because YARN lost an 
executor, but again due to java.lang.StackOverflowError. The only thing 
I changed between the two executions was setting export 
JAVA_TOOL_OPTIONS="-Xmx16G" in /etc/profile.d/ for everything on the 
system, to check whether the default Java memory value was causing the 
problem.

So still stuck at fixing whatever is causing this. I don't have any more 
leads as to which configuration I should examine. When I do, I will also 
check the topics parameter as you suggested.
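
The topic sweep suggested in the quoted message below (several replicates per topic count, then compare the AUROC values) could be tabulated with something like the following. This is only a sketch: `summarize_sweep` and the input layout are hypothetical, the numbers in `example` are placeholders rather than results, and the Tukey pairwise test is not included.

```python
from statistics import mean, stdev


def summarize_sweep(results):
    """Summarize AUROC replicates per topic count.

    `results` maps a topic count to the list of AUROC values obtained
    from its replicate runs (hypothetical layout, for illustration).
    Returns a list of (topics, replicates, mean, stdev) sorted by topics.
    """
    summary = []
    for topics in sorted(results):
        values = results[topics]
        # stdev needs at least two values; report 0.0 for a single run.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary.append((topics, len(values), mean(values), spread))
    return summary


# Placeholder values only, to show the shape of the input; not measurements.
example = {5: [0.50, 0.60, 0.70], 10: [0.60, 0.70, 0.80]}
for topics, n, avg, spread in summarize_sweep(example):
    print(f"topics={topics:>3}  replicates={n}  mean AUROC={avg:.3f}  stdev={spread:.3f}")
```

The per-topic mean and spread are what would go on the plot (x axis number of topics, y axis AUROC).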

On your other questions:
"How large is your dataset?"
As an example, the dataset I used for the ml_ops runs with the 
different iteration values described above is ~1 MB. Here is a 
screenshot from hdfs:

https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0

"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I launch the attacks in 
real time while the traffic is generated. I do one attack at a time. For 
this particular dataset I used Armitage to do an nmap scan and a Hail 
Mary attack, which is a very "noisy", unsophisticated attack.

"How are you computing AUROC?"
I am computing AUROC with MS Excel, taking as input the first 100 
rows of flow_results.csv. I have uploaded a file to Dropbox to make it 
clearer exactly how I'm doing it.

https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0
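
For comparison with the spreadsheet, the same kind of AUROC calculation can be sketched in a few lines of Python using the rank-sum (Mann-Whitney) formulation. The function name, the 0/1 label encoding, and the convention that a higher score means more attack-like are assumptions for illustration; if lower scores mark the more suspicious flows, negate the scores first.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    scores: one number per row, higher = more attack-like (assumption).
    labels: 1 for rows that belong to the attack, 0 otherwise (assumption).
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

A value of 1.0 means every attack row outranks every normal row, and 0.5 is what random ordering gives.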

Thank you

On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
>
> Hi,
>
> The number of iterations at 100 should be a priority. If you are not 
> able to run them you have a technical problem with your cluster and 
> you should fix it first. Once you are able to run 100 iterations then 
> you should start playing around with the number of topics. Start with 
> 5 topics, run 10 replicates, save the results (AUROC), change to 10 
> topics, do the same, then go to 20, 50, 100, etc. Once you have the 
> results, plot them (x axis number of topics, y axis the AUROC) and you 
> will be able to see the pattern or trend as well as the variation and 
> central tendency for each of the replicates. Visually, it should be 
> clear which number of topics works best. Finally, just to make this 
> statistically sound run a pair-wise comparison (Tukey’s test for 
> example) where the number of topics is your main factor. This 
> statistical analysis is just to prove significance of results. For 
> example, if you visually see that 10 topics is better than 5 but the 
> test says there are no statistical differences then there is no point 
> in running at 10 topics because it is more computationally expensive; you 
> might as well just run it at numTopic = 5.
>
> Other questions to consider:
>
> How large is your dataset?
>
> How many attacks are you injecting?
>
> How are you generating the attacks?
>
> How are you computing the AUROC?
>
> An AUROC of 0.52 basically tells you that you are finding nothing but 
> randomness.
>
> Best,
>
> Gustavo
>
> *From: *Christos Minas Mathas <mathas.ch.m@gmail.com>
> *Reply-To: *"user@spot.incubator.apache.org" 
> <user@spot.incubator.apache.org>
> *Date: *Tuesday, January 23, 2018 at 10:36 AM
> *To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
> *Subject: *Spot-ml parameters configuration
>
> Hi,
>
> I've been evaluating the Netflow component of Spot for quite some time 
> now by using different kinds of attacks and collecting the results. I'm 
> using the default configuration; I haven't changed any of the 
> parameters, and the results I'm getting are not good. I was reading in 
> the users mailing list some responses from Gustavo Lujan Moreno back 
> in June 2017 in which he said about the results they're getting:
>
> "On proxy we are getting > 0.90 on AUC and on net flow >0.99."
>
> http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E
>
> My results in terms of AUROC are more like ~0.52 or worse.
>
> He also gave some tips about configuring the parameters of spot-ml. So 
> I thought I'd try them.
>
> ". . . “--ldamaxiterations 20” is the iteration parameter. You should 
> change that 20 for something higher, at least 100, ideally +200.
> . . .
> If you are not getting good results the number of iterations and 
> topics should be your priority."
>
> http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E
>
> 1. I changed ldamaxiterations to 200 but after running for ~60000 
> stages and 2 and a half hours there wasn't enough RAM in one of the 
> associated VMs and ml_ops exited with a StackOverflowError. So I 
> assigned 32GB of RAM to each of the three associated VMs and this 
> time it stopped at ~20000 stages, again with a StackOverflowError, from 
> another one of the associated VMs. How much RAM would I need for 200 
> iterations and for which services?
>
> 2. Can someone explain how I can properly configure the parameters of 
> spot-ml? For the topic count, for example, how can I calculate an 
> approximate number of topics based on the traffic and the network setup?
>
> If you need further information on my setup or the results I'm getting 
> just let me know.
>
> Thanks in advance
>

