spot-user mailing list archives

From Christos Mathas <>
Subject Re: Spot-ml parameters configuration
Date Wed, 31 Jan 2018 09:33:01 GMT

I tried increasing the number of max iterations in steps. I did it with 
50, 70, 100, 150 and 200. It worked for all numbers except for 200, 
where I got this:

[Stage 37393:==================================================>(198 + 2) / 200]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: 

and the whole output of

I should also note that the last time it failed with the iterations set 
at 200, the error was about a task failing because YARN lost an 
executor, but again due to java.lang.StackOverflowError. The only thing 
I changed between the two executions is that I set export 
JAVA_TOOL_OPTIONS="-Xmx16G" in /etc/profile.d/ for everything in the 
system, to check whether the default Java memory value was causing the 
problem.
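
A note on the setting above: -Xmx raises the JVM heap limit, whereas 
java.lang.StackOverflowError is tied to thread stack depth, which is 
controlled by -Xss. A minimal sketch, assuming the job is launched 
through spark-submit: the helper below (stack_size_confs is a 
hypothetical name, and 64m is an arbitrary starting value, not a 
recommendation from this thread) builds the --conf arguments that pass 
-Xss to the driver and executor JVMs. Where these arguments would go in 
the ml_ops launch script is not covered here.

```python
def stack_size_confs(stack_size="64m"):
    """Build spark-submit --conf arguments that set the JVM thread stack size.

    stack_size is an illustrative value (assumption); tune it for your cluster.
    """
    opt = "-Xss" + stack_size
    return [
        "--conf", "spark.driver.extraJavaOptions=" + opt,
        "--conf", "spark.executor.extraJavaOptions=" + opt,
    ]


if __name__ == "__main__":
    # Print the arguments so they can be appended to a spark-submit command line.
    print(" ".join(stack_size_confs()))
```

This only addresses stack depth. It may not help if the overflow comes 
from a very long lineage built up over many iterations, in which case 
checkpointing is the usual thing to look at.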

So I am still stuck on fixing whatever is causing this. I don't have any 
more leads as to which configuration I should examine. Once I do, I will 
also check the topics parameter as you suggested.
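
The topic sweep suggested in the quoted reply below (several topic 
counts, 10 replicates each, then compare the AUROC values) could be 
organised roughly like this. It is only a sketch: run_and_score is a 
hypothetical callable that would run ml_ops with a given topic count and 
return the AUROC for that run, and the function names are mine.

```python
import statistics


def sweep_topics(run_and_score, topic_counts=(5, 10, 20, 50, 100), replicates=10):
    """Collect one AUROC value per (topic count, replicate).

    run_and_score(num_topics, replicate) is a hypothetical hook (assumption):
    it should run the pipeline once and return the resulting AUROC.
    """
    results = {}
    for num_topics in topic_counts:
        results[num_topics] = [run_and_score(num_topics, r) for r in range(replicates)]
    return results


def summarize(results):
    """Return {topic count: (mean AUROC, sample standard deviation)}."""
    summary = {}
    for num_topics, values in results.items():
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[num_topics] = (statistics.mean(values), spread)
    return summary
```

The per-topic lists in results are what would go into the plot (x axis 
number of topics, y axis AUROC) and into a pair-wise comparison such as 
Tukey's test.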

On your other questions:
"How large is your dataset?"
As an example, I will give the dataset I used for the executions of 
ml_ops with the different iteration values described above, which is 
~1 MB. Here is a screenshot from HDFS:

"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I deploy the attacks in 
real time while the traffic is generated. I do one attack at a time. For 
this particular dataset I used Armitage to do an nmap scan and a Hail 
Mary attack, which is a really "noisy" and unsophisticated attack.

"How are you computing AUROC?"
I am computing AUROC with MS Excel, taking as input the first 100 
rows of flow_results.csv. I have uploaded a file to Dropbox to make it 
clearer exactly how I'm doing it.
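
For cross-checking the spreadsheet calculation, here is a minimal sketch 
of AUROC computed from ranks (the Mann-Whitney U formulation). It 
assumes each row has a 0/1 label (1 = traffic from an injected attack) 
and a score where higher means more suspicious. Both are assumptions 
about how the rows are labelled, not the actual flow_results.csv layout. 
If lower scores are the suspicious ones, negate them first.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formula, with average ranks for ties.

    labels: iterable of 0/1 (1 = attack row, an assumed labelling).
    scores: iterable of numbers, higher = more suspicious (assumption).
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, giving tied scores the average of their positions.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative row")

    rank_sum_pos = sum(rank for rank, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)
```

For example, auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) returns 0.75, 
and a set of rows that all share the same score gives 0.5.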

Thank you

On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
> Hi,
> The number of iterations at 100 should be a priority. If you are not 
> able to run them you have a technical problem with your cluster and 
> you should fix it first. Once you are able to run 100 iterations then 
> you should start playing around with the number of topics. Start with 
> 5 topics, run 10 replicates, save the results (AUROC), change to 10 
> topics, do the same, then go to 20, 50, 100, etc. Once you have the 
> results, plot them (x axis number of topics, y axis the AUROC) and you 
> will be able to see the pattern or trend as well as the variation and 
> central tendency for each of the replicates. Visually, it should be 
> clear which number of topics works best. Finally, just to make this 
> statistically sound run a pair-wise comparison (Tukey’s test for 
> example) where the number of topics is your main factor. This 
> statistical analysis is just to prove significance of results. For 
> example, if you visually see that 10 topics is better than 5 but the 
> test says there are no statistical differences then there is no point 
> in running at 10 topics because it is more computationally expensive; 
> you might as well just run it at numTopic = 5.
> Other questions to consider:
> How large is your dataset?
> How many attacks are you injecting?
> How are you generating the attacks?
> How are you computing the AUROC?
> An AUROC of 0.52 basically tells you that you are finding nothing but 
> randomness.
> Best,
> Gustavo
> *From: *Christos Minas Mathas <>
> *Date: *Tuesday, January 23, 2018 at 10:36 AM
> *Subject: *Spot-ml parameters configuration
> Hi,
> I've been evaluating the Netflow component of Spot for quite some time 
> now by using different kinds of attacks and collecting the results. I'm 
> using the default configuration; I haven't changed any of the 
> parameters, and the results I'm getting are not good. I was reading in 
> the users mailing list some responses from Gustavo Lujan Moreno back 
> in June 2017 in which he said about the results they're getting:
> "On proxy we are getting > 0.90 on AUC and on net flow > 0.99."
> My results in terms of AUROC are more like ~0.52 or worse.
> He also gave some tips about configuring the parameters of spot-ml. So 
> I thought I'd try them.
> ". . . "--ldamaxiterations 20" is the iteration parameter. You should 
> change that 20 for something higher, at least 100, ideally +200.
> . . .
> If you are not getting good results the number of iterations and 
> topics should be your priority."
> 1. I changed ldamaxiterations to 200 but after running for ~60000 
> stages and two and a half hours there wasn't enough RAM in one of the 
> associated VMs and ml_ops exited with a StackOverflowError. So I 
> assigned 32GB of RAM to each of the three associated VMs and this 
> time it stopped at ~20000 stages, again with a StackOverflowError, from 
> another one of the associated VMs. How much RAM would I need for 200 
> iterations and for which services?
> 2. Can someone explain how I can properly configure the parameters of 
> spot-ml? For the topic count, for example, how can I calculate an 
> approximate number of topics based on the traffic and the network setup?
> If you need further information on my setup or the results I'm getting 
> just let me know.
> Thanks in advance
