Hi,
Getting the number of iterations to 100 should be your priority. If you are not able to run
that many, you have a technical problem with your cluster and you should fix it first. Once
you can run 100 iterations, start experimenting with the number of topics. Start
with 5 topics, run 10 replicates, save the results (AUROC), change to 10 topics, do the same,
then go to 20, 50, 100, etc. Once you have the results, plot them (x axis: number of topics,
y axis: AUROC) and you will be able to see the trend as well as the variation and central
tendency across the replicates at each number of topics. Visually, it should be clear which number
of topics works best. Finally, to make this statistically sound, run a pairwise comparison
(Tukey’s test, for example) where the number of topics is your main factor. This statistical
analysis is just to establish the significance of the results. For example, if you visually see
that 10 topics is better than 5 but the test says there is no statistically significant
difference, then there is no point in running at 10 topics because it is more computationally
expensive; you might as well just run it at numTopic = 5.
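The procedure above can be sketched roughly as follows. This is only an illustration, not part of Spot: `run_replicate` is a hypothetical placeholder that you would replace with a real spot-ml run plus scoring, and here it just returns synthetic numbers so the script runs end to end. The Tukey step assumes SciPy 1.8 or newer is installed and is skipped otherwise.

```python
import random
import statistics

try:
    # Optional dependency: tukey_hsd is available in SciPy >= 1.8.
    from scipy.stats import tukey_hsd
except ImportError:
    tukey_hsd = None

TOPIC_COUNTS = [5, 10, 20, 50, 100]  # values suggested in this thread
N_REPLICATES = 10


def run_replicate(num_topics, seed):
    """HYPOTHETICAL placeholder: run spot-ml once and return the AUROC.

    Replace the body with a real run at `num_topics` topics followed by
    your AUROC computation. The value returned here is synthetic noise
    around 0.5 and carries no meaning.
    """
    rng = random.Random(seed * 1000 + num_topics)
    return min(1.0, max(0.0, rng.gauss(0.5, 0.02)))


def sweep(topic_counts=TOPIC_COUNTS, n_replicates=N_REPLICATES):
    """Collect one list of AUROC values per topic count."""
    return {
        k: [run_replicate(k, seed) for seed in range(n_replicates)]
        for k in topic_counts
    }


def summarize(results):
    """Central tendency and spread of the replicates at each topic count."""
    return {
        k: (statistics.mean(v), statistics.stdev(v))
        for k, v in results.items()
    }


results = sweep()
for k, (mean, sd) in summarize(results).items():
    print(f"numTopics={k:>3}  mean AUROC={mean:.3f}  sd={sd:.3f}")

if tukey_hsd is not None:
    # Pairwise comparison with the number of topics as the only factor.
    res = tukey_hsd(*results.values())
    ks = list(results)
    for i in range(len(ks)):
        for j in range(i + 1, len(ks)):
            print(f"{ks[i]} vs {ks[j]} topics: p={res.pvalue[i, j]:.3f}")
```

The `results` dictionary (one list of AUROC values per topic count) is also the natural input for a box plot with the number of topics on the x axis and AUROC on the y axis.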
Other questions to consider:
How large is your dataset?
How many attacks are you injecting?
How are you generating the attacks?
How are you computing the AUROC?
An AUROC of 0.52 basically tells you that you are finding nothing but randomness.
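On the last question, here is a minimal sketch of one common way to compute the AUROC, in case it helps to compare against what you are doing. It is not the code we use; the function name and the label convention (1 = injected attack, 0 = normal record) are assumptions for illustration. It also assumes that a higher score means more suspicious, so if your pipeline ranks lower scores as more suspicious, negate the scores before calling it.

```python
def auroc(scores, labels):
    """AUROC as a pairwise ranking statistic.

    Returns the fraction of (attack, normal) pairs in which the attack
    has the higher score, counting ties as half. Assumes labels are
    1 for injected attacks and 0 for normal records, and that a higher
    score means more suspicious.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    normals = [s for s, y in zip(scores, labels) if y == 0]
    if not attacks or not normals:
        raise ValueError("need at least one attack and one normal record")

    wins = 0.0
    for a in attacks:
        for n in normals:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attacks) * len(normals))


# Attacks always ranked above normal records: perfect separation.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
# Scores that carry no information (all tied): same as random guessing.
print(auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.5
```

The double loop is quadratic in the number of records, so it is only meant for checking a sample by hand; a library routine is the better choice for full result sets.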
Best,
Gustavo
From: Christos Minas Mathas <mathas.ch.m@gmail.com>
ReplyTo: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Date: Tuesday, January 23, 2018 at 10:36 AM
To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Subject: Spotml parameters configuration
Hi,
I've been evaluating the Netflow component of Spot for quite some time now by using different
kinds of attacks and collecting the results. I'm using the default configuration (I haven't
changed any of the parameters) and the results I'm getting are not good. I was reading in the
users mailing list some responses from Gustavo Lujan Moreno back in June 2017, in which he said
this about the results they're getting:
"On proxy we are getting > 0.90 on AUC and on net flow >0.99."
http://mailarchives.apache.org/mod_mbox/spotuser/201706.mbox/%3C1DD58ED7BEE547E68886537EE480E3E1%40intel.com%3E
My results in terms of AUROC are more like ~0.52 or worse.
He also gave some tips about configuring the parameters of spotml. So I thought I'd try them.
". . ."ldamaxiterations 20” is the iteration parameter. You should change that 20 for
something higher, at least 100, ideally +200.
. . .
If you are not getting good results the number of iterations and topics should be your priority."
http://mailarchives.apache.org/mod_mbox/spotuser/201706.mbox/%3C4F588C3DB453466FBBCBF7F1ABE7CC8D%40intel.com%3E
1. I changed ldamaxiterations to 200 but after running for ~60000 stages and 2 and a half
hours there wasn't enough RAM in one of the associated VMs and ml_ops exited with a StackOverflowException.
So I assigned 32GB of RAM to each of the three associated VMs, and this time it stopped
at ~20000 stages again with a StackOverflow from another one of the associated VMs. How much
RAM would I need for 200 iterations and for which services?
2. Can someone explain how I can properly configure the parameters of spotml? For the
topic count, for example, how can I calculate an approximate number of topics based on the
traffic and the network setup?
If you need further information on my setup or the results I'm getting just let me know.
Thanks in advance
