Hi,
I tried increasing the maximum number of iterations in steps: 50, 70,
100, 150 and 200. Every run completed except the one at 200, which
failed with this:
[Stage 37393:==================================================>(198 +
2) / 200]Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task serialization failed:
java.lang.StackOverflowError
The full output of ml_ops.sh is here:
https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0
I should also note that the previous time it failed with the iterations
set at 200, the error was about a task failing because YARN lost an
executor, but again due to java.lang.StackOverflowError. The only thing
I changed between the two executions is that I set export
JAVA_TOOL_OPTIONS="-Xmx16G" in /etc/profile.d/, for everything in the
system, to check whether the default Java memory value was causing the
problem.
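A note on that setting: -Xmx only raises the maximum heap size, while a
java.lang.StackOverflowError is raised when a thread runs out of stack,
which the JVM sizes with -Xss. Errors like this during task
serialization in long iterative Spark jobs are also often tied to a
growing lineage, so stack size may not be the whole story. Below is a
minimal sketch, not a confirmed fix, of how larger thread stacks could
be requested for the driver and executors through the standard
spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
properties. The helper name stack_size_conf_args and the 16m value are
assumptions for illustration, and how ml_ops.sh passes extra arguments
to spark-submit is not covered here.

```python
def stack_size_conf_args(stack_size="16m"):
    """Build spark-submit --conf arguments that raise the JVM thread stack size.

    stack_size is an illustrative value (assumption), not a tuned recommendation.
    """
    java_opt = "-Xss" + stack_size
    return [
        "--conf", "spark.driver.extraJavaOptions=" + java_opt,
        "--conf", "spark.executor.extraJavaOptions=" + java_opt,
    ]


if __name__ == "__main__":
    # Print the arguments as they would appear on a spark-submit command line.
    print(" ".join(stack_size_conf_args()))
```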
So I am still stuck on finding whatever is causing this, and I have no
more leads as to which configuration I should examine. Once it is fixed,
I will also check the topics parameter as you suggested.
On your other questions:
"How large is your dataset?"
As an example, the dataset I used for the ml_ops runs with the different
iteration values described above is ~1 MB. Here is a screenshot from
HDFS:
https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0
"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I launch the attacks in
real time while the traffic is generated. I do one attack at a time. For
this particular dataset I used Armitage to do an nmap scan and a Hail
Mary attack, which is a very "noisy" and unsophisticated attack.
"How are you computing AUROC?"
I am computing AUROC with MS Excel, taking as input the first 100 rows
of flow_results.csv. I have uploaded a file to Dropbox to make it
clearer exactly how I'm doing it:
https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0
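For anyone who wants to reproduce the calculation outside Excel, here is
a minimal Python sketch of the rank-based AUROC definition: the fraction
of (attack, normal) pairs in which the attack row gets the higher
suspicion value, with ties counted as one half. The function name auroc
and the convention that a higher value means more suspicious are
assumptions for illustration; if the most suspicious rows carry the
lowest score, the scores would need to be negated first. An AUROC taken
over only the first 100 rows measures the ordering within those rows,
not over the whole dataset.

```python
def auroc(labels, scores):
    """Return the AUROC for binary labels (1 = attack, 0 = normal).

    Computed as the share of (attack, normal) pairs where the attack row
    has the higher score; tied pairs count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one attack row and one normal row")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values for illustration only, not taken from flow_results.csv.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.2, 0.1]
    print(auroc(toy_labels, toy_scores))
```

The pairwise loop is quadratic, which is fine for a few hundred rows;
for larger inputs a sort-based rank computation would be preferable.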
Thank you
On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
>
> Hi,
>
> The number of iterations at 100 should be a priority. If you are not
> able to run them you have a technical problem with your cluster and
> you should fix it first. Once you are able to run 100 iterations then
> you should start playing around with the number of topics. Start with
> 5 topics, run 10 replicates, save the results (AUROC), change to 10
> topics, do the same, then go to 20, 50, 100, etc. Once you have the
> results, plot them (x axis number of topics, y axis the AUROC) and you
> will be able to see the pattern or trend as well as the variation and
> central tendency for each of the replicates. Visually, it should be
> clear which number of topics works best. Finally, just to make this
> statistically sound run a pairwise comparison (Tukey’s test for
> example) where the number of topics is your main factor. This
> statistical analysis is just to prove significance of results. For
> example, if you visually see that 10 topics is better than 5 but the
> test says there are no statistical differences then there is no point
> in running at 10 topics because it is more computationally expensive;
> you might as well just run it at numTopic = 5.
>
> Other questions to consider:
>
> How large is your dataset?
>
> How many attacks are you injecting?
>
> How are you generating the attacks?
>
> How are you computing the AUROC?
>
> An AUROC of 0.52 basically tells you that you are finding nothing but
> randomness.
>
> Best,
>
> Gustavo
>
> *From: *Christos Minas Mathas <mathas.ch.m@gmail.com>
> *ReplyTo: *"user@spot.incubator.apache.org"
> <user@spot.incubator.apache.org>
> *Date: *Tuesday, January 23, 2018 at 10:36 AM
> *To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
> *Subject: *Spotml parameters configuration
>
> Hi,
>
> I've been evaluating the Netflow component of Spot for quite some time
> now by using different kinds of attacks and collecting the results. I'm
> using the default configuration; I haven't changed any of the
> parameters, and the results I'm getting are not good. I was reading in
> the users mailing list some responses from Gustavo Lujan Moreno back
> in June 2017 in which he said about the results they're getting:
> "On proxy we are getting > 0.90 on AUC and on net flow >0.99."
>
> http://mailarchives.apache.org/mod_mbox/spotuser/201706.mbox/%3C1DD58ED7BEE547E68886537EE480E3E1%40intel.com%3E
>
> My results in terms of AUROC are more like ~0.52 or worse.
>
> He also gave some tips about configuring the parameters of spotml. So
> I thought I'd try them.
>
> ". . . "ldamaxiterations 20" is the iteration parameter. You should
> change that 20 for something higher, at least 100, ideally +200.
> . . .
> If you are not getting good results the number of iterations and
> topics should be your priority."
>
> http://mailarchives.apache.org/mod_mbox/spotuser/201706.mbox/%3C4F588C3DB453466FBBCBF7F1ABE7CC8D%40intel.com%3E
>
> 1. I changed ldamaxiterations to 200 but after running for ~60000
> stages and two and a half hours there wasn't enough RAM in one of the
> associated VMs and ml_ops exited with a StackOverflowError. So I
> assigned 32GB of RAM to each of the three associated VMs and this
> time it stopped at ~20000 stages, again with a StackOverflowError from
> another one of the associated VMs. How much RAM would I need for 200
> iterations and for which services?
>
> 2. Can someone explain how I can properly configure the parameters of
> spotml? For the topic count, for example, how can I calculate an
> approximate value based on the traffic and the network setup?
>
> If you need further information on my setup or the results I'm getting
> just let me know.
>
> Thanks in advance
>
