spot-user mailing list archives

From Christos Minas Mathas
Subject Re: Spot-ml parameters configuration
Date Tue, 23 Jan 2018 17:40:52 GMT
Hi Ricardo,

First of all, thank you for your answer. What you said on the first topic
reminded me of something important that I forgot to mention:

I used the exact same Netflow data for both executions of ml_ops. The
only thing I changed was the amount of RAM, which I increased as
described. I monitored the VMs with htop during both executions, and
the CPU/RAM behavior was completely different between them, which is
already evident from the fact that the first one failed at ~60000
stages and the second one, with more RAM, failed at ~20000 stages.

As you say, for now all I can do is keep playing with the parameters and
see what happens. However, the reason I'm writing to the list is that I'm
currently writing my undergraduate thesis on the evaluation of Apache
Spot using penetration testing techniques, and I'm trying to get a
better handle on how it works and hopefully make it give me some better 

On 01/23/2018 07:12 PM, Ricardo Barona wrote:
> Hi Christos,
> Here are my thoughts about your questions.
> 1. From my experience working with memory in Apache Spark, and
> therefore in Apache Spot, you need to know how your data is
> distributed, the size of your files and the number of files you are
> trying to process. An out-of-memory error can have many causes,
> but one of the most common is the number of topics.
> When using the default of 20 topics with NetFlow data, each row in
> your data set gets 4 vectors of 20 Doubles each. One vector is
> for the 20 probabilities over topics for the source IP, another is
> for the 20 probabilities over topics for the destination IP, one more
> is for the source word and another for the destination word. If you are
> running DNS or Proxy, the payload is half as large, because we only
> analyse one IP and one word (the word is composed from information in
> the same row).
> Given that, you need to make sure your executors can fit that amount of
> data, or reduce the number of topics. There is no concrete answer
> (sadly) other than to experiment until it works.
> 2. The number of topics is a discussion that often merits its
> own papers and forum threads, but from what I've seen in the
> past, the number of topics in Apache Spot can be as
> small as 5 and the results will be similar to what you get
> with 20. I don't have the documents to back that up, but you can try
> it yourself and see whether what I'm saying is correct. Again, the
> number of topics, like hyperparameter tuning in general, is something
> you need to experiment with before you arrive at the best solution.
> Sorry I can't provide more information; as a software engineer, that's
> as much as I can take from my past conversations with data scientists
> (like Gustavo). If you put together a list of the parameters you want
> more information on, please reply with it and I'll give you my input
> from what I have seen.
> Thanks!
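
To put rough numbers on the four-vectors-per-row point above, here is a
back-of-the-envelope sketch. The function name and the row count are made
up for illustration. It only counts the raw 8-byte doubles in the topic
vectors and ignores JVM object overhead, shuffle buffers and everything
else Spark keeps in memory, so real executor usage will be higher.

```python
BYTES_PER_DOUBLE = 8  # size of one double-precision value


def topic_payload_bytes(rows, topics=20, vectors_per_row=4):
    """Raw size of the per-row topic probability vectors.

    vectors_per_row=4 matches the NetFlow case described above (source IP,
    destination IP, source word, destination word); use 2 for DNS or Proxy.
    """
    return rows * vectors_per_row * topics * BYTES_PER_DOUBLE


rows = 10_000_000  # hypothetical data set size, not taken from my setup
for topics in (20, 5):
    netflow = topic_payload_bytes(rows, topics, vectors_per_row=4)
    dns_or_proxy = topic_payload_bytes(rows, topics, vectors_per_row=2)
    print(f"{topics} topics: netflow {netflow / 1e9:.2f} GB, "
          f"dns/proxy {dns_or_proxy / 1e9:.2f} GB")
```

With 20 topics that is 640 bytes per NetFlow row, so going from 20 to 5
topics cuts this part of the payload to a quarter.
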
> On Tue, Jan 23, 2018 at 10:35 AM, Christos Minas Mathas wrote:
>     Hi,
>     I've been evaluating the Netflow component of Spot for quite some
>     time now by running different kinds of attacks and collecting the
>     results. I'm using the default configuration, I haven't changed
>     any of the parameters, and the results I'm getting are not good. I
>     was reading some responses on the users mailing list from Gustavo
>     Lujan Moreno back in June 2017, in which he said this about the
>     results they were getting:
>     "On proxy we are getting > 0.90 on AUC and on net flow >0.99."
>     My results in terms of AUROC are more like ~0.52 or worse.
>     He also gave some tips about configuring the parameters of
>     spot-ml. So I thought I'd try them.
>     ". . . "--ldamaxiterations 20" is the iteration parameter. You
>     should change that 20 for something higher, at least 100, ideally
>     +200.
>     . . .
>     If you are not getting good results the number of iterations and
>     topics should be your priority."
>     1. I changed ldamaxiterations to 200, but after running for ~60000
>     stages and two and a half hours, one of the associated VMs ran out
>     of RAM and ml_ops exited with a StackOverflowException. So I
>     assigned 32GB of RAM to each of the three associated VMs, and this
>     time it stopped at ~20000 stages, again with a StackOverflow, from
>     a different VM.
>     How much RAM would I need for 200 iterations, and for which services?
>     2. Can someone explain how I can properly configure the parameters
>     of spot-ml? For the topic count, for example, how can I
>     calculate an approximate number of topics based on the traffic and
>     the network setup?
>     If you need further information on my setup or the results I'm
>     getting just let me know.
>     Thanks in advance
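
For anyone comparing against the AUROC figures quoted above, this is a
minimal sketch of how the metric can be computed from labelled scores. It
is not Spot code. The function name and the sample data are made up, and
it assumes a higher score means "more suspicious" (if your scores run the
other way, negate them first). The pairwise loop is quadratic, so it is
only meant for small samples.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (attack, benign) pairs in which the attack row
    has the higher score; ties count as half. labels: 1 = attack, 0 = benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one attack row and one benign row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: both attack rows outrank both benign rows.
perfect = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Made-up example: identical scores carry no information.
uninformative = auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
print(perfect, uninformative)
```

A ranking that carries no information scores 0.5, which is why a result
around 0.52 is barely better than guessing.
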
