spot-user mailing list archives

From: Ricardo Barona
Subject: Re: Spot-ml parameters configuration
Date: Tue, 23 Jan 2018 17:12:27 GMT

Hi Christos,

Here are my thoughts about your questions.

1. In my experience with memory in Apache Spark, and therefore in
Apache Spot, you need to know how your data is distributed, the size of
your files and the number of files you are trying to process. An
out-of-memory error can have many causes, but one of the most
common is the number of topics.
With the default of 20 topics and NetFlow data, each row in your data set
gets 4 additional vectors of 20 Doubles each. One vector holds the 20
topic probabilities for the source IP, another holds the 20 topic
probabilities for the destination IP, one more is for the source word and
another is for the destination word. If you are running DNS or Proxy, the
payload is half of that, as we only analyse one IP and one word (the word
is composed from information in the same row).
Given that, you need to make sure your executors can fit that amount of
data, or reduce the number of topics. There is (sadly) no concrete answer
other than to experiment until it works.
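
As a rough illustration of the arithmetic above, here is a minimal Python
sketch that estimates the raw size of those topic vectors. It is not part
of spot-ml: the function name, the data type labels and the row count are
made up for the example, and it assumes 8 bytes per Double. It counts only
the raw doubles, so it ignores JVM object overhead and whatever else Spark
keeps in memory; treat the result as a lower bound.

```python
BYTES_PER_DOUBLE = 8  # assumption: one Double is a 64-bit float

# Vectors added per row, as described above: NetFlow adds four
# (source IP, destination IP, source word, destination word);
# DNS and Proxy add two (one IP, one word).
VECTORS_PER_ROW = {"flow": 4, "dns": 2, "proxy": 2}


def topic_payload_bytes(rows, topics=20, data_type="flow"):
    """Raw bytes of the topic-probability vectors added to `rows` rows."""
    return rows * VECTORS_PER_ROW[data_type] * topics * BYTES_PER_DOUBLE


if __name__ == "__main__":
    rows = 100_000_000  # hypothetical row count, for illustration only
    for topics in (20, 10, 5):
        gib = topic_payload_bytes(rows, topics) / 2**30
        print(f"{topics} topics: {gib:.1f} GiB of raw doubles")
```

The estimate scales linearly with the number of topics, so going from 20
topics to 5 cuts this part of the payload to a quarter.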

2. The number of topics is another discussion, one that is often worth its
own papers and threads in many forums, but from what I've seen in the past
the number of topics in Apache Spot can be as small as 5 and the
results are going to be similar to what you get with 20. I don't have the
documents to back that up, but you can try it yourself and see if what I'm
saying is correct. Again, the number of topics, like hyperparameter
tuning, is something you need to experiment with before you find the best
solution.

Sorry I can't provide more information; as a Software Engineer, that's as
much as I can recall from my past conversations with Data Scientists (like
Gustavo). If you put together a list of the parameters you want more
information on, please reply with it and I'll give you my input from what
I have seen.


On Tue, Jan 23, 2018 at 10:35 AM, Christos Minas Mathas wrote:

> Hi,
> I've been evaluating the Netflow component of Spot for quite some time now
> by using different kinds of attacks and collecting the results. I'm using the
> default configuration, I haven't changed any of the parameters and the
> results I'm getting are not good. I was reading in the users mailing list
> some responses from Gustavo Lujan Moreno back in June 2017 in which he said
> about the results they're getting:
> *"On proxy we are getting > 0.90 on AUC and on net flow >0.99."*
> mbox/
> My results in terms of AUROC are more like ~0.52 or worse.
> He also gave some tips about configuring the parameters of spot-ml. So I
> thought I'd try them.
> "*. . ."--ldamaxiterations 20” is the iteration parameter. You should
> change that 20 for something higher, at least 100, ideally +200.*
> *. . .*
> *If you are not getting good results the number of iterations and topics
> should be your priority.*"
> mbox/
> 1. I changed ldamaxiterations to 200 but after running for ~60000 stages
> and 2 and a half hours there wasn't enough RAM in one of the associated VMs
> and ml_ops exited with a StackOverflowException. So I assigned 32GB of RAM
> to each one of the three VMs associated and this time it stopped at ~20000
> stages again with a StackOverflow from another one of the associated VMs.
> How much RAM would I need for 200 iterations and for which services?
> 2. Can someone explain how I can properly configure the parameters of
> spot-ml? Like for the topic count for example, how can I calculate an
> approximate value of topics based on the traffic and the network setup?
> If you need further information on my setup or the results I'm getting
> just let me know.
> Thanks in advance
