spot-user mailing list archives

From Ricardo Barona <>
Subject Re: Spot-ml parameters configuration
Date Tue, 23 Jan 2018 17:46:51 GMT
Got it. Let's see what is in your configuration file and how your cluster
is configured, so you can get the most out of it.

On Tue, Jan 23, 2018 at 11:40 AM, Christos Minas Mathas <> wrote:

> Hi Ricardo,
> first of all thank you for your answer. What you said on the first topic
> reminded me of something really important I forgot to mention:
> I used the exact same Netflow data for the two executions of ml_ops. The
> only thing I changed was that I increased the amount of RAM as described. I
> was monitoring the VMs with htop during both executions, and the CPU/RAM
> behavior was totally different between the two, which is already evident
> from the fact that the first one failed at ~60000 stages and the second
> one, with more RAM, failed at ~20000 stages.
> As you say for now all I can do is keep playing with the parameters and
> see what happens. However, the reason I'm writing to the list is I'm
> currently writing my undergraduate thesis on the evaluation of Apache Spot
> by using penetration testing techniques, and I'm trying to get a better
> handle on how it works and hopefully make it give me some better results.
> On 01/23/2018 07:12 PM, Ricardo Barona wrote:
> Hi Christos,
> Here are my thoughts about your questions.
> 1. From my experience working with memory in Apache Spark, and therefore
> in Apache Spot, you need to know how your data is distributed, the size of
> your files, and the number of files you are trying to process. An
> out-of-memory error can have many causes, but one of the most common is
> the number of topics.
> With the default of 20 topics, each row of NetFlow data gets 4 additional
> vectors of 20 Doubles each: one for the 20 topic probabilities of the
> source IP, one for the 20 topic probabilities of the destination IP, one
> for the source word, and one for the destination word. If you are running
> DNS or Proxy, the payload is half that, as we only analyse one IP and one
> word (the word is composed from information in the same row).
> Given that, you need to make sure your executors can fit that amount of
> data, or reduce the number of topics. There is no concrete answer (sadly)
> other than to experiment until it works.
> 2. The number of topics is another discussion, one that is often worthy of
> its own papers and forum threads, but from what I've seen in the past the
> number of topics in Apache Spot can be as small as 5 and the results are
> going to be similar to what you get with 20. I don't have the documents to
> back that up, but you can try it yourself and see if what I'm saying is
> correct. Again, the number of topics, like hyperparameter tuning in
> general, is something you need to play with before you get the best solution.
> Sorry I can't provide more information; as a Software Engineer, that's as
> much as I can take from my past conversations with data scientists (like
> Gustavo). If you create a list of the parameters you want more information
> on, please reply with it and I'll give you my input from what I have seen.
> Thanks!
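
The per-row payload described in point 1 above can be put into rough numbers. The following is only a back-of-the-envelope sketch: it counts raw 8-byte doubles and ignores JVM object headers, Spark serialization, shuffles, and everything else spot-ml holds in memory, so real usage will be higher. The function name and the row count are hypothetical and chosen purely for illustration.

```python
BYTES_PER_DOUBLE = 8  # a Double is 64 bits; object and array overhead is ignored


def topic_payload_bytes(rows, topics=20, vectors_per_row=4):
    """Raw size of the topic-probability vectors added across `rows` rows.

    vectors_per_row is 4 for NetFlow (source IP, destination IP, source word,
    destination word) and 2 for DNS or Proxy (one IP, one word).
    """
    return rows * vectors_per_row * topics * BYTES_PER_DOUBLE


rows = 10_000_000  # hypothetical row count, not taken from this thread

netflow_20 = topic_payload_bytes(rows, topics=20, vectors_per_row=4)
netflow_5 = topic_payload_bytes(rows, topics=5, vectors_per_row=4)
dns_20 = topic_payload_bytes(rows, topics=20, vectors_per_row=2)

for label, size in [
    ("NetFlow, 20 topics", netflow_20),
    ("NetFlow, 5 topics", netflow_5),
    ("DNS/Proxy, 20 topics", dns_20),
]:
    print(f"{label}: {size / 1024**3:.2f} GiB of raw doubles")
```

The estimate scales linearly in both the row count and the topic count, which is why reducing the number of topics is the most direct lever when executors run out of memory.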
> On Tue, Jan 23, 2018 at 10:35 AM, Christos Minas Mathas <> wrote:
>> Hi,
>> I've been evaluating the Netflow component of Spot for quite some time
>> now, using different kinds of attacks and collecting the results. I'm using
>> the default configuration; I haven't changed any of the parameters, and the
>> results I'm getting are not good. I was reading in the users mailing list
>> some responses from Gustavo Lujan Moreno back in June 2017, in which he said
>> about the results they're getting:
>> *"On proxy we are getting > 0.90 on AUC and on net flow >0.99."*
>> ox/
>> My results in terms of AUROC are more like ~0.52 or worse.
>> He also gave some tips about configuring the parameters of spot-ml. So I
>> thought I'd try them.
>> "*. . ."--ldamaxiterations 20” is the iteration parameter. You should
>> change that 20 for something higher, at least 100, ideally +200.*
>> *. . .*
>> *If you are not getting good results the number of iterations and topics
>> should be your priority.*"
>> ox/
>> 1. I changed ldamaxiterations to 200 but after running for ~60000 stages
>> and 2 and a half hours there wasn't enough RAM in one of the associated VMs
>> and ml_ops exited with a StackOverflowException. So I assigned 32GB of RAM
>> to each one of the three VMs associated and this time it stopped at ~20000
>> stages again with a StackOverflow from another one of the associated VMs.
>> How much RAM would I need for 200 iterations and for which services?
>> 2. Can someone explain how I can properly configure the parameters of
>> spot-ml? For the topic count, for example, how can I calculate an
>> approximate number of topics based on the traffic and the network setup?
>> If you need further information on my setup or the results I'm getting
>> just let me know.
>> Thanks in advance
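
For context on the AUROC figures quoted in this message: an AUROC of 0.5 corresponds to a ranking no better than chance, so ~0.52 means the scores barely separate attack rows from benign ones. Below is a minimal sketch of how AUROC can be computed from labelled scores by pairwise comparison. It is not how spot-ml or the quoted authors evaluate results; the function name and the toy data are hypothetical, and it assumes a higher score means "more suspicious". If a detector marks suspicious rows with lower scores instead, negate the scores first.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    labels: 1 for attack rows, 0 for benign rows. Ties count as half a win.
    Assumes higher score = more suspicious (an assumption of this sketch).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not results from Spot.
labels = [1, 1, 0, 0]
perfect = auroc([0.9, 0.8, 0.2, 0.1], labels)       # attacks ranked on top
uninformative = auroc([0.5, 0.5, 0.5, 0.5], labels)  # all scores tied
inverted = auroc([0.1, 0.2, 0.8, 0.9], labels)       # attacks ranked last
```

The pairwise form is quadratic in the number of rows, so it only suits small labelled samples; for larger evaluations a rank-based implementation from an established library is the safer choice.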
