spot-user mailing list archives

From Christos Mathas <mathas.c...@gmail.com>
Subject Re: Spot-ml parameters configuration
Date Wed, 24 Jan 2018 11:29:20 GMT
OK, so I got some screenshots; if any more are needed, just tell me which.

The spot.conf file:

https://www.dropbox.com/s/uglbdqflivmtytk/spot.conf.png?dl=0

The hosts in the cluster and their roles:

https://www.dropbox.com/s/2dtkvpjmz9lyrs5/cloudera%201.png?dl=0

https://www.dropbox.com/s/4kx5u85jignfq8c/cloudera%202.png?dl=0

https://www.dropbox.com/s/xsqrb4gew0ujezl/cloudera%20manager.png?dl=0

I also found some errors that I think occurred after I assigned more
RAM to the hosts, but I can't try to resolve them right now because there
is some maintenance going on these days and I can't access the VMware
ESXi interface. I took screenshots of these too:

https://www.dropbox.com/s/pmwwxtlrcro3jax/history%20server%20error.png?dl=0

https://www.dropbox.com/s/kmnn6oz04f57vr0/memory%20warning.png?dl=0


On 01/23/2018 07:46 PM, Ricardo Barona wrote:
> Got it. Let's see what is in your configuration file and how your
> cluster is configured, so you can get the most out of it.
>
> On Tue, Jan 23, 2018 at 11:40 AM, Christos Minas Mathas 
> <mathas.ch.m@gmail.com> wrote:
>
>     Hi Ricardo,
>
>     first of all thank you for your answer. What you said on the first
>     topic reminded me of something really important I forgot to mention:
>
>     I used the exact same Netflow data for the two executions of
>     ml_ops. The only thing I changed was that I increased the amount
>     of RAM as described. I monitored the VMs with htop during both
>     executions, and the CPU/RAM behavior was totally different
>     between the two. That is already obvious from the fact that the
>     first one failed at ~60000 stages and the second one, with more
>     RAM, failed at ~20000 stages.
>
>     As you say for now all I can do is keep playing with the
>     parameters and see what happens. However, the reason I'm writing
>     to the list is I'm currently writing my undergraduate thesis on
>     the evaluation of Apache Spot by using penetration testing
>     techniques, and I'm trying to get a better handle on how it works
>     and hopefully make it give me some better results.
>
>
>     On 01/23/2018 07:12 PM, Ricardo Barona wrote:
>>     Hi Christos,
>>
>>     Here are my thoughts about your questions.
>>
>>     1. From my experience working with memory in Apache Spark, and
>>     therefore in Apache Spot, you need to know how your data is
>>     distributed, the size of your files and the number of files you
>>     are trying to process. An out-of-memory error can have many
>>     causes, but one of the most common is the number of topics.
>>     When using the default of 20 topics with NetFlow data, each row
>>     in your data set gets 4 vectors of 20 Doubles each: one vector
>>     for the 20 topic probabilities of the source IP, another for the
>>     20 topic probabilities of the destination IP, one more for the
>>     source word and another for the destination word. If you are
>>     running DNS or Proxy, the payload is half of that, as we only
>>     analyse one IP and one word (the word is composed from
>>     information in the same row).
>>     Given that, you need to make sure your executors can fit that
>>     amount of data, or reduce the number of topics. There is (sadly)
>>     no concrete answer other than to experiment until it works.
>>
>>     2. The number of topics is another discussion, one that often
>>     merits its own papers and discussions in many forums, but from
>>     what I've seen in the past the number of topics in Apache Spot
>>     can be as small as 5 and the results are going to be similar to
>>     what you get with 20. I don't have the documents to back that
>>     up, but you can try it yourself and see if what I'm saying is
>>     correct. Again, the number of topics, as well as hyperparameter
>>     tuning, is something you need to play with before you get the
>>     best solution.
>>
>>     Sorry I can't provide more information; as a Software Engineer,
>>     that's as much as I can take from my past conversations with Data
>>     Scientists (like Gustavo). If you create a list of the parameters
>>     you want more information about, please reply with it and I'll
>>     give you my input from what I have seen.
>>
>>     Thanks!
>>
>>
>>     On Tue, Jan 23, 2018 at 10:35 AM, Christos Minas Mathas
>>     <mathas.ch.m@gmail.com> wrote:
>>
>>         Hi,
>>
>>         I've been evaluating the Netflow component of Spot for quite
>>         some time now by using different kinds of attacks and
>>         collecting the results. I'm using the default configuration; I
>>         haven't changed any of the parameters, and the results I'm
>>         getting are not good. I was reading in the users mailing list
>>         some responses from Gustavo Lujan Moreno back in June 2017, in
>>         which he said about the results they're getting:
>>
>>         "On proxy we are getting > 0.90 on AUC and on net flow >0.99."
>>
>>         http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E
>>
>>         My results in terms of AUROC are more like ~0.52 or worse.
>>
>>         He also gave some tips about configuring the parameters of
>>         spot-ml. So I thought I'd try them.
>>
>>         ". . . "--ldamaxiterations 20" is the iteration parameter.
>>         You should change that 20 for something higher, at least 100,
>>         ideally +200.
>>         . . .
>>         If you are not getting good results the number of
>>         iterations and topics should be your priority."
>>
>>         http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E
>>
>>         1. I changed ldamaxiterations to 200, but after running for
>>         ~60000 stages and two and a half hours there wasn't enough RAM
>>         in one of the associated VMs and ml_ops exited with a
>>         StackOverflowException. So I assigned 32GB of RAM to each of
>>         the three associated VMs, and this time it stopped at ~20000
>>         stages, again with a StackOverflow, on another one of the
>>         associated VMs. How much RAM would I need for 200 iterations,
>>         and for which services?
>>
>>         2. Can someone explain how I can properly configure the
>>         parameters of spot-ml? For the topic count, for example, how
>>         can I calculate an approximate number of topics based on the
>>         traffic and the network setup?
>>
>>         If you need further information on my setup or the results
>>         I'm getting just let me know.
>>
>>         Thanks in advance
>>
>>
>
>
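
As a rough illustration of the estimate in Ricardo's quoted reply (4 vectors of 20 Doubles per NetFlow row, half of that for DNS or Proxy), the arithmetic can be sketched as below. This is only a back-of-the-envelope sketch, not part of spot-ml: the function name and the row count are hypothetical, and the figure counts raw 8-byte doubles only, ignoring JVM object overhead and anything else Spark keeps in memory.

```python
BYTES_PER_DOUBLE = 8  # size of one 64-bit floating point value


def topic_payload_bytes(rows, topics=20, vectors_per_row=4):
    """Raw bytes added by the per-row topic-probability vectors.

    vectors_per_row=4 follows the NetFlow description in the thread
    (source IP, destination IP, source word, destination word);
    use 2 for DNS or Proxy (one IP and one word).
    """
    return rows * vectors_per_row * topics * BYTES_PER_DOUBLE


if __name__ == "__main__":
    rows = 10_000_000  # hypothetical number of NetFlow rows
    for topics in (20, 5):
        gib = topic_payload_bytes(rows, topics) / 2**30
        print(f"{topics} topics: {gib:.2f} GiB of raw doubles")
```

With the defaults this comes to 640 bytes per NetFlow row, so dropping from 20 topics to 5 cuts this part of the payload to a quarter. Actual executor memory use will be higher than the raw figure.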

