spot-user mailing list archives

From Christos Mathas <mathas.c...@gmail.com>
Subject Re: Spot-ml parameters configuration
Date Wed, 31 Jan 2018 15:58:33 GMT
Hi Gustavo,

unfortunately I have zero knowledge of R programming, but I'll give it a 
try. So, to use this script, I have to create an input file that contains 
only the attacks that were in flow_results.csv, and each entry should 
contain the columns host, score, rank, word? Also, what does "word" 
stand for?

Thank you


On 01/31/2018 05:12 PM, Lujan Moreno, Gustavo wrote:
>
> Hi Christos,
>
> If you are able to run 100 iterations, that is fine; you can start 
> analyzing your results. Treat reaching 200 iterations as a second 
> priority for now.
>
> I have doubts about the way you are computing the AUROC. You said that 
> you are only using the first 100 values of the results. I assume they 
> are ordered and ranked. It is not totally incorrect to use the first n 
> values, but that is not the usual AUROC. You need to compute it over 
> the whole dataset. To do this you don’t have to know the rank of every 
> single row, just the ranks of the attacks, which we assume are far 
> fewer than the normal data. I’m attaching an R script which computes 
> the AUROC from the TP and FP rates, which are also computed in the 
> script. You need to specify the size of your normal dataset plus the 
> attacks. This is a very efficient way of computing the AUROC because 
> it only uses the ranks of the attacks. The script is designed to read 
> several result files matching a given file name pattern. The only 
> requirement is an input file with the columns host, score, rank, word, 
> although you can modify this. It also plots the ROC.
>
> Give it a try. You may need to modify a couple of lines.
>
> Best,
>
> Gustavo
>
> setwd("/Users/galujanm/Documents/R/Spark21_tuning/proxy2")
>
> #Call needed libraries
>
> library(pROC)
>
> library(ggplot2)
>
> #This is a function to insert row given a location
>
> insertRow<- function(existingDF, newrow, r) {
>
> existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
>
> existingDF[r,] <- newrow
>
> existingDF
>
> }
>
> totalRows <- 722992037 #Only normal data
>
> totalRows <- totalRows + 200 #I’m injecting 200 attacks in this case
>
> runs<-list.files(pattern= "spark21_proxy2_4_em*")
>
> masterDF <- data.frame()
>
> for (jj in 1:length(runs)){
>
> #readlines of file
>
> all_content = readLines(runs[jj])
>
> #skip the first 8 header lines. This could vary depending on the csv file
>
> skip = all_content[-c(1:8)]
>
> #read the csv file
>
> df = read.csv(textConnection(skip), header = F, stringsAsFactors = FALSE)
>
> #Very important line: drop duplicated rows
>
> df <- df[!duplicated(df),]
>
> #assign the column names
>
> colnames(df) <- c('host','score','rank','word')
>
> #order by rank
>
> df <- df[with(df, order(rank)),]
>
> #every row in this file is an attack ('bad')
>
> df$label <- 'bad'
>
> #Value just to plot
>
> df$y = 1
>
> #dataframe to be plotted
>
> toplot <- df[,c('rank','y')]
>
> #add a little bit of horizontal jitter to plot
>
> toplot$rank <- toplot$rank + as.integer((rnorm(nrow(toplot), sd = 10)))
>
> #adding a little bit of vertical jitter
>
> toplot$y <-rnorm(nrow(toplot))*.05
>
> #Create true positive and false positive columns
>
> df$TP <- -99
>
> df$FP <- -99
>
> #insert row
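> #the inserted row becomes the (0,0) anchor point of the ROC curve; the rbind further down adds the matching (1,1) end point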
>
> df <- insertRow(df,c(1,1,0,0,'null',1,-99,-99),1)
>
> #convert to numeric
>
> df$rank <- as.numeric(df$rank)
>
> #order by rank again, although it should already be ordered
>
> df <- df[order(df$rank),]
>
> #row bind to df
>
> df <- rbind(df,c(1,1,0,0,'null',1,1,1))
>
> #Convert to numeric, or make sure everything is numeric
>
> df$rank <- as.numeric(df$rank)
>
> df$TP <- as.numeric(df$TP)
>
> df$FP <- as.numeric(df$FP)
>
> df$TP[1] <- 0
>
> df$FP[1] <- 0
>
> #next lines compute the TP and FP rate
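> #TP[i] is the fraction of attacks recovered up to this rank; FP[i] is the number of normal rows ranked above the attack, divided by the (approximate) total number of normal rows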
>
> for (i in 2:(nrow(df)-1)){
>
> df$TP[i] <- (i-1)/(nrow(df)-2)
>
> #Formula corrected
>
> df$FP[i] <- (df$rank[i]-i+1)/ (totalRows - nrow(df)-2)
>
> }
>
> #Next lines compute the AUC
>
> AUC <- 0
>
> for (i in 1:(nrow(df)-1)){
>
> AUC <- AUC + (df$TP[i+1]) *(df$FP[i+1]-df$FP[i])
>
> }
>
> #Re-index the y column; the AUC for each replicate is printed at the end of the loop
>
> df$y <- c(0:(nrow(df)-1))
>
> #Write results of the df for a single replicate
>
> #Plot and save ranks visualization
>
> ggplot(toplot,aes(x=rank,y=y)) + geom_point(pch = 21,position = position_jitter(width = 1)) +
>
> ylim(-0.5,.5) + xlim(-round(totalRows*.1,0), totalRows) +
>
> ggtitle("Rank for anomalies")
>
> #ggsave('rank.png',width = 15, height =10, units = 'cm')
>
> #Plot the ROC curve (AUC) for the last replicate
>
> ggplot(df, aes(FP,TP)) + geom_line() +
>
> xlab("FP (1-specificity)") + ylab("TP (sensititiy)") +
>
> ggtitle('ROC-AUC Proxy') +
>
> geom_abline(slope=1, intercept=0)
>
> print(runs[jj])
>
> print(AUC)
>
> }
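>
> In compact form, the rank-based AUROC that the loop above computes can 
> be sketched like this (the ranks and dataset size below are made-up 
> illustration values, not output from the script):
>
> attack_ranks <- sort(c(15, 230, 1200)) #global ranks of the attacks (made up)
> m <- length(attack_ranks)
> N <- 1000000 + m #normal rows plus injected attacks (made up)
> TP <- c(0, seq_len(m)/m, 1)
> FP <- c(0, (attack_ranks - seq_len(m))/(N - m), 1)
> sum(tail(TP, -1) * diff(FP)) #same step-wise sum as the AUC loop above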
>
> *From: *Christos Mathas <mathas.ch.m@gmail.com>
> *Reply-To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
> *Date: *Wednesday, January 31, 2018 at 3:33 AM
> *To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
> *Subject: *Re: Spot-ml parameters configuration
>
> Hi,
>
> I tried increasing the number of max iterations in steps. I did it 
> with 50, 70, 100, 150 and 200. It worked for all numbers except for 
> 200, where I got this:
>
> [Stage 37393:==================================================>(198 + 
> 2) / 200]Exception in thread "main" org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task serialization failed: 
> java.lang.StackOverflowError
>
> and the whole output of ml_ops.sh:
>
> https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0
>
> I should also note that the last time it failed with the iterations 
> set at 200, the error was about a task failing because Yarn lost an 
> executor, but again due to java.lang.StackOverflowError. The only 
> thing I have changed between the two executions is this: export 
> JAVA_TOOL_OPTIONS="-Xmx16G", which I set in /etc/profile.d/ for 
> everything in the system, to check if it was the default java memory 
> value that was causing the problem.
>
> So I'm still stuck on fixing whatever is causing this. I don't have any 
> more leads as to which configuration I should examine. When I do, I 
> will also check the topics parameter as you suggested.
>
> On your other questions:
> "How large is your dataset?"
> I will give as an example the dataset I used for the executions of 
> ml_ops with the different iteration values described above, which is 
> ~1 MB. Here is a screenshot from HDFS:
>
> https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0
>
> "How many attacks.." "How are you generating.."
> I have a Kali VM inside the network from which I deploy the attacks in 
> real time while the traffic is generated. I do one attack at a time. 
> For this particular dataset I used Armitage to do an nmap scan and a 
> Hail Mary attack, which is a really "noisy", unsophisticated attack.
>
> "How are you computing AUROC?"
> I am computing the AUROC in MS Excel, taking as input the first 100 
> rows of flow_results.csv. I have uploaded a file to Dropbox to make it 
> clearer how exactly I'm doing it.
>
> https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0
>
> Thank you
>
> On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
>
>     Hi,
>
>     The number of iterations at 100 should be a priority. If you are
>     not able to run them you have a technical problem with your
>     cluster and you should fix it first. Once you are able to run 100
>     iterations then you should start playing around with the number of
>     topics. Start with 5 topics, run 10 replicates, save the results
>     (AUROC), change to 10 topics, do the same, then go to 20, 50, 100,
>     etc. Once you have the results, plot them (x axis number of
>     topics, y axis the AUROC) and you will be able to see the pattern
>     or trend as well as the variation and central tendency for each of
>     the replicates. Visually, it should be clear which number of
>     topics works best. Finally, just to make this statistically sound,
>     run a pair-wise comparison (Tukey’s test, for example) where the
>     number of topics is your main factor. This statistical analysis is
>     just to prove the significance of the results. For example, if you
>     visually see that 10 topics is better than 5 but the test says
>     there is no statistical difference, then there is no point in
>     running at 10 topics because it is more computationally expensive;
>     you might as well just run it at numTopic = 5.
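>
>     As a rough sketch of that comparison in R (the file name and column
>     names below are placeholders for however you record the results,
>     not Spot output):
>
>     res <- read.csv("topic_runs.csv") #assumed columns: topics, replicate, auroc
>     res$topics <- factor(res$topics)
>     boxplot(auroc ~ topics, data = res, xlab = "number of topics", ylab = "AUROC")
>     fit <- aov(auroc ~ topics, data = res)
>     TukeyHSD(fit) #pair-wise (Tukey) comparison of the topic counts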
>
>     Other questions to consider:
>
>     How large is your dataset?
>
>     How many attacks are you injecting?
>
>     How are you generating the attacks?
>
>     How are you computing the AUROC?
>
>     An AUROC of 0.52 basically tells you that you are finding nothing
>     but randomness.
>
>     Best,
>
>     Gustavo
>
>     *From: *Christos Minas Mathas <mathas.ch.m@gmail.com>
>     *Reply-To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
>     *Date: *Tuesday, January 23, 2018 at 10:36 AM
>     *To: *"user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
>     *Subject: *Spot-ml parameters configuration
>
>     Hi,
>
>     I've been evaluating the Netflow component of Spot for quite some
>     time now, using different kinds of attacks and collecting the
>     results. I'm using the default configuration, without having
>     changed any of the parameters, and the results I'm getting are not
>     good. I was reading in the users mailing list some responses from
>     Gustavo Lujan Moreno back in June 2017 in which he said about the
>     results they're getting:
>
>     "On proxy we are getting > 0.90 on AUC and on net flow >0.99."
>
>     http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E
>
>     My results in terms of AUROC are more like ~0.52 or worse.
>
>     He also gave some tips about configuring the parameters of
>     spot-ml. So I thought I'd try them.
>
>     ". . . '--ldamaxiterations 20' is the iteration parameter. You
>     should change that 20 for something higher, at least 100, ideally
>     +200.
>     . . .
>     If you are not getting good results the number of iterations and
>     topics should be your priority."
>
>     http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E
>
>     1. I changed ldamaxiterations to 200, but after running for ~60000
>     stages and two and a half hours there wasn't enough RAM in one of
>     the associated VMs and ml_ops exited with a
>     StackOverflowException. So I assigned 32 GB of RAM to each of the
>     three associated VMs, and this time it stopped at ~20000 stages,
>     again with a StackOverflow from another one of the associated VMs.
>     How much RAM would I need for 200 iterations, and for which services?
>
>     2. Can someone explain how I can properly configure the parameters
>     of spot-ml? For the topic count, for example, how can I calculate
>     an approximate number of topics based on the traffic and the
>     network setup?
>
>     If you need further information on my setup or the results I'm
>     getting just let me know.
>
>     Thanks in advance
>
>
>

