spot-user mailing list archives

From "Lujan Moreno, Gustavo" <gustavo.lujan.mor...@intel.com>
Subject Re: Spot-ml parameters configuration
Date Wed, 31 Jan 2018 16:21:54 GMT
Hi,

The “word” column is the word created for that specific log entry; it is the input for the LDA. It is not really necessary for computing the AUROC. I left it in my code to get an idea of which words were being ranked high/low. Just discard it for now, as well as the host. You only need the score (probability) and the rank.

Best,

Gustavo


From: Christos Mathas <mathas.ch.m@gmail.com>
Reply-To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Date: Wednesday, January 31, 2018 at 9:58 AM
To: "Lujan Moreno, Gustavo" <gustavo.lujan.moreno@intel.com>, "user@spot.incubator.apache.org"
<user@spot.incubator.apache.org>
Subject: Re: Spot-ml parameters configuration


Hi Gustavo,

unfortunately I have zero knowledge of R programming, but I'll give it a try. So to use this
script, I have to create an input file which contains only the attacks that were in flow_results.csv,
and each entry should contain the columns host, score, rank, and word? Also, what does "word"
stand for?

Thank you

On 01/31/2018 05:12 PM, Lujan Moreno, Gustavo wrote:
Hi Christos,

If you are able to run 100 iterations, that is fine; you can start analyzing your results. Try
to reach 200 iterations as a second priority for now.

I have doubts about the way you are computing the AUROC. You said that you are only using
the first 100 values of the results; I assume they are ordered and ranked. It is not totally
incorrect to use the first n values, but that is not the usual AUROC. You need to compute it
over the whole dataset. To do this you don't have to know the rank of every single row, just
the rank of the attacks, which we assume are far fewer than the normal data. I'm attaching
an R script which computes the AUROC given the TP and FP rates, which are also computed in the
script. You need to specify the size of your normal dataset plus attacks. This is a very
efficient way of computing the AUROC because you only use the ranks of the attacks. The script
is designed to read several result files matching a given file name pattern. The only requirement
is an input file with the columns host, score, rank, word, although you can modify this. It
also plots the ROC.
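To make the rank-only idea concrete, here is a tiny stand-alone sketch; the dataset size and attack ranks below are made up for illustration:

```r
# Sketch: AUROC from the global ranks of the known attacks only (made-up numbers).
totalRows <- 1000             # normal rows plus attacks
ranks <- c(1, 3, 10, 50)      # global ranks of 4 injected attacks, sorted
n <- length(ranks)
TP <- (1:n) / n                          # fraction of attacks recovered at each attack's cut-off
FP <- (ranks - 1:n) / (totalRows - n)    # normal rows ranked above each attack, as a rate
TP <- c(0, TP, 1); FP <- c(0, FP, 1)     # anchor the curve at (0,0) and (1,1)
AUC <- sum(TP[-1] * diff(FP))            # rectangle rule over the ROC curve
AUC                                      # close to 1 for these low ranks
```

Only the attack ranks enter the computation, so the cost is independent of the size of the normal data.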

Give it a try. You may need to modify a couple of lines.

Best,

Gustavo



setwd("/Users/galujanm/Documents/R/Spark21_tuning/proxy2")

# Call needed libraries
library(pROC)
library(ggplot2)

# Function to insert a row into a data frame at a given location
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r + 1, nrow(existingDF) + 1), ] <- existingDF[seq(r, nrow(existingDF)), ]
  existingDF[r, ] <- newrow
  existingDF
}

totalRows <- 722992037       # Only normal data
totalRows <- totalRows + 200 # I'm injecting 200 attacks in this case

# Result files to process, matched by file name pattern
runs <- list.files(pattern = "spark21_proxy2_4_em*")

masterDF <- data.frame()

for (jj in 1:length(runs)) {

  # Read the lines of the file
  all_content = readLines(runs[jj])
  # Skip the first 8 lines. This could vary depending on the csv file
  skip = all_content[-c(1:8)]
  # Read the csv content
  df = read.csv(textConnection(skip), header = F, stringsAsFactors = FALSE)
  # Very important line: remove duplicated rows
  df <- df[!duplicated(df), ]
  # Assign the column names
  colnames(df) <- c('host', 'score', 'rank', 'word')
  # Order by rank
  df <- df[with(df, order(rank)), ]
  # Everything is bad (an injected attack) here
  df$label <- 'bad'
  # Value just to plot
  df$y = 1
  # Data frame to be plotted
  toplot <- df[, c('rank', 'y')]
  # Add a little bit of horizontal jitter to the plot
  toplot$rank <- toplot$rank + as.integer(rnorm(nrow(toplot), sd = 10))
  # Add a little bit of vertical jitter
  toplot$y <- rnorm(nrow(toplot)) * .05
  # Create true positive and false positive columns
  df$TP <- -99
  df$FP <- -99
  # Insert a starting row for the (0, 0) end of the ROC curve
  df <- insertRow(df, c(1, 1, 0, 0, 'null', 1, -99, -99), 1)
  # Convert to numeric
  df$rank <- as.numeric(df$rank)
  # Order by rank, although it should already be ordered, anyways...
  df <- df[order(df$rank), ]
  # Append a row for the (1, 1) end of the ROC curve
  df <- rbind(df, c(1, 1, 0, 0, 'null', 1, 1, 1))
  # Make sure everything is numeric
  df$rank <- as.numeric(df$rank)
  df$TP <- as.numeric(df$TP)
  df$FP <- as.numeric(df$FP)
  df$TP[1] <- 0
  df$FP[1] <- 0
  # Next lines compute the TP and FP rates
  for (i in 2:(nrow(df) - 1)) {
    df$TP[i] <- (i - 1) / (nrow(df) - 2)
    # Formula corrected: the denominator is the number of normal rows,
    # i.e. totalRows minus the nrow(df) - 2 attacks
    df$FP[i] <- (df$rank[i] - i + 1) / (totalRows - nrow(df) + 2)
  }
  # Next lines compute the AUC
  AUC <- 0
  for (i in 1:(nrow(df) - 1)) {
    AUC <- AUC + df$TP[i + 1] * (df$FP[i + 1] - df$FP[i])
  }
  # Reindex y
  df$y <- c(0:(nrow(df) - 1))
  # Write results of the df for a single replicate

  # Plot and save the ranks visualization; print() is needed inside a loop
  print(
    ggplot(toplot, aes(x = rank, y = y)) +
      geom_point(pch = 21, position = position_jitter(width = 1)) +
      ylim(-0.5, .5) + xlim(-round(totalRows * .1, 0), totalRows) +
      ggtitle("Rank for anomalies")
  )
  # ggsave('rank.png', width = 15, height = 10, units = 'cm')

  # Plot the ROC curve for the last replicate
  print(
    ggplot(df, aes(FP, TP)) + geom_line() +
      xlab("FP (1-specificity)") + ylab("TP (sensitivity)") +
      ggtitle('ROC-AUC Proxy') +
      geom_abline(slope = 1, intercept = 0)
  )

  print(runs[jj])
  print(AUC)
}






From: Christos Mathas <mathas.ch.m@gmail.com>
Reply-To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Date: Wednesday, January 31, 2018 at 3:33 AM
To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Subject: Re: Spot-ml parameters configuration


Hi,

I tried increasing the number of max iterations in steps. I did it with 50, 70, 100, 150 and
200. It worked for all numbers except for 200, where I got this:

[Stage 37393:==================================================>(198 + 2) / 200]Exception
in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization
failed: java.lang.StackOverflowError

and the whole output of ml_ops.sh:

https://www.dropbox.com/s/maq0efs8c0xb8ga/ml.out?dl=0
I should also note that the last time it failed with the iterations set at 200, the error
was about a task failing because YARN lost an executor, but again due to java.lang.StackOverflowError.
The only thing I changed between the two executions is this: export JAVA_TOOL_OPTIONS="-Xmx16G",
which I set in /etc/profile.d/ for everything on the system, to check whether the default
Java memory value was causing the problem.

So I'm still stuck on fixing whatever is causing this. I don't have any more leads as to which
configuration I should examine. When I do, I will also check the topics parameter as you suggested.

On your other questions:
"How large is your dataset?"
As an example, the dataset I used for the ml_ops executions with the different iteration
values described above is ~1 MB. Here is a screenshot from HDFS:

https://www.dropbox.com/s/c6hp937zvl77s3p/hdfs_hive.png?dl=0

"How many attacks.." "How are you generating.."
I have a Kali VM inside the network from which I deploy the attacks in real time while the
traffic is generated. I do one attack at a time. For this particular dataset I used Armitage
to do an nmap scan and a Hail Mary attack, which is a really "noisy" and unsophisticated
attack.

"How are you computing AUROC?"
I am computing the AUROC with MS Excel, taking as input the first 100 rows of flow_results.csv.
I have uploaded a file to Dropbox so it is clearer how exactly I'm doing it.

https://www.dropbox.com/s/7k6erupp5jbpnpm/Hail%20Mary_9_ROC.xlsx?dl=0
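For comparison, once every row (not just the top 100) has a label, the AUROC over the full result set can be computed programmatically; a minimal sketch in R, with made-up labels and scores, and assuming the pROC package is installed:

```r
# Sketch: AUROC over the whole result set, not just the top 100 rows.
# Labels and scores are made up; 1 = injected attack, 0 = normal.
set.seed(1)
labels <- c(rep(1, 5), rep(0, 995))
scores <- c(runif(5, 0, 0.2), runif(995, 0, 1))  # attacks get low (suspicious) scores

# Base-R rank formulation: AUC = P(a random normal row scores higher than a random attack)
auc_manual <- mean(outer(scores[labels == 0], scores[labels == 1], ">"))

# Same computation with the pROC package; direction = ">" because attacks score lower
library(pROC)
r <- roc(response = labels, predictor = scores, direction = ">")
as.numeric(auc(r))
```

Both lines give the same number; the `outer()` form is the plain rank definition, while pROC also exposes the curve for plotting.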

Thank you
On 01/29/2018 04:49 PM, Lujan Moreno, Gustavo wrote:
Hi,

The number of iterations at 100 should be a priority. If you are not able to run them, you
have a technical problem with your cluster and you should fix it first. Once you are able
to run 100 iterations, start playing around with the number of topics. Start with 5 topics,
run 10 replicates, save the results (AUROC); change to 10 topics, do the same; then go to
20, 50, 100, etc. Once you have the results, plot them (x axis: number of topics, y axis:
AUROC) and you will be able to see the pattern or trend as well as the variation and central
tendency for each set of replicates. Visually, it should be clear which number of topics
works best. Finally, to make this statistically sound, run a pair-wise comparison (Tukey's
test, for example) where the number of topics is your main factor. This statistical analysis
is just to confirm the significance of the results. For example, if you visually see that 10
topics is better than 5 but the test says there is no statistical difference, then there is
no point in running with 10 topics because it is more computationally expensive; you might as
well just run it at numTopic = 5.
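The replicate comparison described above can be sketched in base R; the AUROC values here are simulated for illustration, not real Spot results:

```r
# Sketch: compare AUROC across topic counts with one-way ANOVA + Tukey's HSD.
# The AUROC values are simulated; replace them with your saved replicate results.
set.seed(42)
results <- data.frame(
  topics = factor(rep(c(5, 10, 20), each = 10)),  # 10 replicates per topic count
  auroc  = c(rnorm(10, 0.90, 0.02),               # 5 topics
             rnorm(10, 0.92, 0.02),               # 10 topics
             rnorm(10, 0.92, 0.02))               # 20 topics
)
fit <- aov(auroc ~ topics, data = results)  # one-way ANOVA, topics as the main factor
print(TukeyHSD(fit))                        # pair-wise comparison of topic levels
boxplot(auroc ~ topics, data = results,     # visual check: trend, spread, central tendency
        xlab = "number of topics", ylab = "AUROC")
```

If a Tukey pair shows no significant difference, pick the cheaper (smaller) topic count, as suggested.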

Other questions to consider:

How large is your dataset?
How many attacks are you injecting?
How are you generating the attacks?
How are you computing the AUROC?

An AUROC of 0.52 basically tells you that you are finding nothing but randomness.

Best,

Gustavo


From: Christos Minas Mathas <mathas.ch.m@gmail.com>
Reply-To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Date: Tuesday, January 23, 2018 at 10:36 AM
To: "user@spot.incubator.apache.org" <user@spot.incubator.apache.org>
Subject: Spot-ml parameters configuration


Hi,

I've been evaluating the Netflow component of Spot for quite some time now by using different
kinds of attacks and collecting the results. I'm using the default configuration; I haven't
changed any of the parameters, and the results I'm getting are not good. I was reading in the
users mailing list some responses from Gustavo Lujan Moreno back in June 2017 in which he
described the results they were getting:

"On proxy we are getting > 0.90 on AUC and on net flow >0.99."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C1DD58ED7-BEE5-47E6-8886-537EE480E3E1%40intel.com%3E

My results in terms of AUROC are more like ~0.52 or worse.

He also gave some tips about configuring the parameters of spot-ml. So I thought I'd try them.

". . . “--ldamaxiterations 20” is the iteration parameter. You should change that 20 for
something higher, at least 100, ideally +200.
. . .
If you are not getting good results the number of iterations and topics should be your priority."

http://mail-archives.apache.org/mod_mbox/spot-user/201706.mbox/%3C4F588C3D-B453-466F-BBCB-F7F1ABE7CC8D%40intel.com%3E

1. I changed ldamaxiterations to 200, but after running for ~60,000 stages and two and a half
hours there wasn't enough RAM in one of the associated VMs and ml_ops exited with a StackOverflowError.
So I assigned 32 GB of RAM to each of the three associated VMs, and this time it stopped
at ~20,000 stages, again with a StackOverflowError from another one of the VMs. How much
RAM would I need for 200 iterations, and for which services?

2. Can someone explain how I can properly configure the parameters of spot-ml? For the topic
count, for example, how can I calculate an approximate number of topics based on the traffic
and the network setup?

If you need further information on my setup or the results I'm getting just let me know.

Thanks in advance




