spot-dev mailing list archives

From "Barona, Ricardo" <ricardo.bar...@intel.com>
Subject Re: Reading custom flow data
Date Thu, 09 Mar 2017 15:58:13 GMT
That’s right, ODM (the Open Data Model) is planned for the future; none of the Spot components
leverage ODM for now.

To answer your question, Giacomo, I can only think of two solutions for now:

1. If your current data is not used as a source by any other process, you can write a
simple Spark job to transform and rename the columns, save the new data set, and delete your
original data. If you need to keep your original data, you can do the same and live with
duplicated data for now.
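For illustration, here's a rough sketch of that rename step in plain Python (in the real job
you'd use Spark's withColumnRenamed on the DataFrame; "treceived" is the name from FlowSchema
mentioned below, while the other column names are made-up placeholders):

```python
# Hypothetical mapping from your existing column names to Spot's flow
# schema names. "treceived" comes from FlowSchema.scala; the names on
# the left are placeholders for whatever your receiver writes, and the
# other Spot-side names are assumptions.
RENAME_MAP = {
    "mytimecolumn": "treceived",
    "src_address": "sip",
    "dst_address": "dip",
}

def rename_columns(row):
    """Rename each column, leaving unmapped columns untouched."""
    return {RENAME_MAP.get(k, k): v for k, v in row.items()}

# One record from your Parquet data, modeled as a plain dict.
record = {
    "mytimecolumn": "2017-03-09 15:58:13",
    "src_address": "10.0.0.1",
    "dst_address": "10.0.0.2",
}

print(rename_columns(record))
```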
2. Patch spot-ml: this might involve different things, but it is doable.
a. You need to update ml_ops.sh. ml_ops.sh is the main script that runs the Spark job; it receives
as parameters the date you want to process, the type of data, and the results you want
to save. Since ml_ops.sh works with a date, one option is to reorganize your data to follow
a structure like /user/<spot-user>/flow/hive/y=2017/m=03/d=09 so you can keep
using this script.
Another option is to use ml_test.sh. This script is made for testing data sets that don't
follow the structure I mentioned. If you go that route, you will need to change some parameters
that are hardcoded inside the script (it's just a test script) so you can save results in a
dynamic location, get a specific number of results, etc.
I’m talking about:
DSOURCE=$1
RAWDATA_PATH=$2
TOL=1.1
MAXRESULTS=20
HPATH=${HUSER}/${DSOURCE}/test/scored_results
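To make that date layout concrete, here's a small Python sketch that builds the y=/m=/d=
partition path from a date (the "spot" user name is hypothetical; the layout just mirrors the
example path above):

```python
from datetime import date

def partition_path(user, dsource, day):
    """Build the Hive-style partition path Spot's scripts expect,
    e.g. /user/<spot-user>/flow/hive/y=2017/m=03/d=09."""
    return "/user/{0}/{1}/hive/y={2}/m={3:02d}/d={4:02d}".format(
        user, dsource, day.year, day.month, day.day)

# For March 9, 2017 this prints /user/spot/flow/hive/y=2017/m=03/d=09
print(partition_path("spot", "flow", date(2017, 3, 9)))
```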
b. Depending on the data type you want to implement (netflow, DNS queries, proxy), you are
going to have to map your columns to our existing columns.
For flow you can check this particular object: https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/FlowSchema.scala

There you will see the column names we use, assigned to constants like this:
val TimeReceived = "treceived"
val TimeReceivedField = StructField(TimeReceived, StringType, nullable = true)
Once you have mapped your columns to Spot's, you can simply change the value of that
String. For the output, you need to preserve Spot's names; to do that, change the name
of the column in the StructField to match the old column name. Your code
should look like this:

val TimeReceived = "mytimecolumn"
val TimeReceivedField = StructField("treceived", StringType, nullable = true)

The String constant holds your column name, while the StructField keeps the old name; again,
that's for the output.
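The input-name/output-name split can be confusing, so here is the same pattern as a
self-contained Python sketch (StructField here is a minimal stand-in for Spark's
org.apache.spark.sql.types.StructField, and "mytimecolumn" is hypothetical):

```python
from collections import namedtuple

# Minimal stand-in for Spark's StructField, for illustration only.
StructField = namedtuple("StructField", ["name", "data_type", "nullable"])

# Input side: the constant carries YOUR column name (hypothetical).
TIME_RECEIVED = "mytimecolumn"

# Output side: the StructField keeps Spot's original name, "treceived",
# so results are written out under the column name Spot expects.
TIME_RECEIVED_FIELD = StructField("treceived", "StringType", True)

print("read from:", TIME_RECEIVED, "| write as:", TIME_RECEIVED_FIELD.name)
```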

I'm trying to think of what other places need changes (for DNS and Proxy it should be pretty
much the same); I'll write another email if I remember anything else.
Let me know how it goes.

Thanks.

On 3/8/17, 11:12 AM, "Giacomo Bernardi" <mino@minux.it> wrote:

    Thanks,
    I had seen a couple of references to the ODM in the Spot docs:
    
    http://spot.incubator.apache.org/project-components/open-data-models/
    https://github.com/apache/incubator-spot/blob/master/docs/open-data-model/open-data-model.md
    
    but I got confused, as I didn't understand whether it is actually in use or is
    a future/planned feature. Can anyone clarify, please?
    
    Thanks,
    Giacomo
    
    
    On 7 March 2017 at 16:53, Michael Ridley <mridley@cloudera.com> wrote:
    
    > Hi Giacomo-
    >
    > Don't have any advice on what you are trying to do, but I think the end
    > goal is to have everything leverage the common data models in Spot.  So I
    > think the recommendation would be to figure out a way to convert your data
    > to the common data model.  But I don't think the Spot ML code actually
    > leverages the common data model yet, so that's more of a future solution.
    >
    > If anyone knows better, feel free to correct me.
    >
    > Michael
    >
    > On Tue, Mar 7, 2017 at 10:57 AM, Giacomo Bernardi <mino@minux.it> wrote:
    >
    > > Hi,
    > > let me ask a suggestion on how to proceed:
    > >
    > > I already have flow data stored HDFS in Parquet files from an existing
    > > netflow receiver system, but with different columns/schema than Spot. I'd
    > > like to patch spot-ml and spot-oa to have them run directly on that data
    > > without having to store everything twice.
    > >
    > > I'm still figuring out the parsing code, any hints on how I should do
    > this?
    > > Or, even better, how to do it in a sane/modular way that can be useful
    > for
    > > everyone?
    > >
    > > Thanks a lot!
    > > Giacomo
    > >
    >
    >
    >
    > --
    > Michael Ridley <mridley@cloudera.com>
    > office: (650) 352-1337
    > mobile: (571) 438-2420
    > Senior Solutions Architect
    > Cloudera, Inc.
    >
    
