spot-dev mailing list archives

From markgrover <...@git.apache.org>
Subject [GitHub] incubator-spot pull request #7: Open Data Model
Date Tue, 10 Jan 2017 21:37:57 GMT
Github user markgrover commented on a diff in the pull request:

    https://github.com/apache/incubator-spot/pull/7#discussion_r95460852
  
    --- Diff: docs/Open Data Model/Open Data Model.md ---
    @@ -0,0 +1,755 @@
    +Overview
    +
    +Apache Spot Open Data Model Strategy
    +
    +Apache Spot Enabled Use Cases
    +
    +Data Model
    +
    +Naming Convention
    +
    +Prefixes
    +
    +Security Event Log/Alert Data Model
    +
    +Common
    +
    +Network
    +
    +File
    +
    +Endpoint
    +
    +User
    +
    +DNS
    +
    +Proxy
    +
    +HTTP
    +
    +SMTP
    +
    +FTP
    +
    +SNMP
    +
    +TLS
    +
    +SSH
    +
    +DHCP
    +
    +IRC
    +
    +Flow
    +
    +Context Models
    +
    +User Context Model
    +
    +Endpoint Context Model
    +
    +Network Context Model
    +
    +Extensibility of Data Model
    +
    +Model Relationships
    +
    +Data Ingestion Framework
    +
    +Data Formats
    +
    +Avro
    +
    +JSON
    +
    +Parquet
    +
    +ODM Resultant Capability - A Singular View
    +
    +**Example - Advanced Threat Modeling**
    +
    +**Example - Singular Data View for Complete Context**
    +
    +
    +
    +**Overview**
    +----
    +
    +This document describes a strategy for creating an open data model (ODM) for Apache Spot
(incubating) (formerly known as “Open Network Insight (ONI)”) in support of cyber security
analytic use cases. It also describes the use cases that Apache Spot (incubating) running
on the Cloudera platform is uniquely capable of addressing, along with the data model.
    +
    +
    +
    +**Apache Spot (incubating) Open Data Model Strategy**
    +------------------------------------
    +
    +The Apache Spot (incubating) Open Data Model (ODM) strategy aims to extend Apache Spot
(incubating) capabilities to support a broader set of cyber security use cases than initially
supported. The primary use case initially supported by Apache Spot (incubating) is Network
Traffic Analysis for network flows (NetFlow, sFlow, etc.), DNS and proxy, primarily the identification
of threats through anomalous event detection using both supervised and unsupervised machine
learning.
    +
    +In order to support a broader set of use cases, Spot must be extended to collect and
analyze other common
    +“event-oriented” data sources analyzed for cyber threats, including but not limited
to the following log types:
    +
    +> ●Proxy
    +> 
    +> ●Web server
    +> 
    +> ●Operating system
    +> 
    +> ●Firewall
    +> 
    +> ●Intrusion Prevention/Detection (IDS/IPS)
    +> 
    +> ●Data Loss Prevention
    +> 
    +> ●Active Directory / Identity Management
    +> 
    +> ●User/Entity Behavior Analysis
    +> 
    +> ●Endpoint Protection/Asset Management
    +> 
    +> ●Network Metadata/Session and PCAP files
    +> 
    +> ●Network Access Control
    +> 
    +> ●Mail
    +> 
    +> ●VPN
    +> 
    +> ●etc.
    +
    +One of the biggest challenges organizations face today in combating cyber threats is
collecting and normalizing data from the myriad of security event data sources (hundreds)
in order to build the needed analytics. This often results in the analytics being dependent
upon the specific technologies used by an organization to detect threats and prevents the
needed flexibility and agility to keep up with these ever-increasing (and complex) threats.
Technology lock-in is sometimes a byproduct of today’s status quo, as it’s extremely costly
to add new technologies (or replace existing ones) because of the downstream analytic dependencies.
    +
    +To achieve the goal of extending Apache Spot (incubating) to support additional use cases,
it is necessary to create an open data model for the most relevant security event and contextual
data sources: security event logs or alerts, network context, user details, and information
that comes from the endpoints or from any other console used to manage the security
and administration of those endpoints. The presence of an open data model, which can be applied
“on-read” or “on-write”, in batch or stream, will allow for the separation of security
analytics from the specific data sources on which they are built. This “separation of duties”
will enable organizations to build analytics that are not dependent upon specific technologies
and provide the flexibility to change underlying data sources and also provide segmentation
of this information, without impacting the analytics. This will also afford security vendors
the opportunity to build additional products on top of the Open Data Model to drive new
revenue streams and to design new ways to detect threats and APTs.
    +
    +
    +**Apache Spot (incubating) Enabled Use Cases**
    +----
    +
    +Spot on the Cloudera platform is uniquely positioned to help address the following cyber
security use cases,
    +which are not effectively addressed by legacy technologies:
    +
    + 
    +
    + **- Detection of known & unknown threats leveraging machine learning and advanced
analytic modeling**
    +
    +Current technologies are limited in the analytics they can apply to detect threats. These
limitations stem from the inability to collect all the data sources needed to effectively
identify threats (structured, unstructured, etc.) and inability to process the massive volumes
of data needed to do so (billions of events per day). Legacy technologies are typically focused
on, and limited to, rules-based and signature-based detection. They are somewhat “effective” at detecting
known threats but struggle with new threats.
    +
    +Spot addresses these gaps through its ability to collect any data type of any volume.
Coupled with the various analytic frameworks that are provided (including machine learning),
Spot enables a whole new class of analytics that can scale to today’s demands. The topic
model used by Spot to detect anomalous network traffic is one example of where the Spot platform
excels.
    +
    + **- Reduction of mean time to incident detection & resolution (MTTR)**
    +
    +One of the challenges organizations face today is detecting threats early enough to minimize
adverse impacts. This stems from the analytic limitations discussed above. It can also be
attributed to the fact that investigative queries often take hours or days to return results.
Legacy technologies cannot offer a central data store for such investigations because they are
unable to store and serve the massive amounts of data involved. This cripples incident
investigations and results in MTTRs of many weeks or months. Meanwhile, the adverse impacts
of the breach are magnified, making the threat harder to eradicate.
    +
    +Apache Spot (incubating) addresses these gaps by providing the capability for a central
data store that houses ALL the data needed to facilitate an investigation, returning investigative
query results in seconds and minutes (vs. hours and days). Spot can effectively reduce incident
MTTR and reduce adverse impacts of a breach.
    +
    + **- Threat Hunting**
    +
    +It’s become necessary for organizations to “hunt” for active threats as traditional
passive threat detection approaches are not sufficient. “Hunting” involves performing
ad-hoc searches and queries over vast amounts of data representing many weeks and months’
worth of events, as well as applying ad-hoc or tuned algorithms to detect the needle in the
haystack. Traditional systems do not perform well for these types of activities as the query
results sometimes take hours and days to be retrieved. These traditional systems also lack
the analytic flexibility to construct the necessary algorithms and logic needed.
    +
    +Apache Spot (incubating) addresses these gaps in the same way it addresses the others: by
providing a central data store with the needed analytic frameworks that scale to the needed
workloads.
    +
    +**Data Model**
    +----------
    +In order to provide a framework for effectively analyzing data for cyber threats, it
is necessary to collect and
    +analyze standard security event logs/alerts and contextual data regarding the entities
referenced in these logs/alerts. The most common entities include network, user and endpoint,
but there are others such as file.
    +
    +In the diagram below, the raw event tells us that user “jsmith” successfully logged
in to an Oracle database from the IP address 10.1.1.3. Based on the raw event only, we don’t
know if this event is a legitimate threat or not. After injecting user and endpoint context,
the enriched event tells us this event is a potential threat that requires further investigation.
    +
    +![Screen Shot 2016-09-22 at 1.11.28 PM.png](CybersecurityOpenDataModel0%204-3_files/image001.jpg)
    +
    +Based on the need to collect and analyze both security events, logs or alerts and contextual
data, support for
    +the following types of security information is planned for inclusion in the Spot Open
Data Model:
    +
    + - Security event logs/alerts
    +This data type includes event logs from common data sources used to detect threats and
includes network flows, operating system logs, IPS/IDS logs, firewall logs, proxy logs, web
logs, DLP logs, etc.
    +
    + - Network context data
    +This data type includes information about the network, which can be gleaned from Whois
servers, asset databases and other similar data sources.
    +
    + - User context data
    +This data type includes information from user and identity management systems including
Active Directory, Centrify, and other identity and access management systems.
    +
    + - Endpoint context data
    +This data includes information about endpoint systems (servers, workstations, routers,
switches, etc.) and can be sourced from asset management systems, vulnerability scanners,
and endpoint management/detection/response systems such as Webroot, Tanium, Sophos, Endgame,
CarbonBlack, Intel Security ePO and others.
    +
    + - File context data **(ROADMAP ITEM)**
    +This data includes contextual information about files and can be sourced from systems
such as FireEye, Application Control and others.
    +
    + - Threat intelligence context data **(ROADMAP ITEM)**
    +This data includes contextual information about URLs, domains, websites, files and others.
    +
    +**Naming Convention**
    +-----------------
    +
    +A naming convention is needed for the Open Data Model to represent common attributes
across vendor products and technologies. The naming convention is described below.
    +
    +**Prefixes**
    +--------
    +
    +| Prefix | Description |
    +|---|---|
    +| src | Corresponds to the “source” fields within a given event (e.g. source address) |
    +| dst | Corresponds to the “destination” fields within a given event (e.g. destination address) |
    +| dvc | Corresponds to the “device” fields within a given event (e.g. device address) and represents where the event originated |
    +| fwd | Forwarded from device |
    +| request | Corresponds to requested values (vs. those returned, e.g. “requested URI”) |
    +| response | Corresponds to response values (vs. those requested) |
    +| file | Corresponds to the “file” fields within a given event (e.g. file type) |
    +| user | Corresponds to user attributes (e.g. name, id) |
    +| xlate | Corresponds to translated values within a given event (e.g. src_xlate_ip for “translated source ip address”) |
    +| in | Ingress |
    +| out | Egress |
    +| new | New value |
    +| orig | Original value |
    +| app | Corresponds to values associated with application events |
    +
    +
    +**Security Event Log/Alert Data Model**
    +-----------------------------------
    +
    +The data model for security event logs/alerts is detailed below. The attributes
are categorized as follows:
    +
    + - Common - attributes that are common across many device types
    + - Device - attributes that are applicable to the device that generated the event
    + - File - attributes that are applicable to file objects referenced in the event
    + - Endpoint - attributes that are applicable to the endpoints referenced in the event
    + - User - attributes that are applicable to the user referenced in the event
    + - Proxy - attributes that are applicable to proxy events
    + - Protocol
    +
    +> DNS - attributes that are specific to DNS events
    +> HTTP - attributes that are specific to HTTP events
    +> SMTP, SSH, TLS, DHCP, IRC, SNMP and FTP
    +
    +Note: The model will evolve to include reserved attributes for additional device types
that are not currently represented. The model can currently be extended to support ANY attribute
for ANY device type by following the guidance outlined in the section titled **“Extensibility
of Data Model”.**
    +
    +Note: Attributes denoted in BLUE represent those that are listed in the model multiple
times for the purpose of
    --- End diff --
    
    I don't know if markdown can do color. If not, we should change BLUE to bold. And, change
all the blue items in the original doc to be bolded.
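
If bold is the way to go, the swap could be scripted rather than done by hand. A minimal sketch, assuming the blue attribute names can be listed up front; `bold_attributes` and the sample attribute names are hypothetical and not taken from the PR:

```python
import re


def bold_attributes(markdown: str, names: list[str]) -> str:
    """Wrap each listed attribute name in ** unless it is already bold."""
    if not names:
        return markdown
    # Longest names first so a longer name wins over a shorter one that is its prefix.
    alternatives = "|".join(
        re.escape(name) for name in sorted(names, key=len, reverse=True)
    )
    # The lookarounds skip matches inside longer identifiers and
    # names that are already surrounded by asterisks.
    pattern = re.compile(rf"(?<![\w*])({alternatives})(?![\w*])")
    return pattern.sub(r"**\1**", markdown)


if __name__ == "__main__":
    # Hypothetical table row; the real attribute names live in the original doc.
    row = "| src_ip4 | Source IP address |"
    print(bold_attributes(row, ["src_ip4"]))
```

Running it a second time over the same text leaves it unchanged, so it should be safe to apply repeatedly while the doc is still being edited.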


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
