spot-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-spot] ktzoulas opened a new pull request #141: Ingestion using Spark Streaming
Date Wed, 11 Sep 2019 10:25:24 GMT
ktzoulas opened a new pull request #141: Ingestion using Spark Streaming
URL: https://github.com/apache/incubator-spot/pull/141
 
 
   
   ## Ingestion using Spark Streaming
   A new branch of Spot Ingest framework, where ingestion via Spark Streaming [**Spark2**]
is available for flow, dns and proxy. The code was developed without modifying the existing
one, to provide this feature as an extra  functionality. This means that the current functionality
of the Spot Ingest framework still exists and the **Streaming Ingestion** can be used **as
an alternative**.
   
   ### Implementation
   A new collector class [**Distributed Collector**] has been added, and it is shared by
all pipelines. Its role is similar to the existing collector's: it processes the data
before transmission. The Distributed Collector watches a directory for newly created
files. When a file is detected, it converts it into CSV format and stores the output in the
local staging area. It then reads the CSV file line by line and creates smaller
chunks of bytes; the size of each chunk depends on the maximum request size allowed by Kafka.
Finally, it serializes each chunk into an Avro-encoded format and publishes it to the Kafka
cluster.
   Due to its architecture, the Distributed Collector can run **on an edge node** of the Big Data
infrastructure as well as **on a remote host** (a proxy server, a vNSF, etc.).
   The Distributed Collector publishes only the CSV-converted file to Apache Kafka, not the
original one; the binary file remains on the local filesystem of the current host.
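
   As an illustration, here is a minimal Python sketch of such a chunk-and-publish loop,
using the kafka-python and avro packages. The Avro schema, the overhead headroom, and the
function names are assumptions for the example, not the PR's actual code:

```python
import io

import avro.io
import avro.schema
from kafka import KafkaProducer

MAX_REQUEST_SIZE = 1048576  # assumed: Kafka's default max.request.size (1 MB)

# Assumed Avro schema: one record holding an array of CSV lines.
AVRO_SCHEMA = avro.schema.parse('''
{"type": "record", "name": "chunk",
 "fields": [{"name": "lines", "type": {"type": "array", "items": "string"}}]}
''')


def serialize(lines):
    """Avro-encode one chunk of CSV lines into bytes."""
    buf = io.BytesIO()
    avro.io.DatumWriter(AVRO_SCHEMA).write(
        {'lines': lines}, avro.io.BinaryEncoder(buf))
    return buf.getvalue()


def publish_csv(csv_path, topic, brokers):
    """Read a CSV file line by line, group the lines into size-bounded
    chunks, and publish each Avro-encoded chunk to the Kafka cluster."""
    producer = KafkaProducer(bootstrap_servers=brokers,
                             max_request_size=MAX_REQUEST_SIZE)
    chunk, size = [], 0
    with open(csv_path) as csv_file:
        for line in csv_file:
            # Flush before the chunk would exceed the request-size cap,
            # leaving headroom for Avro and Kafka protocol overhead.
            if chunk and size + len(line) > MAX_REQUEST_SIZE - 10240:
                producer.send(topic, serialize(chunk))
                chunk, size = [], 0
            chunk.append(line)
            size += len(line)
    if chunk:
        producer.send(topic, serialize(chunk))
    producer.flush()
```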
   
   By contrast, the **Streaming Listener** can only run on the central infrastructure. It listens
on a specific Kafka topic and consumes the incoming messages. The streaming data is divided
into batches according to a time interval; the Listener deserializes each batch
using the supported Avro schema, parses it, and registers it in the corresponding table of
Hive.
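
   A rough sketch of such a listener on Spark 2, using the `KafkaUtils` direct-stream API
from spark-streaming-kafka-0-8; the topic, broker, batch interval, and Hive table names
are assumed placeholders, and the schema must match the one used by the collector:

```python
import io

import avro.io
import avro.schema
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

BATCH_INTERVAL = 30          # assumed micro-batch interval, in seconds
TOPIC = 'SPOT-INGEST-FLOW'   # assumed topic name
TARGET_TABLE = 'spot.flow'   # assumed pre-created Hive table

# Must match the schema the collector used when publishing.
AVRO_SCHEMA = avro.schema.parse('''
{"type": "record", "name": "chunk",
 "fields": [{"name": "lines", "type": {"type": "array", "items": "string"}}]}
''')


def deserialize(raw_bytes):
    """Decode one Avro-encoded chunk back into its list of CSV lines."""
    decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
    return avro.io.DatumReader(AVRO_SCHEMA).read(decoder)['lines']


spark = (SparkSession.builder.appName('streaming-listener')
         .enableHiveSupport().getOrCreate())
ssc = StreamingContext(spark.sparkContext, BATCH_INTERVAL)

stream = KafkaUtils.createDirectStream(
    ssc, [TOPIC], {'metadata.broker.list': 'kafka-broker:9092'},
    valueDecoder=deserialize)


def store_batch(rdd):
    # Called once per batch interval: parse the batch's CSV lines and
    # append them to the pipeline's Hive table.
    if not rdd.isEmpty():
        lines = rdd.flatMap(lambda kv: kv[1])  # kv = (key, decoded lines)
        spark.read.csv(lines).write.mode('append').insertInto(TARGET_TABLE)


stream.foreachRDD(store_batch)
ssc.start()
ssc.awaitTermination()
```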
   
   For each pipeline, the following modules have been implemented (a hypothetical sketch
follows the list):
   * **processing** - contains the methods used to convert and prepare ingested
data before it is sent to the Kafka cluster.
   * **streaming** - contains the methods used during the streaming process (such as the table
schema).
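
   For the flow pipeline, the two modules might look roughly like this; the module paths,
function names, and nfdump invocation are illustrative assumptions, not the PR's API:

```python
# Hypothetical sketch of the flow pipeline's two modules.

# pipelines/flow/processing.py -- used by the Distributed Collector.
import os
import subprocess


def convert(input_file, staging_dir):
    """Turn a binary .nfcapd capture into a CSV file in the local
    staging area using nfdump, and return the CSV path."""
    csv_path = os.path.join(staging_dir,
                            os.path.basename(input_file) + '.csv')
    with open(csv_path, 'w') as out:
        subprocess.check_call(['nfdump', '-r', input_file, '-o', 'csv'],
                              stdout=out)
    return csv_path


# pipelines/flow/streaming.py -- used by the Streaming Listener.
TARGET_TABLE = 'spot.flow'  # placeholder Hive table for this pipeline
```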
   
   ### Advantages
   A preliminary traffic-ingestion check covering all pipelines (flow, dns, proxy) of the distributed
collector-worker modules was carried out on an existing Spot v1.0 installation. In summary,
the module performed flawlessly on netflow files (both .nfcapd and .csv) and was an **order of
magnitude quicker** than the standard version's ingestion. DNS and proxy ingestion were tested as
well, with similar results.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
