spot-dev mailing list archives

From Nate Smith <natedogs...@gmail.com>
Subject Re: Configuration-driven ingest for the Open Data Model (ODM) using Spark Streaming (Envelope)
Date Tue, 01 May 2018 22:28:26 GMT
Curtis, 

Have you tested this with a standard version of nfdump? Or only spot-nfdump?

- Nathanael

> On May 1, 2018, at 1:12 PM, Curtis Howard <curtis@cloudera.com> wrote:
> 
> Hi all,
> 
> We had discussed prototyping Envelope for ingest in the past - I've
> submitted a PR for this which includes:
>  - Kafka -> Spark streaming -> ODM Hive table applications for dns, flow
> and proxy raw source data
>  - a simple alternative for source data collection/dissection using
> tshark/nfdump/unzip + Flume (sinking data to Kafka)
>  - https://github.com/apache/incubator-spot/pull/144
> 
> To quote directly from the Envelope site
> (https://github.com/cloudera-labs/envelope#envelope):
> *"Envelope is simply a pre-made Spark application that implements many of
> the tasks commonly found in ETL pipelines. In many cases, Envelope allows
> large pipelines to be developed on Spark with no coding required. When
> custom code is needed, there are pluggable points in Envelope for core
> functionality to be extended. Envelope works in batch and streaming modes."*
> 
> For example, the complete Kafka/SparkStreaming/ODM ingest application
> definition for DNS:
> https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
> 
> From the perspective of the Spot project, my thoughts are that it would
> enable:
>  - faster turnaround time to ingest new source types while still allowing
> for arbitrarily complex ETL pipelines (data enrichment, data quality
> checks, etc.)
>  - simpler future integration with other storage layers (HBase or Kudu,
> for example)
>  - a framework that is simple to extend (input sources, output storage
> layers, translators, derivers, UDFs, ...)
> 
> If there is interest, I will continue to refactor the current
> implementation: centralize and integrate configuration with spot.conf,
> test Kerberos integration, and run performance tests and tune where
> possible.
> 
> In the near term, I will also add a PR with Hive views for dns/flow/proxy
> under spot-ml/ - this should enable an end-to-end proof-of-concept ODM
> implementation using Envelope.
> 
> Thanks
> Curtis

