spot-dev mailing list archives

From Tadd Wood <tadd.w...@digitalminion.com>
Subject Re: Configuration-driven ingest for the Open Data Model (ODM) using Spark Streaming (Envelope)
Date Thu, 03 May 2018 20:48:26 GMT
Curtis,

Excited to take a look as well :).  Thanks for the hard work on this.

Thank you,
Tadd Wood



> On May 2, 2018, at 4:45 AM, Austin Leahy <austin@digitalminion.com> wrote:
> 
> Curtis, this is very cool. Thanks for putting so much time into this; I will
> check out the PR and comment.
> 
> On Tue, May 1, 2018 at 3:37 PM Curtis Howard <curtis@cloudera.com> wrote:
> 
>> Hi Nathanael,
>> 
>> So far only https://github.com/Open-Network-Insight/spot-nfdump.git
>> 
>> The PR code is a proof-of-concept at this point - look forward to your
>> thoughts on next steps though!
>> 
>> Thanks again
>> Curtis
>> 
>> On Tue, May 1, 2018 at 6:28 PM, Nate Smith <natedogs911@gmail.com> wrote:
>> 
>>> Curtis,
>>> 
>>> Have you tested this with a standard version of nfdump? Or only
>>> spot-nfdump?
>>> 
>>> - Nathanael
>>> 
>>>> On May 1, 2018, at 1:12 PM, Curtis Howard <curtis@cloudera.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> We had discussed prototyping Envelope for ingest in the past - I've
>>>> submitted a PR for this which includes:
>>>> - Kafka -> Spark streaming -> ODM Hive table applications for dns, flow
>>>> and proxy raw source data
>>>> - a simple alternative for source data collection/dissection using
>>>> tshark/nfdump/unzip + Flume (sinking data to Kafka)
>>>> - https://github.com/apache/incubator-spot/pull/144
>>>> 
>>>> To quote directly from the Envelope site
>>>> (https://github.com/cloudera-labs/envelope#envelope):
>>>> *"Envelope is simply a pre-made Spark application that implements many of
>>>> the tasks commonly found in ETL pipelines. In many cases, Envelope allows
>>>> large pipelines to be developed on Spark with no coding required. When
>>>> custom code is needed, there are pluggable points in Envelope for core
>>>> functionality to be extended. Envelope works in batch and streaming
>>>> modes."*
>>>> 
>>>> For example, the complete Kafka/SparkStreaming/ODM ingest application
>>>> definition for DNS:
>>>> https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
>>>> 
>>>> From the perspective of the Spot project, my thought is that it would
>>>> provide:
>>>> - faster turnaround time to ingest new source types while still allowing
>>>> for arbitrarily complex ETL pipelines (data enrichment, data quality
>>>> checks, etc.)
>>>> - simpler future integration with other storage layers (HBase, Kudu, for
>>>> example)
>>>> - a framework that is simple to extend (input sources, output storage
>>>> layers, translators, derivers, UDFs, ...)
>>>> 
>>>> If there is interest, I will continue to refactor the current
>>>> implementation: centralize/integrate configuration with spot.conf, test
>>>> Kerberos integration, and run performance tests and tune where possible.
>>>> 
>>>> In the near term, I will also add a PR with Hive views for dns/flow/proxy
>>>> under spot-ml/ - this should enable an end-to-end proof-of-concept ODM
>>>> implementation using Envelope.
>>>> 
>>>> Thanks
>>>> Curtis
>>> 
>>> 
>> 

