spot-dev mailing list archives

From Curtis Howard <cur...@cloudera.com>
Subject Re: Configuration-driven ingest for the Open Data Model (ODM) using Spark Streaming (Envelope)
Date Wed, 09 May 2018 20:35:46 GMT
Hi all,

As a follow-up to this thread, I've confirmed with the Envelope team that
the next release (0.6.0, ETA later this summer) will move to using upstream
dependencies rather than Cloudera's (for Spark, Kafka, HBase, etc.).
Envelope will also begin taking public code contributions soon - likely
next month.

As I understand it, the general goal is for Envelope to move into the
public OSS space, similar to the paths of other projects like Impala and
Kudu.

Thanks
Curtis

On Thu, May 3, 2018 at 4:48 PM, Tadd Wood <tadd.wood@digitalminion.com>
wrote:

> Curtis,
>
> Excited to take a look as well :).  Thanks for the hard work on this.
>
> Thank you,
> Tadd Wood
>
>
>
> > On May 2, 2018, at 4:45 AM, Austin Leahy <austin@digitalminion.com> wrote:
> >
> > Curtis, this is very cool. Thanks for putting so much time into this; I will
> > check out the PR and comment.
> >
> > On Tue, May 1, 2018 at 3:37 PM Curtis Howard <curtis@cloudera.com> wrote:
> >
> >> Hi Nathanael,
> >>
> >> So far only https://github.com/Open-Network-Insight/spot-nfdump.git
> >>
> >> The PR code is a proof-of-concept at this point - look forward to your
> >> thoughts on next steps though!
> >>
> >> Thanks again
> >> Curtis
> >>
> >> On Tue, May 1, 2018 at 6:28 PM, Nate Smith <natedogs911@gmail.com> wrote:
> >>
> >>> Curtis,
> >>>
> >>> Have you tested this with a standard version of nfdump? Or only
> >>> spot-nfdump?
> >>>
> >>> - Nathanael
> >>>
> >>>> On May 1, 2018, at 1:12 PM, Curtis Howard <curtis@cloudera.com> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> We had discussed prototyping Envelope for ingest in the past - I've
> >>>> submitted a PR for this which includes:
> >>>> - Kafka -> Spark Streaming -> ODM Hive table applications for dns, flow
> >>>> and proxy raw source data
> >>>> - a simple alternative for source data collection/dissection using
> >>>> tshark/nfdump/unzip + Flume (sinking data to Kafka)
> >>>> - https://github.com/apache/incubator-spot/pull/144
> >>>>
> >>>> To quote directly from the Envelope site
> >>>> (https://github.com/cloudera-labs/envelope#envelope):
> >>>> *"Envelope is simply a pre-made Spark application that implements many of
> >>>> the tasks commonly found in ETL pipelines. In many cases, Envelope allows
> >>>> large pipelines to be developed on Spark with no coding required. When
> >>>> custom code is needed, there are pluggable points in Envelope for core
> >>>> functionality to be extended. Envelope works in batch and streaming modes."*
> >>>>
> >>>> For example, the complete Kafka/SparkStreaming/ODM ingest application
> >>>> definition for DNS:
> >>>> https://github.com/curtishoward/incubator-spot/blob/SPOT-181_envelope_ingest/spot-ingest/odm/workers/spot_proxy.conf
> >>>>
> >>>> From the perspective of the Spot project, my thoughts are that it would
> >>>> enable:
> >>>> - faster turnaround time to ingest new source types while still allowing
> >>>> for arbitrarily complex ETL pipelines (data enrichment, data quality
> >>>> checks, etc.)
> >>>> - simpler future integration with other storage layers (HBase, Kudu, for
> >>>> example)
> >>>> - a framework that is simple to extend (input sources, output storage
> >>>> layers, translators, derivers, UDFs, ...)
> >>>>
> >>>> If there is interest, I will continue to refactor the current
> >>>> implementation - centralize/integrate configuration with spot.conf, test
> >>>> Kerberos integration, run performance tests and tune where possible.
> >>>>
> >>>> In the near term, I will also add a PR with Hive views for dns/flow/proxy
> >>>> under spot-ml/ - this should enable an end-to-end proof-of-concept ODM
> >>>> implementation using Envelope.
> >>>>
> >>>> Thanks
> >>>> Curtis
> >>>
> >>>
> >>
>
>
