metron-dev mailing list archives

From "Zeolla@GMail.com" <zeo...@gmail.com>
Subject Re: Long-term storage for enriched data
Date Fri, 06 Jan 2017 19:28:25 GMT
Does it really need to account for all enrichments off the bat?  I'm not
familiar with these options in practice, but my research led me to believe
that adding fields to an Avro schema is not a huge issue; changing or
removing them is the true problem.  I have no proof to substantiate that
claim, however — only that I've seen the question asked and read people
familiar with Avro reply uniformly in that way.

My thought, based on that assumption, is that we simply need to handle the
out-of-the-box enrichments and document the required schema change in our
guides to creating custom enrichments.
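
To make that concrete (a sketch only — the field names below are made up,
not our actual schema), Avro can check that a reader schema which adds a
defaulted field still resolves data written with the old one:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    public class SchemaEvolutionCheck {
      // v1: what the old records were written with.
      static final String V1 =
          "{\"type\":\"record\",\"name\":\"Enriched\",\"fields\":["
          + "{\"name\":\"ip_src_addr\",\"type\":\"string\"}]}";

      // v2: adds an enrichment field with a default, so v1 data still reads.
      static final String V2 =
          "{\"type\":\"record\",\"name\":\"Enriched\",\"fields\":["
          + "{\"name\":\"ip_src_addr\",\"type\":\"string\"},"
          + "{\"name\":\"geo_country\",\"type\":[\"null\",\"string\"],"
          + "\"default\":null}]}";

      public static void main(String[] args) {
        Schema writer = new Schema.Parser().parse(V1);
        Schema reader = new Schema.Parser().parse(V2);
        // Prints COMPATIBLE; removing or retyping a field instead would not be.
        System.out.println(SchemaCompatibility
            .checkReaderWriterCompatibility(reader, writer).getType());
      }
    }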

In ES we are currently doing one template per sensor, which gives us the
flexibility for overlapping field names to be mapped differently per sensor.
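
For reference, one of those per-sensor templates looks roughly like this
(index pattern, doc type, and mappings here are illustrative, not our exact
template):

    PUT _template/bro_index
    {
      "template": "bro_index*",
      "mappings": {
        "bro_doc": {
          "properties": {
            "timestamp":   { "type": "date", "format": "epoch_millis" },
            "ip_src_addr": { "type": "ip" }
          }
        }
      }
    }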

Jon

On Fri, Jan 6, 2017, 12:33 PM Kyle Richardson <kylerichardson2@gmail.com>
wrote:

> Thanks, Jon. Really interesting talk.
>
> For the GitHub data set discussed (which probably most closely mimics
> Metron data due to the number of fields and overall diversity), Avro with
> Snappy compression seemed like the best balance of storage size and
> retrieval time. I did find it interesting that he said Parquet was
> originally developed for log data sets but didn't perform as well on the
> GitHub data.
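>
> In case it helps anyone trying this, the Snappy part is a one-liner on
> the Avro writer. A sketch only (schema and paths are made up, and
> snappy-java has to be on the classpath):
>
>     import org.apache.avro.Schema;
>     import org.apache.avro.file.CodecFactory;
>     import org.apache.avro.file.DataFileWriter;
>     import org.apache.avro.generic.*;
>     import java.io.File;
>
>     // (inside a method that throws IOException)
>     Schema schema = new Schema.Parser().parse(new File("enriched.avsc"));
>     DataFileWriter<GenericRecord> writer =
>         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
>     writer.setCodec(CodecFactory.snappyCodec());  // the Snappy part
>     writer.create(schema, new File("enriched.avro"));
>     GenericRecord rec = new GenericData.Record(schema);
>     rec.put("ip_src_addr", "10.0.0.1");
>     writer.append(rec);
>     writer.close();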
>
> I think our challenge is going to be on the schema. Would we create a
> schema per sensor type and try to account for all of the possible
> enrichments? Problem there is that similar data may not be mapped to the
> same field names across sensors. We may need to think about expanding our
> base JSON schema beyond these 7 fields (
> https://cwiki.apache.org/confluence/display/METRON/Metron+JSON+Object) to
> account for normalizing things like URL, user name, and disposition (e.g.
> whether an action was allowed or denied).
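>
> As a straw man (every field name below is hypothetical, just to make the
> idea concrete), the expanded base record could normalize the common
> fields and push anything sensor-specific into a map, so overlapping names
> stay out of the shared namespace:
>
>     {
>       "type": "record",
>       "name": "MetronEvent",
>       "fields": [
>         { "name": "source_type", "type": "string" },
>         { "name": "timestamp",   "type": "long" },
>         { "name": "url",         "type": ["null", "string"], "default": null },
>         { "name": "username",    "type": ["null", "string"], "default": null },
>         { "name": "disposition", "type": ["null", "string"], "default": null },
>         { "name": "extensions",
>           "type": { "type": "map", "values": "string" }, "default": {} }
>       ]
>     }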
>
> Thoughts?
>
> -Kyle
>
> On Tue, Jan 3, 2017 at 11:30 AM, Zeolla@GMail.com <zeolla@gmail.com>
> wrote:
>
> > For those interested, I ended up finding a recording of the talk itself
> > when doing some Avro research -
> > https://www.youtube.com/watch?v=tB28rPTvRiI
> >
> > Jon
> >
> > On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <mattf@apache.org> wrote:
> >
> > > I’m not an expert on these things, but my understanding is that Avro
> > > and ORC serve many of the same needs.  The biggest difference is that
> > > ORC is columnar, and Avro isn’t.  Avro, ORC, and Parquet were compared
> > > in detail at last year’s Hadoop Summit; the slideshare prezo is here:
> > > http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
> > >
> > > Its conclusion: “For complex tables with common strings, Avro with
> > > Snappy is a good fit.  For other tables [or when applications “just
> > > need a few columns” of the tables], ORC with Zlib is a good fit.”
> > > (The addition in square brackets incorporates a quote from another
> > > part of the prezo.)  But do look at the prezo please; it gives
> > > detailed benchmarks showing when each one is better.
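> > >
> > > If the ORC route wins, the Zlib choice is just a table property.  A
> > > sketch (table name and columns are made up; ZLIB is actually the ORC
> > > default, shown here for explicitness):
> > >
> > >     CREATE TABLE enriched_bro (
> > >       ts           BIGINT,
> > >       ip_src_addr  STRING,
> > >       geo_country  STRING
> > >     )
> > >     STORED AS ORC
> > >     TBLPROPERTIES ("orc.compress"="ZLIB");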
> > >
> > > --Matt
> > >
> > > On 1/1/17, 5:18 AM, "Zeolla@GMail.com" <zeolla@gmail.com> wrote:
> > >
> > >     I don't recall a conversation on that product specifically, but
> > >     I've definitely brought up the need to search HDFS from time to
> > >     time.  Things like Spark SQL, Hive, and Oozie have been discussed,
> > >     but Avro is new to me; I'll have to look into it.  Are you able to
> > >     summarize its benefits?
> > >
> > >     Jon
> > >
> > >     On Wed, Dec 28, 2016, 14:45 Kyle Richardson <kylerichardson2@gmail.com>
> > >     wrote:
> > >
> > >     > This thread got me thinking... there are likely a fair number of
> > >     > use cases for searching and analyzing the output stored in HDFS.
> > >     > Dima's use case is certainly one. Has there been any discussion
> > >     > on the use of Avro to store the output in HDFS? This would likely
> > >     > require an expansion of the current JSON schema.
> > >     >
> > >     > -Kyle
> > >     >
> > >     > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <cestella@gmail.com>
> > >     > wrote:
> > >     >
> > >     > > Oozie (or something like it) would appear to me to be the
> > >     > > correct tool here.  You are likely moving files around and
> > >     > > pinning up Hive tables:
> > >     > >
> > >     > >    - Moving the data written in HDFS from
> > >     > >      /apps/metron/enrichment/${sensor} to another directory
> > >     > >      in HDFS
> > >     > >    - Running a job in Hive, Pig, or Spark to take the JSON
> > >     > >      blobs, map them to rows, and pin them up as an ORC table
> > >     > >      for downstream analytics
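> > >     > >
> > >     > > For that second step, a minimal Spark (Java) sketch of the
> > >     > > idea, with made-up paths and table name:
> > >     > >
> > >     > >     import org.apache.spark.sql.*;
> > >     > >
> > >     > >     SparkSession spark = SparkSession.builder()
> > >     > >         .appName("json-to-orc")
> > >     > >         .enableHiveSupport().getOrCreate();
> > >     > >     // Map the JSON blobs to rows...
> > >     > >     Dataset<Row> df =
> > >     > >         spark.read().json("hdfs:///metron/archive/bro");
> > >     > >     // ...and pin them up as an ORC table.
> > >     > >     df.write().format("orc").mode("append")
> > >     > >         .saveAsTable("metron.bro_enriched");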
> > >     > >
> > >     > > NiFi is mostly about getting data into the cluster, not really
> > >     > > for scheduling large-scale batch ETL, I think.
> > >     > >
> > >     > > Casey
> > >     > >
> > >     > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <Dima.Kovalyov@sstech.us>
> > >     > > wrote:
> > >     > >
> > >     > > > Thank you for the reply, Carolyn.
> > >     > > >
> > >     > > > Currently, for test purposes, we enrich flow data with Geo
> > >     > > > and ThreatIntel malware IP, but we plan to expand this
> > >     > > > further.
> > >     > > >
> > >     > > > Our dev team is working on an Oozie job to process this. In
> > >     > > > the meantime, I wonder if I could use NiFi for this purpose
> > >     > > > (because we are already using it for data ingest and
> > >     > > > streaming).
> > >     > > >
> > >     > > > Could you elaborate on why it may be overkill? The idea is to
> > >     > > > have everything in one place instead of hacking into Metron
> > >     > > > libraries and code.
> > >     > > >
> > >     > > > - Dima
> > >     > > >
> > >     > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > >     > > > > Hi Dima -
> > >     > > > >
> > >     > > > > What type of analytics are you looking to do?  Is the
> > >     > > > > normalized format not working?  You could use an Oozie or
> > >     > > > > Spark job to create derivative tables.
> > >     > > > >
> > >     > > > > NiFi may be overkill for breaking up the Kafka stream.
> > >     > > > > Spark Streaming may be easier.
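> > >     > > > >
> > >     > > > > As a rough sketch (topic, broker, paths, and the JSON field
> > >     > > > > name are all made up, and it needs the spark-sql-kafka
> > >     > > > > package), a Structured Streaming job could split by sensor:
> > >     > > > >
> > >     > > > >     import org.apache.spark.sql.*;
> > >     > > > >
> > >     > > > >     SparkSession spark = SparkSession.builder()
> > >     > > > >         .appName("split-by-sensor").getOrCreate();
> > >     > > > >     // Read the raw JSON off the indexing topic.
> > >     > > > >     Dataset<Row> in = spark.readStream().format("kafka")
> > >     > > > >         .option("kafka.bootstrap.servers", "broker:6667")
> > >     > > > >         .option("subscribe", "indexing").load();
> > >     > > > >     // Pull out the sensor name and partition on it.
> > >     > > > >     in.selectExpr("CAST(value AS STRING) AS json")
> > >     > > > >       .selectExpr(
> > >     > > > >           "get_json_object(json, '$.sensor') AS sensor",
> > >     > > > >           "json")
> > >     > > > >       .writeStream().format("parquet")
> > >     > > > >       .partitionBy("sensor")
> > >     > > > >       .option("path", "hdfs:///metron/long_term")
> > >     > > > >       .option("checkpointLocation", "hdfs:///tmp/ckpt")
> > >     > > > >       .start();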
> > >     > > > >
> > >     > > > > Thanks
> > >     > > > > Carolyn
> > >     > > > >
> > >     > > > >
> > >     > > > >
> > >     > > > > Sent from my Verizon, Samsung Galaxy smartphone
> > >     > > > >
> > >     > > > >
> > >     > > > > -------- Original message --------
> > >     > > > > From: Dima Kovalyov <Dima.Kovalyov@sstech.us>
> > >     > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > >     > > > > To: dev@metron.incubator.apache.org
> > >     > > > > Subject: Long-term storage for enriched data
> > >     > > > >
> > >     > > > > Hello,
> > >     > > > >
> > >     > > > > Currently we are researching a fast and resource-efficient
> > >     > > > > way to save enriched data in Hive for further analytics.
> > >     > > > >
> > >     > > > > There are two scenarios that we are considering:
> > >     > > > > a) Use an Oozie Java job that calls Metron's enrichment
> > >     > > > > classes to "manually" enrich each line of the source data
> > >     > > > > picked up from the source dir (this is the one we have
> > >     > > > > already developed on our own and are using). Downside:
> > >     > > > > custom code built on top of the Metron source code.
> > >     > > > >
> > >     > > > > b) Use NiFi to listen to the indexing Kafka topic -> split
> > >     > > > > the stream by source type -> put every source type into a
> > >     > > > > corresponding Hive table.
> > >     > > > >
> > >     > > > > I wonder if anyone has gone in either of these directions,
> > >     > > > > and whether there are best practices for this? Please advise.
> > >     > > > > Thank you.
> > >     > > > >
> > >     > > > > - Dima
> > >     > > > >
> > >     > > > >
> > >     > > >
> > >     > > >
> > >     > >
> > >     >
> > >     --
> > >
> > >     Jon
> > >
> > >     Sent from my mobile device
> > >
> > >
> > >
> > >
> > --
> >
> > Jon
> >
> > Sent from my mobile device
> >
>
-- 

Jon

Sent from my mobile device
