metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Tue, 17 Jan 2017 00:22:16 GMT
Sounds good!  Also use a versioning scheme via subdirectories in HDFS, so you can revert to
an earlier version if you want.
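
A minimal sketch of how versioned subdirectories and the config-property
approach quoted below could fit together, assuming the global config holds the
full HDFS path of the current database file and each update lands in its own
version subdirectory. The class name PathKeyedReloader, the example paths, and
the loader function are hypothetical placeholders, not Metron or MaxMind APIs;
in a real bolt the loader would open the MMDB file from HDFS.

```java
import java.util.Objects;
import java.util.function.Function;

/** Reloads a resource only when its configured location changes (hypothetical helper). */
final class PathKeyedReloader<T> {
    private final Function<String, T> loader; // assumption: reads the file at the given path
    private String currentPath;               // location of the currently loaded version
    private T current;                        // resource loaded from currentPath

    PathKeyedReloader(Function<String, T> loader) {
        this.loader = Objects.requireNonNull(loader);
    }

    /** Intended to be called from the config-update callback with the configured path. */
    synchronized T onConfigUpdate(String configuredPath) {
        Objects.requireNonNull(configuredPath);
        if (!configuredPath.equals(currentPath)) {
            // Load first, then swap, so a failed load leaves the old version in place.
            T fresh = loader.apply(configuredPath);
            current = fresh;
            currentPath = configuredPath;
        }
        return current;
    }

    /** Returns the currently loaded resource, or null before the first update. */
    synchronized T get() {
        return current;
    }
}
```

With a layout such as /apps/metron/geo/<version>/GeoLite2-City.mmdb (an
illustrative path only), reverting would just mean pointing the config
property back at an older subdirectory, which triggers a single reload from
that version.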

On 1/16/17, 4:11 PM, "Casey Stella" <cestella@gmail.com> wrote:

    I'd recommend storing the MM data location in HDFS in the global config.
    When the config property changes, then you know you need to reread the
    database from HDFS.  This would keep you from re-reading frequently.
    
    On Mon, Jan 16, 2017 at 18:45 Matt Foley <mattf@apache.org> wrote:
    
    > I agree too.  I confirmed the GeoIP2 Java API is ASF2.0 licensed, as you
    > all no doubt knew already.
    >
    > Just a couple comments and a question:
    >
    > First note that storing data in HDFS, while it avoids the deployment
    > question, also induces a network hop to read it.
    > Presumably that only happens once per update per geo bolt instance, but
    > how do you avoid re-reading it frequently, to make sure you see updates?
    >
    > Second, I just want to comment that there is not a single point of failure
    > for an enterprise db that has been properly set up for HA.  Granted that’s
    > neither here nor there if we don’t need a db, but it isn’t a valid argument
    > against using a db. :-)
    >
    > Thanks,
    > --Matt
    >
    > On 1/16/17, 1:36 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com>
    > wrote:
    >
    >     I'm also in agreement on this.
    >
    >     On Mon, Jan 16, 2017 at 2:11 PM, Nick Allen <nick@nickallen.org>
    > wrote:
    >
    >     > +1 to using the Java API with the MMDB file provided by Maxmind.
    > This is
    >     > what I had thought we were doing when we discussed this a few months
    > back.
    >     > I'd rather use the Maxmind tools as-provided instead of engineering
    >     > something on top of it.
    >     >
    >     > On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmeyer0@gmail.com>
    > wrote:
    >     >
    >     > > Matt, I agree with your points on why we shouldn't just get rid of
    > the
    >     > > database just to get rid of a database. But IMO, I think we may be
    >     > > reinventing the wheel a little bit by even putting the maxmind
    > data into
    >     > > MySQL. Right now we are already downloading a maxmind file. To me
    > it
    >     > seems
    >     > > simpler to push the file to HDFS where we can pick it up and have
    > the
    >     > > maxmind client use that instead of importing data into a DB and
    > then
    >     > > running a query. Also, I believe the data gets updated weekly. So
    > syncing
    >     > > may become easier too.
    >     > >
    >     > > James, I believe it works with the paid and free versions of
    > geoip. I
    >     > know
    >     > > NiFi uses this client library in their Geo enrichment processor.
    >     > >
    >     > > Also, if it is decided that using a SQL database is still the best
    >     > > solution, I think there is a benefit to using their library. We
    > would
    >     > just
    >     > > have to implement a `DatabaseProvider` that hits some SQL db
    > instead of
    >     > > using their standard implementation.
    >     > >
    >     > > Thanks,
    >     > > JJ
    >     > >
    >     > > On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsirota@apache.org>
    >     > wrote:
    >     > >
    >     > > > Hi Guys, I just wanted to clarify one point that I think is lost
    > in
    >     > this
    >     > > > thread.  Geo enrichment is NOT a key-value enrichment.  It
    > requires a
    >     > > range
    >     > > > scan and a join (which is why it's implemented via mySql and not
    >     > Hbase).
    >     > > > To account for this access pattern via a key-value store you
    > would
    >     > > > inevitably have to do something funky or in case of Hbase I
    > don't think
    >     > > > there is a way to avoid doing a range scan.
    >     > > >
    >     > > > With respect to mapdb it only has support for Maps, Sets, Lists,
    >     > Queues.
    >     > > > Are we sure it provides enough functionality for us to do this
    >     > > enrichment?
    >     > > >
    >     > > > With respect to the Maxmind client, are we sure we can use it on the
    >     > > > mySql-backed version of their DB?  I thought the Maxmind database
    >     > itself
    >     > > is
    >     > > > proprietary and is something you have to pay for.  My
    > understanding is
    >     > > that
    >     > > > the client is designed for that proprietary version.
    >     > > >
    >     > > > I somewhat agree with Matt's point.  If mySql is a problem
    > because of
    >     > > > licensing, the path of least resistance to remove mySql
    > dependencies
    >     > > would
    >     > > > be to simply switch to postgresql.  We will always have
    > conventional
    >     > sql
    >     > > > databases in our stack because other big data tools use them.
    > Why not
    >     > > take
    >     > > > advantage of them too?
    >     > > >
    >     > > > Thanks,
    >     > > > James
    >     > > >
    >     > > > 16.01.2017, 12:27, "Matt Foley" <mattf@apache.org>:
    >     > > > > Hi Justin, and team,
    >     > > > > Several components of the Hadoop Stack utilize a SQL database,
    >     > usually
    >     > > > for metadata of some sort. Ambari knows this and arranges for
    > them to
    >     > > share
    >     > > > a single database installation (on or off the cluster), unless
    > they
    >     > > > explicitly configure use of different databases (which is
    > allowed for
    >     > > sites
    >     > > > that desire it). Ambari defaults to using PostgreSQL, altho it’s
    > happy
    >     > to
    >     > > > use MySQL, Oracle, or Microsoft, along with whatever each
    > component
    >     > > > historically defined as their default (such as Derby).
    >     > > > >
    >     > > > > If we want to start with a replacement of current
    > functionality, I
    >     > > would
    >     > > > suggest switching the default database to PostgreSQL. Replacing
    > fast,
    >     > > > efficient, and proven db services with a file-based api library
    > (but no
    >     > > > standard way to propagate the underlying storage files) seems to me to be
    >     > > > taking a step backwards.
    >     > > > >
    >     > > > > Sticking with a SQL-based service will surely minimize the
    > amount of
    >     > > > code changes needed. And making the SQL either
    > dialect-independent or
    >     > > > capable of switching among dialects, then enables us to do what
    > the
    >     > rest
    >     > > of
    >     > > > the Hadoop stack does: allow enterprise customers to substitute
    > Oracle
    >     > or
    >     > > > Microsoft enterprise-class databases where they wish. Regarding
    > the
    >     > > > drivers, we should study what the other Stack components do; I’m
    > not an
    >     > > > expert in those areas.
    >     > > > >
    >     > > > > Using the same db as the rest of the stack also means
    > administrators
    >     > > can
    >     > > > be confident they’ve set up adequate backup and recovery
    > processes.
    >     > > > > All these are valuable reasons not to roll our own storage
    > system for
    >     > > > this enrichment data. IMO, of course.
    >     > > > >
    >     > > > > Cheers,
    >     > > > > --Matt
    >     > > > >
    >     > > > > On 1/16/17, 9:52 AM, "Kyle Richardson" <
    > kylerichardson2@gmail.com>
    >     > > > wrote:
    >     > > > >
    >     > > > >     +1 Agree with David's order
    >     > > > >
    >     > > > >     -Kyle
    >     > > > >
    >     > > > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <
    >     > dlyle65535@gmail.com
    >     > > >
    >     > > > wrote:
    >     > > > >
    >     > > > >     > Def agree on the parity point.
    >     > > > >     >
    >     > > > >     > I'm a little worried about Supervisor relocations
for
    > non-HBase
    >     > > > solutions,
    >     > > > >     > but having much of the work done for us by MaxMind
    > changes my
    >     > > > preference to
    >     > > > >     > (in order)
    >     > > > >     >
    >     > > > >     > 1) MM API
    >     > > > >     > 2) HBase Enrichment
    >     > > > >     > 3) MapDB should the others prove not feasible
    >     > > > >     >
    >     > > > >     >
    >     > > > >     > -D...
    >     > > > >     >
    >     > > > >     >
    >     > > > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <
    >     > > > justinjleet@gmail.com>
    >     > > > >     > wrote:
    >     > > > >     >
    >     > > > >     > > I definitely agree on checking out the MaxMind
API.
    > I'll
    >     > take a
    >     > > > look at
    >     > > > >     > > it, but at first glance it looks like it does
include
    >     > > everything
    >     > > > we use.
    >     > > > >     > > Great find, JJ.
    >     > > > >     > >
    >     > > > >     > > More details on various people's points:
    >     > > > >     > >
    >     > > > >     > > As a note to anyone hopping in, Simon's point
on the
    > range
    >     > > > lookup vs a
    >     > > > >     > key
    >     > > > >     > > lookup is why it becomes a Scan in HBase vs
a Get. As
    > an
    >     > > > addendum to
    >     > > > >     > what
    >     > > > >     > > Simon mentioned, denormalizing is easy enough
and
    > turns it
    >     > into
    >     > > > an easy
    >     > > > >     > > range lookup.
    >     > > > >     > >
    >     > > > >     > > To David's point, the MapDB does require a
network
    > hop, but
    >     > > it's
    >     > > > once per
    >     > > > >     > > refresh of the data (Got a relevant callback?
Grab new
    > data,
    >     > > > load it,
    >     > > > >     > swap
    >     > > > >     > > out) instead of (up to) once per message. I
would
    > expect the
    >     > > > same to be
    >     > > > >     > > true of the MaxMind db files.
    >     > > > >     > >
    >     > > > >     > > I'd also argue MapDB is not really more complex than refreshing the
    >     > > > >     > > HBase table, because we potentially have to start worrying about
    >     > > > >     > > things like hashing and/or indices and even just general data
    >     > > > >     > > representation. It's definitely correct that the file processing has
    >     > > > >     > > to occur on either path, so it really boils down to handling the
    >     > > > >     > > callback and reloading the file vs handling some of the standard
    >     > > > >     > > HBasey things. I don't think either is an enormous amount of work
    >     > > > >     > > (and both are almost certainly more work than MaxMind's API)
    >     > > > >     > >
    >     > > > >     > > Regarding extensibility, I'd argue for parity
with
    > what we
    >     > have
    >     > > > first,
    >     > > > >     > then
    >     > > > >     > > build what we need from there. Does anybody
have any
    >     > > > disagreement with
    >     > > > >     > > that approach for right now?
    >     > > > >     > >
    >     > > > >     > > Justin
    >     > > > >     > >
    >     > > > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle
<
    >     > > > dlyle65535@gmail.com>
    >     > > > >     > wrote:
    >     > > > >     > >
    >     > > > >     > > > It is interesting- save us a ton of effort,
and has
    > the
    >     > right
    >     > > > license.
    >     > > > >     > I
    >     > > > >     > > > think it's worth at least checking out.
    >     > > > >     > > >
    >     > > > >     > > > -D...
    >     > > > >     > > >
    >     > > > >     > > >
    >     > > > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon
Elliston
    > Ball <
    >     > > > >     > > > simon@simonellistonball.com> wrote:
    >     > > > >     > > >
    >     > > > >     > > > > I like that approach even more. That
way we would
    > only
    >     > have
    >     > > > to worry
    >     > > > >     > > > about
    >     > > > >     > > > > distributing the database file in
binary format to
    > all
    >     > the
    >     > > > supervisor
    >     > > > >     > > > nodes
    >     > > > >     > > > > on update.
    >     > > > >     > > > >
    >     > > > >     > > > > It would also make it easier for
people to switch
    > to the
    >     > > > enterprise
    >     > > > >     > DB
    >     > > > >     > > > > potentially if they had the license.
    >     > > > >     > > > >
    >     > > > >     > > > > One slight issue with this might
be for people who
    > wanted
    >     > > to
    >     > > > extend
    >     > > > >     > the
    >     > > > >     > > > > database. For example, organisations
may want to
    > add
    >     > > > geo-enrichment
    >     > > > >     > to
    >     > > > >     > > > > their own private network addresses
based modified
    >     > versions
    >     > > > of the
    >     > > > >     > geo
    >     > > > >     > > > > database. Currently we don’t really
allow this,
    > since we
    >     > > > hard-code
    >     > > > >     > > > ignoring
    >     > > > >     > > > > private network classes into the
geo enrichment
    > adapter,
    >     > > but
    >     > > > I can
    >     > > > >     > see
    >     > > > >     > > a
    >     > > > >     > > > > case where a global org might want
to add their own
    >     > ranges
    >     > > > and
    >     > > > >     > > locations
    >     > > > >     > > > to
    >     > > > >     > > > > the data set. Does that make sense
to anyone else?
    >     > > > >     > > > >
    >     > > > >     > > > > Simon
    >     > > > >     > > > >
    >     > > > >     > > > >
    >     > > > >     > > > > > On 16 Jan 2017, at 16:50, JJ
Meyer <
    > jjmeyer0@gmail.com
    >     > >
    >     > > > wrote:
    >     > > > >     > > > > >
    >     > > > >     > > > > > Hello all,
    >     > > > >     > > > > >
    >     > > > >     > > > > > Can we leverage maxmind's Java
client (
    >     > > > >     > > > > > https://github.com/maxmind/
    >     > GeoIP2-java/tree/master/src/
    >     > > > >     > > > > main/java/com/maxmind/geoip2)
    >     > > > >     > > > > > in this case? I believe it can
directly read
    > maxmind
    >     > > file.
    >     > > > Plus I
    >     > > > >     > > think
    >     > > > >     > > > > it
    >     > > > >     > > > > > also has some support for caching
as well.
    >     > > > >     > > > > >
    >     > > > >     > > > > > Thanks,
    >     > > > >     > > > > > JJ
    >     > > > >     > > > > >
    >     > > > >     > > > > > On Mon, Jan 16, 2017 at 10:32
AM, Simon Elliston
    > Ball <
    >     > > > >     > > > > > simon@simonellistonball.com>
wrote:
    >     > > > >     > > > > >
    >     > > > >     > > > > >> I like the idea of MapDB,
since we can
    > essentially
    >     > pull
    >     > > an
    >     > > > >     > instance
    >     > > > >     > > > into
    >     > > > >     > > > > >> each supervisor, so it makes
a lot of sense for
    >     > > > relatively small
    >     > > > >     > > > scale,
    >     > > > >     > > > > >> relatively static enrichments
in general.
    >     > > > >     > > > > >>
    >     > > > >     > > > > >> Generally this feels like a caching problem, and would be for a simple
    >     > > > >     > > > > >> key-value lookup. In that case I would agree with David Lyle on using
    >     > > > >     > > > > >> HBase as a source of truth and relying on caching.
    >     > > > >     > > > > >>
    >     > > > >     > > > > >> That said, GeoIP is a different lookup pattern, since it’s a range
    >     > > > >     > > > > >> lookup then a key lookup (or, if we denormalize the MaxMind data, just
    >     > > > >     > > > > >> a range lookup). For that kind of thing MapDB with something like the
    >     > > > >     > > > > >> BTree seems a good fit.
    >     > > > >     > > > > >>
    >     > > > >     > > > > >> Simon
    >     > > > >     > > > > >>
    >     > > > >     > > > > >>
    >     > > > >     > > > > >>> On 16 Jan 2017, at 16:28,
David Lyle <
    >     > > > dlyle65535@gmail.com>
    >     > > > >     > wrote:
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> I'm +1 on removing the
MySQL dependency, BUT -
    > I'd
    >     > > > prefer to see
    >     > > > >     > it
    >     > > > >     > > > as
    >     > > > >     > > > > an
    >     > > > >     > > > > >>> HBase enrichment. If
our current caching isn't
    > enough
    >     > > to
    >     > > > mitigate
    >     > > > >     > > the
    >     > > > >     > > > > >> above
    >     > > > >     > > > > >>> issues, we have a problem,
don't we? Or do we
    > not
    >     > > > recommend HBase
    >     > > > >     > > > > >>> enrichment for per message
enrichment in
    > general?
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> Also- can you elaborate
on how MapDB would not
    >     > require
    >     > > a
    >     > > > network
    >     > > > >     > > hop?
    >     > > > >     > > > > >>> Doesn't this mean we
would have to sync the
    >     > enrichment
    >     > > > data to
    >     > > > >     > each
    >     > > > >     > > > > Storm
    >     > > > >     > > > > >>> supervisor? HDFS could
(probably would) have a
    >     > network
    >     > > > hop too,
    >     > > > >     > no?
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> Fwiw -
    >     > > > >     > > > > >>> "In its place, I've
looked at using MapDB,
    > which is a
    >     > > > really easy
    >     > > > >     > > to
    >     > > > >     > > > > use
    >     > > > >     > > > > >>> library for creating
Java collections backed
    > by a
    >     > file
    >     > > > (This is
    >     > > > >     > > NOT a
    >     > > > >     > > > > >>> separate installation
of anything, it's just a
    > jar
    >     > that
    >     > > > manages
    >     > > > >     > > > > >> interaction
    >     > > > >     > > > > >>> with the file system).
Given the slow churn of
    > the
    >     > > GeoIP
    >     > > > files
    >     > > > >     > (I
    >     > > > >     > > > > >> believe
    >     > > > >     > > > > >>> they get updated once
a week), we can have a
    > script
    >     > > that
    >     > > > can be
    >     > > > >     > run
    >     > > > >     > > > > when
    >     > > > >     > > > > >>> needed, downloads the
MaxMind tar file, builds
    > the
    >     > > MapDB
    >     > > > file
    >     > > > >     > that
    >     > > > >     > > > will
    >     > > > >     > > > > >> be
    >     > > > >     > > > > >>> used by the bolts, and
places it into HDFS.
    > Finally,
    >     > we
    >     > > > update a
    >     > > > >     > > > > config
    >     > > > >     > > > > >> to
    >     > > > >     > > > > >>> point to the new file,
the bolts get the
    > updated
    >     > config
    >     > > > callback
    >     > > > >     > > and
    >     > > > >     > > > > can
    >     > > > >     > > > > >>> update their db files.
Inside the code, we
    > wrap the
    >     > > MapDB
    >     > > > >     > portions
    >     > > > >     > > > to
    >     > > > >     > > > > >> make
    >     > > > >     > > > > >>> it transparent to downstream
code."
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> Seems a bit more complex
than "refresh the
    > hbase
    >     > > table".
    >     > > > Afaik,
    >     > > > >     > > > either
    >     > > > >     > > > > >>> approach would require
some sort of translation
    >     > between
    >     > > > GeoIP
    >     > > > >     > > source
    >     > > > >     > > > > >> format
    >     > > > >     > > > > >>> and target format, so
that part is a wash imo.
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> So, I'd really like
to see, at least, an
    > attempt to
    >     > > > leverage
    >     > > > >     > HBase
    >     > > > >     > > > > >>> enrichment.
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> -D...
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>> On Mon, Jan 16, 2017
at 11:02 AM, Casey Stella
    > <
    >     > > > >     > cestella@gmail.com
    >     > > > >     > > >
    >     > > > >     > > > > >> wrote:
    >     > > > >     > > > > >>>
    >     > > > >     > > > > >>>> I think that it's
a sensible thing to use
    > MapDB for
    >     > > the
    >     > > > geo
    >     > > > >     > > > > enrichment.
    >     > > > >     > > > > >>>> Let me state my
reasoning:
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>> - An HBase implementation
would necessitate a
    > HBase
    >     > > scan
    >     > > > >     > > possibly
    >     > > > >     > > > > >>>> hitting HDFS, which
is expensive per-message.
    >     > > > >     > > > > >>>> - An HBase implementation
would necessitate a
    >     > network
    >     > > > hop and
    >     > > > >     > > MapDB
    >     > > > >     > > > > >>>> would not.
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>> I also think this
might be the beginning of a
    > more
    >     > > > general
    >     > > > >     > purpose
    >     > > > >     > > > > >> support
    >     > > > >     > > > > >>>> in Stellar for locally
shipped, read-only
    > MapDB
    >     > > > lookups, which
    >     > > > >     > > might
    >     > > > >     > > > > be
    >     > > > >     > > > > >>>> interesting.
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>> In short, all quotes
about premature
    > optimization
    >     > are
    >     > > > sure to
    >     > > > >     > > apply
    >     > > > >     > > > to
    >     > > > >     > > > > >> my
    >     > > > >     > > > > >>>> reasoning, but I
can't help but have my spidey
    >     > senses
    >     > > > tingle
    >     > > > >     > when
    >     > > > >     > > we
    >     > > > >     > > > > >>>> introduce a scan-per-message
architecture.
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>> Casey
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>> On Mon, Jan 16,
2017 at 10:53 AM, Dima
    > Kovalyov <
    >     > > > >     > > > > >> Dima.Kovalyov@sstech.us>
    >     > > > >     > > > > >>>> wrote:
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>>>> Hello Justin,
    >     > > > >     > > > > >>>>>
    >     > > > >     > > > > >>>>> Considering
that Metron uses hbase tables for
    >     > storing
    >     > > > >     > enrichment
    >     > > > >     > > > and
    >     > > > >     > > > > >>>>> threatintel
feeds, can we use Hbase for geo
    >     > > enrichment
    >     > > > as well?
    >     > > > >     > > > > >>>>> Or MapDB can
be used for enrichment and
    > threatintel
    >     > > > feeds
    >     > > > >     > instead
    >     > > > >     > > > of
    >     > > > >     > > > > >>>> hbase?
    >     > > > >     > > > > >>>>>
    >     > > > >     > > > > >>>>> - Dima
    >     > > > >     > > > > >>>>>
    >     > > > >     > > > > >>>>> On 01/16/2017
04:17 PM, Justin Leet wrote:
    >     > > > >     > > > > >>>>>> Hi all,
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> As a bit
of background, right now, GeoIP
    > data is
    >     > > > loaded into
    >     > > > >     > and
    >     > > > >     > > > > >>>> managed
    >     > > > >     > > > > >>>>> by
    >     > > > >     > > > > >>>>>> MySQL (the
connectors are LGPL licensed and
    > we
    >     > need
    >     > > > to sever
    >     > > > >     > our
    >     > > > >     > > > > Maven
    >     > > > >     > > > > >>>>>> dependency
on it before next release). We
    >     > currently
    >     > > > depend on
    >     > > > >     > > and
    >     > > > >     > > > > >>>> install
    >     > > > >     > > > > >>>>>> an instance
of MySQL (in each of the
    > Management
    >     > > Pack,
    >     > > > Ansible,
    >     > > > >     > > and
    >     > > > >     > > > > >>>> Docker
    >     > > > >     > > > > >>>>>> installs).
In the topology, we use the
    > JDBCAdapter
    >     > > to
    >     > > > connect
    >     > > > >     > to
    >     > > > >     > > > > MySQL
    >     > > > >     > > > > >>>>> and
    >     > > > >     > > > > >>>>>> query for
a given IP. Additionally, it's a
    > single
    >     > > > point of
    >     > > > >     > > > failure
    >     > > > >     > > > > >> for
    >     > > > >     > > > > >>>>>> that particular
enrichment right now. If
    > MySQL is
    >     > > > down, geo
    >     > > > >     > > > > >> enrichment
    >     > > > >     > > > > >>>>>> can't occur.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> I'm proposing
that we eliminate the use of
    > MySQL
    >     > > > entirely,
    >     > > > >     > > through
    >     > > > >     > > > > all
    >     > > > >     > > > > >>>>>> installation
paths (which, unless I missed
    > some,
    >     > > > includes
    >     > > > >     > > Ansible,
    >     > > > >     > > > > the
    >     > > > >     > > > > >>>>>> Ambari Management
Pack, and Docker). We'd
    > do this
    >     > by
    >     > > > dropping
    >     > > > >     > > all
    >     > > > >     > > > > the
    >     > > > >     > > > > >>>>>> various
MySQL setup and management through
    > the
    >     > code,
    >     > > > along
    >     > > > >     > with
    >     > > > >     > > > all
    >     > > > >     > > > > >> the
    >     > > > >     > > > > >>>>>> DDL, etc.
The JDBCAdapter would stay, so
    > that
    >     > > anybody
    >     > > > who
    >     > > > >     > wants
    >     > > > >     > > > to
    >     > > > >     > > > > >>>> setup
    >     > > > >     > > > > >>>>>> their own
databases for enrichments and
    > install
    >     > > > connectors is
    >     > > > >     > > able
    >     > > > >     > > > > to
    >     > > > >     > > > > >>>> do
    >     > > > >     > > > > >>>>> so.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> In its place,
I've looked at using MapDB,
    > which
    >     > is a
    >     > > > really
    >     > > > >     > easy
    >     > > > >     > > > to
    >     > > > >     > > > > >> use
    >     > > > >     > > > > >>>>>> library
for creating Java collections
    > backed by a
    >     > > > file (This
    >     > > > >     > is
    >     > > > >     > > > NOT
    >     > > > >     > > > > a
    >     > > > >     > > > > >>>>>> separate
installation of anything, it's
    > just a jar
    >     > > > that
    >     > > > >     > manages
    >     > > > >     > > > > >>>>> interaction
    >     > > > >     > > > > >>>>>> with the
file system). Given the slow churn
    > of the
    >     > > > GeoIP
    >     > > > >     > files
    >     > > > >     > > (I
    >     > > > >     > > > > >>>>> believe
    >     > > > >     > > > > >>>>>> they get
updated once a week), we can have a
    >     > script
    >     > > > that can
    >     > > > >     > be
    >     > > > >     > > > run
    >     > > > >     > > > > >>>> when
    >     > > > >     > > > > >>>>>> needed,
downloads the MaxMind tar file,
    > builds the
    >     > > > MapDB file
    >     > > > >     > > that
    >     > > > >     > > > > >> will
    >     > > > >     > > > > >>>>> be
    >     > > > >     > > > > >>>>>> used by
the bolts, and places it into HDFS.
    >     > Finally,
    >     > > > we
    >     > > > >     > update
    >     > > > >     > > a
    >     > > > >     > > > > >>>> config
    >     > > > >     > > > > >>>>> to
    >     > > > >     > > > > >>>>>> point to
the new file, the bolts get the
    > updated
    >     > > > config
    >     > > > >     > callback
    >     > > > >     > > > and
    >     > > > >     > > > > >>>> can
    >     > > > >     > > > > >>>>>> update their
db files. Inside the code, we
    > wrap
    >     > the
    >     > > > MapDB
    >     > > > >     > > > portions
    >     > > > >     > > > > to
    >     > > > >     > > > > >>>>> make
    >     > > > >     > > > > >>>>>> it transparent
to downstream code.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> The particularly
nice parts about using
    > MapDB are
    >     > > > that its
    >     > > > >     > ease
    >     > > > >     > > of
    >     > > > >     > > > > use
    >     > > > >     > > > > >>>>> plus
    >     > > > >     > > > > >>>>>> it offers
the utilities we need out of the
    > box to
    >     > be
    >     > > > able to
    >     > > > >     > > > support
    >     > > > >     > > > > >>>> the
    >     > > > >     > > > > >>>>>> operations
we need on this (Keep in mind
    > the GeoIP
    >     > > > files use
    >     > > > >     > IP
    >     > > > >     > > > > ranges
    >     > > > >     > > > > >>>>> and
    >     > > > >     > > > > >>>>>> we need
to be able to easily grab the
    > appropriate
    >     > > > range).
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> The main
point of concern I have about this
    > is
    >     > that
    >     > > > when we
    >     > > > >     > grab
    >     > > > >     > > > the
    >     > > > >     > > > > >>>> HDFS
    >     > > > >     > > > > >>>>>> file during
an update, given that multiple
    > JVMs
    >     > can
    >     > > be
    >     > > > >     > running,
    >     > > > >     > > we
    >     > > > >     > > > > >>>> don't
    >     > > > >     > > > > >>>>>> want them
to clobber each other. I believe
    > this
    >     > can
    >     > > > be avoided
    >     > > > >     > > by
    >     > > > >     > > > > >>>> simply
    >     > > > >     > > > > >>>>>> using each
worker's working directory to
    > store the
    >     > > > file (and
    >     > > > >     > > > > >>>>> appropriately
    >     > > > >     > > > > >>>>>> ensure threads
on the same JVM manage
    >     > > > multithreading). This
    >     > > > >     > > > should
    >     > > > >     > > > > >>>> keep
    >     > > > >     > > > > >>>>>> the JVMs
(and the underlying DB files)
    > entirely
    >     > > > independent.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> This script
would get called by the various
    >     > > > installations
    >     > > > >     > during
    >     > > > >     > > > > >>>> startup
    >     > > > >     > > > > >>>>> to
    >     > > > >     > > > > >>>>>> do the initial
setup. After install, it can
    > then
    >     > be
    >     > > > called on
    >     > > > >     > > > > demand
    >     > > > >     > > > > >>>> in
    >     > > > >     > > > > >>>>>> order.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> At this
point, we should be all set, with
    >     > everything
    >     > > > running
    >     > > > >     > and
    >     > > > >     > > > > >>>>> updatable.
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>> Justin
    >     > > > >     > > > > >>>>>>
    >     > > > >     > > > > >>>>>
    >     > > > >     > > > > >>>>>
    >     > > > >     > > > > >>>>
    >     > > > >     > > > > >>
    >     > > > >     > > > > >>
    >     > > > >     > > > >
    >     > > > >     > > > >
    >     > > > >     > > >
    >     > > > >     > >
    >     > > > >     >
    >     > > >
    >     > > > -------------------
    >     > > > Thank you,
    >     > > >
    >     > > > James Sirota
    >     > > > PPMC- Apache Metron (Incubating)
    >     > > > jsirota AT apache DOT org
    >     > > >
    >     > >
    >     >
    >     >
    >     >
    >     > --
    >     > Nick Allen <nick@nickallen.org>
    >     >
    >
    >
    >
    >
    


