metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Tue, 17 Jan 2017 00:22:53 GMT
Yep, just what I was thinking

On Mon, Jan 16, 2017 at 7:22 PM, Matt Foley <mattf@apache.org> wrote:

> Sounds good!  And use a versioning scheme via subdirectories in HDFS, so
> you can revert back if you want.
>
> On 1/16/17, 4:11 PM, "Casey Stella" <cestella@gmail.com> wrote:
>
>     I'd recommend storing the MM data location in HDFS in the global config.
>     When the config property changes, then you know you need to reread the
>     database from HDFS.  This would keep you from re-reading frequently.
>
>     On Mon, Jan 16, 2017 at 18:45 Matt Foley <mattf@apache.org> wrote:
>
>     > I agree too.  I confirmed the GeoIP2 Java API is ASF2.0 licensed, as you
>     > all no doubt knew already.
>     >
>     > Just a couple comments and a question:
>     >
>     > First note that storing data in HDFS, while it avoids the deployment
>     > question, also induces a network hop to read it.
>     > Presumably that only happens once per update per geo bolt instance, but
>     > how do you avoid re-reading it frequently, to make sure you see updates?
>     >
>     > Second, I just want to comment that there is not a single point of failure
>     > for an enterprise db that has been properly set up for HA.  Granted that’s
>     > neither here nor there if we don’t need a db, but it isn’t a valid argument
>     > against using a db. :-)
>     >
>     > Thanks,
>     > --Matt
>     >
>     > On 1/16/17, 1:36 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
>     >
>     >     I'm also in agreement on this.
>     >
>     >     On Mon, Jan 16, 2017 at 2:11 PM, Nick Allen <nick@nickallen.org> wrote:
>     >
>     >     > +1 to using the Java API with the MMDB file provided by Maxmind.
>     >     > This is what I had thought we were doing when we discussed this a
>     >     > few months back. I'd rather use the Maxmind tools as-provided
>     >     > instead of engineering something on top of it.
>     >     >
>     >     > On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmeyer0@gmail.com> wrote:
>     >     >
>     >     > > Matt, I agree with your points on why we shouldn't just get rid of
>     >     > > the database just to get rid of a database. But IMO, I think we may
>     >     > > be reinventing the wheel a little bit by even putting the maxmind
>     >     > > data into MySQL. Right now we are already downloading a maxmind
>     >     > > file. To me it seems simpler to push the file to HDFS where we can
>     >     > > pick it up and have the maxmind client use that instead of importing
>     >     > > data into a DB and then running a query. Also, I believe the data
>     >     > > gets updated weekly. So syncing may become easier too.
>     >     > >
>     >     > > James, I believe it works with the paid and free versions of geoip. I
>     >     > > know NiFi uses this client library in their Geo enrichment processor.
>     >     > >
>     >     > > Also, if it is decided that using a SQL database is still the best
>     >     > > solution, I think there is a benefit to using their library. We would
>     >     > > just have to implement a `DatabaseProvider` that hits some SQL db
>     >     > > instead of using their standard implementation.
>     >     > >
>     >     > > Thanks,
>     >     > > JJ
>     >     > >
>     >     > > On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsirota@apache.org> wrote:
>     >     > >
>     >     > > > Hi Guys, I just wanted to clarify one point that I think is lost in
>     >     > > > this thread.  Geo enrichment is NOT a key-value enrichment.  It
>     >     > > > requires a range scan and a join (which is why it's implemented via
>     >     > > > mySql and not Hbase). To account for this access pattern via a
>     >     > > > key-value store you would inevitably have to do something funky or in
>     >     > > > case of Hbase I don't think there is a way to avoid doing a range scan.
>     >     > > >
>     >     > > > With respect to mapdb it only has support for Maps, Sets, Lists,
>     >     > > > Queues. Are we sure it provides enough functionality for us to do
>     >     > > > this enrichment?
>     >     > > >
>     >     > > > With respect to the Maxmind client, are we sure we can use it on the
>     >     > > > mySql-backed version of their DB?  I thought the Maxmind database
>     >     > > > itself is proprietary and is something you have to pay for.  My
>     >     > > > understanding is that the client is designed for that proprietary
>     >     > > > version.
>     >     > > >
>     >     > > > I somewhat agree with Matt's point.  If mySql is a problem because of
>     >     > > > licensing, the path of least resistance to remove mySql dependencies
>     >     > > > would be to simply switch to postgresql.  We will always have
>     >     > > > conventional sql databases in our stack because other big data tools
>     >     > > > use them.  Why not take advantage of them too?
>     >     > > >
>     >     > > > Thanks,
>     >     > > > James
>     >     > > >
>     >     > > > 16.01.2017, 12:27, "Matt Foley" <mattf@apache.org>:
>     >     > > > > Hi Justin, and team,
>     >     > > > > Several components of the Hadoop Stack utilize a SQL
> database,
>     >     > usually
>     >     > > > for metadata of some sort. Ambari knows this and arranges
> for
>     > them to
>     >     > > share
>     >     > > > a single database installation (on or off the cluster),
> unless
>     > they
>     >     > > > explicitly configure use of different databases (which is
>     > allowed for
>     >     > > sites
>     >     > > > that desire it). Ambari defaults to using PostgreSQL,
> altho it’s
>     > happy
>     >     > to
>     >     > > > use MySQL, Oracle, or Microsoft, along with whatever each
>     > component
>     >     > > > historically defined as their default (such as Derby).
>     >     > > > >
>     >     > > > > If we want to start with a replacement of current
>     > functionality, I
>     >     > > would
>     >     > > > suggest switching the default database to PostgreSQL.
> Replacing
>     > fast,
>     >     > > > efficient, and proven db services with a file-based api
> library
>     > (but no
>     >     > > > standard way to propagate the underlying storage files)
> seems to
>     > me to
>     >     > be
>     >     > > > taking a step backwards.
>     >     > > > >
>     >     > > > > Sticking with a SQL-based service will surely minimize
> the
>     > amount of
>     >     > > > code changes needed. And making the SQL either
>     > dialect-independent or
>     >     > > > capable of switching among dialects, then enables us to do
> what
>     > the
>     >     > rest
>     >     > > of
>     >     > > > the Hadoop stack does: allow enterprise customers to
> substitute
>     > Oracle
>     >     > or
>     >     > > > Microsoft enterprise-class databases where they wish.
> Regarding
>     > the
>     >     > > > drivers, we should study what the other Stack components
> do; I’m
>     > not an
>     >     > > > expert in those areas.
>     >     > > > >
>     >     > > > > Using the same db as the rest of the stack also means
>     > administrators
>     >     > > can
>     >     > > > be confident they’ve set up adequate backup and recovery
>     > processes.
>     >     > > > > All these are valuable reasons not to roll our own
> storage
>     > system for
>     >     > > > this enrichment data. IMO, of course.
>     >     > > > >
>     >     > > > > Cheers,
>     >     > > > > --Matt
>     >     > > > >
>     >     > > > > On 1/16/17, 9:52 AM, "Kyle Richardson" <
>     > kylerichardson2@gmail.com>
>     >     > > > wrote:
>     >     > > > >
>     >     > > > >     +1 Agree with David's order
>     >     > > > >
>     >     > > > >     -Kyle
>     >     > > > >
>     >     > > > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <
>     >     > dlyle65535@gmail.com
>     >     > > >
>     >     > > > wrote:
>     >     > > > >
>     >     > > > >     > Def agree on the parity point.
>     >     > > > >     >
>     >     > > > >     > I'm a little worried about Supervisor relocations
> for
>     > non-HBase
>     >     > > > solutions,
>     >     > > > >     > but having much of the work done for us by MaxMind
>     > changes my
>     >     > > > preference to
>     >     > > > >     > (in order)
>     >     > > > >     >
>     >     > > > >     > 1) MM API
>     >     > > > >     > 2) HBase Enrichment
>     >     > > > >     > 3) MapDB should the others prove not feasible
>     >     > > > >     >
>     >     > > > >     >
>     >     > > > >     > -D...
>     >     > > > >     >
>     >     > > > >     >
>     >     > > > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <
>     >     > > > justinjleet@gmail.com>
>     >     > > > >     > wrote:
>     >     > > > >     >
>     >     > > > >     > > I definitely agree on checking out the MaxMind
> API.
>     > I'll
>     >     > take a
>     >     > > > look at
>     >     > > > >     > > it, but at first glance it looks like it does
> include
>     >     > > everything
>     >     > > > we use.
>     >     > > > >     > > Great find, JJ.
>     >     > > > >     > >
>     >     > > > >     > > More details on various people's points:
>     >     > > > >     > >
>     >     > > > >     > > As a note to anyone hopping in, Simon's point on
> the
>     > range
>     >     > > > lookup vs a
>     >     > > > >     > key
>     >     > > > >     > > lookup is why it becomes a Scan in HBase vs a
> Get. As
>     > an
>     >     > > > addendum to
>     >     > > > >     > what
>     >     > > > >     > > Simon mentioned, denormalizing is easy enough and
>     > turns it
>     >     > into
>     >     > > > an easy
>     >     > > > >     > > range lookup.
>     >     > > > >     > >
>     >     > > > >     > > To David's point, the MapDB does require a
> network
>     > hop, but
>     >     > > it's
>     >     > > > once per
>     >     > > > >     > > refresh of the data (Got a relevant callback?
> Grab new
>     > data,
>     >     > > > load it,
>     >     > > > >     > swap
>     >     > > > >     > > out) instead of (up to) once per message. I would
>     > expect the
>     >     > > > same to be
>     >     > > > >     > > true of the MaxMind db files.
>     >     > > > >     > >
>     >     > > > >     > > I'd also argue MapDB is not really more complex than
>     > refreshing
>     >     > > the
>     >     > > > HBase
>     >     > > > >     > > table, because we potentially have to start
> worrying
>     > about
>     >     > > > things like
>     >     > > > >     > > hashing and/or indices and even just general data
>     >     > representation.
>     >     > > > It's
>     >     > > > >     > > definitely correct that the file processing has
> to
>     > occur on
>     >     > > > either path,
>     >     > > > >     > so
>     >     > > > >     > > it really boils down to handling the callback and
>     > reloading
>     >     > the
>     >     > > > file vs
>     >     > > > >     > > handling some of the standard HBasey things. I
> don't
>     > think
>     >     > > > either is an
>     >     > > > >     > > enormous amount of work (and both are almost
> certainly
>     > more
>     >     > > work
>     >     > > > than
>     >     > > > >     > > MaxMind's API)
>     >     > > > >     > >
>     >     > > > >     > > Regarding extensibility, I'd argue for parity
> with
>     > what we
>     >     > have
>     >     > > > first,
>     >     > > > >     > then
>     >     > > > >     > > build what we need from there. Does anybody have
> any
>     >     > > > disagreement with
>     >     > > > >     > > that approach for right now?
>     >     > > > >     > >
>     >     > > > >     > > Justin
>     >     > > > >     > >
>     >     > > > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <
>     >     > > > dlyle65535@gmail.com>
>     >     > > > >     > wrote:
>     >     > > > >     > >
>     >     > > > >     > > > It is interesting- would save us a ton of effort,
> and has
>     > the
>     >     > right
>     >     > > > license.
>     >     > > > >     > I
>     >     > > > >     > > > think it's worth at least checking out.
>     >     > > > >     > > >
>     >     > > > >     > > > -D...
>     >     > > > >     > > >
>     >     > > > >     > > >
>     >     > > > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon
> Elliston
>     > Ball <
>     >     > > > >     > > > simon@simonellistonball.com> wrote:
>     >     > > > >     > > >
>     >     > > > >     > > > > I like that approach even more. That way we
> would
>     > only
>     >     > have
>     >     > > > to worry
>     >     > > > >     > > > about
>     >     > > > >     > > > > distributing the database file in binary
> format to
>     > all
>     >     > the
>     >     > > > supervisor
>     >     > > > >     > > > nodes
>     >     > > > >     > > > > on update.
>     >     > > > >     > > > >
>     >     > > > >     > > > > It would also make it easier for people to
> switch
>     > to the
>     >     > > > enterprise
>     >     > > > >     > DB
>     >     > > > >     > > > > potentially if they had the license.
>     >     > > > >     > > > >
>     >     > > > >     > > > > One slight issue with this might be for
> people who
>     > wanted
>     >     > > to
>     >     > > > extend
>     >     > > > >     > the
>     >     > > > >     > > > > database. For example, organisations may
> want to
>     > add
>     >     > > > geo-enrichment
>     >     > > > >     > to
>     >     > > > >     > > > > their own private network addresses based on
> modified
>     >     > versions
>     >     > > > of the
>     >     > > > >     > geo
>     >     > > > >     > > > > database. Currently we don’t really allow
> this,
>     > since we
>     >     > > > hard-code
>     >     > > > >     > > > ignoring
>     >     > > > >     > > > > private network classes into the geo
> enrichment
>     > adapter,
>     >     > > but
>     >     > > > I can
>     >     > > > >     > see
>     >     > > > >     > > a
>     >     > > > >     > > > > case where a global org might want to add
> their own
>     >     > ranges
>     >     > > > and
>     >     > > > >     > > locations
>     >     > > > >     > > > to
>     >     > > > >     > > > > the data set. Does that make sense to anyone
> else?
>     >     > > > >     > > > >
>     >     > > > >     > > > > Simon
>     >     > > > >     > > > >
>     >     > > > >     > > > >
>     >     > > > >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <
>     > jjmeyer0@gmail.com
>     >     > >
>     >     > > > wrote:
>     >     > > > >     > > > > >
>     >     > > > >     > > > > > Hello all,
>     >     > > > >     > > > > >
>     >     > > > >     > > > > > Can we leverage maxmind's Java client (
>     >     > > > >     > > > > > https://github.com/maxmind/
>     >     > GeoIP2-java/tree/master/src/
>     >     > > > >     > > > > main/java/com/maxmind/geoip2)
>     >     > > > >     > > > > > in this case? I believe it can directly
> read
>     > maxmind
>     >     > > file.
>     >     > > > Plus I
>     >     > > > >     > > think
>     >     > > > >     > > > > it
>     >     > > > >     > > > > > also has some support for caching as well.
>     >     > > > >     > > > > >
>     >     > > > >     > > > > > Thanks,
>     >     > > > >     > > > > > JJ
>     >     > > > >     > > > > >
>     >     > > > >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon
> Elliston
>     > Ball <
>     >     > > > >     > > > > > simon@simonellistonball.com> wrote:
>     >     > > > >     > > > > >
>     >     > > > >     > > > > >> I like the idea of MapDB, since we can
>     > essentially
>     >     > pull
>     >     > > an
>     >     > > > >     > instance
>     >     > > > >     > > > into
>     >     > > > >     > > > > >> each supervisor, so it makes a lot of
> sense for
>     >     > > > relatively small
>     >     > > > >     > > > scale,
>     >     > > > >     > > > > >> relatively static enrichments in general.
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >> Generally this feels like a caching
> problem,
>     > and would
>     >     > > be
>     >     > > > for a
>     >     > > > >     > > simple
>     >     > > > >     > > > > >> key-value lookup. In that case I would
> agree
>     > with
>     >     > David
>     >     > > > Lyle on
>     >     > > > >     > > using
>     >     > > > >     > > > > HBase
>     >     > > > >     > > > > >> as a source of truth and relying on
> caching.
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >> That said, GeoIP is a different lookup
> pattern,
>     > since
>     >     > > > it’s a range
>     >     > > > >     > > > > lookup
>     >     > > > >     > > > > >> then a key lookup (or if we denormalize
> the
>     > MaxMind
>     >     > > data,
>     >     > > > just a
>     >     > > > >     > > range
>     >     > > > >     > > > > >> lookup) for that kind of thing MapDB with
>     > something
>     >     > like
>     >     > > > the BTree
>     >     > > > >     > > > > seems a
>     >     > > > >     > > > > >> good fit.
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >> Simon
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <
>     >     > > > dlyle65535@gmail.com>
>     >     > > > >     > wrote:
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> I'm +1 on removing the MySQL dependency,
> BUT -
>     > I'd
>     >     > > > prefer to see
>     >     > > > >     > it
>     >     > > > >     > > > as
>     >     > > > >     > > > > an
>     >     > > > >     > > > > >>> HBase enrichment. If our current caching
> isn't
>     > enough
>     >     > > to
>     >     > > > mitigate
>     >     > > > >     > > the
>     >     > > > >     > > > > >> above
>     >     > > > >     > > > > >>> issues, we have a problem, don't we? Or
> do we
>     > not
>     >     > > > recommend HBase
>     >     > > > >     > > > > >>> enrichment for per message enrichment in
>     > general?
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> Also- can you elaborate on how MapDB
> would not
>     >     > require
>     >     > > a
>     >     > > > network
>     >     > > > >     > > hop?
>     >     > > > >     > > > > >>> Doesn't this mean we would have to sync
> the
>     >     > enrichment
>     >     > > > data to
>     >     > > > >     > each
>     >     > > > >     > > > > Storm
>     >     > > > >     > > > > >>> supervisor? HDFS could (probably would)
> have a
>     >     > network
>     >     > > > hop too,
>     >     > > > >     > no?
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> Fwiw -
>     >     > > > >     > > > > >>> "In its place, I've looked at using
> MapDB,
>     > which is a
>     >     > > > really easy
>     >     > > > >     > > to
>     >     > > > >     > > > > use
>     >     > > > >     > > > > >>> library for creating Java collections
> backed
>     > by a
>     >     > file
>     >     > > > (This is
>     >     > > > >     > > NOT a
>     >     > > > >     > > > > >>> separate installation of anything, it's
> just a
>     > jar
>     >     > that
>     >     > > > manages
>     >     > > > >     > > > > >> interaction
>     >     > > > >     > > > > >>> with the file system). Given the slow
> churn of
>     > the
>     >     > > GeoIP
>     >     > > > files
>     >     > > > >     > (I
>     >     > > > >     > > > > >> believe
>     >     > > > >     > > > > >>> they get updated once a week), we can
> have a
>     > script
>     >     > > that
>     >     > > > can be
>     >     > > > >     > run
>     >     > > > >     > > > > when
>     >     > > > >     > > > > >>> needed, downloads the MaxMind tar file,
> builds
>     > the
>     >     > > MapDB
>     >     > > > file
>     >     > > > >     > that
>     >     > > > >     > > > will
>     >     > > > >     > > > > >> be
>     >     > > > >     > > > > >>> used by the bolts, and places it into
> HDFS.
>     > Finally,
>     >     > we
>     >     > > > update a
>     >     > > > >     > > > > config
>     >     > > > >     > > > > >> to
>     >     > > > >     > > > > >>> point to the new file, the bolts get the
>     > updated
>     >     > config
>     >     > > > callback
>     >     > > > >     > > and
>     >     > > > >     > > > > can
>     >     > > > >     > > > > >>> update their db files. Inside the code,
> we
>     > wrap the
>     >     > > MapDB
>     >     > > > >     > portions
>     >     > > > >     > > > to
>     >     > > > >     > > > > >> make
>     >     > > > >     > > > > >>> it transparent to downstream code."
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> Seems a bit more complex than "refresh
> the
>     > hbase
>     >     > > table".
>     >     > > > Afaik,
>     >     > > > >     > > > either
>     >     > > > >     > > > > >>> approach would require some sort of
> translation
>     >     > between
>     >     > > > GeoIP
>     >     > > > >     > > source
>     >     > > > >     > > > > >> format
>     >     > > > >     > > > > >>> and target format, so that part is a
> wash imo.
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> So, I'd really like to see, at least, an
>     > attempt to
>     >     > > > leverage
>     >     > > > >     > HBase
>     >     > > > >     > > > > >>> enrichment.
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> -D...
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey
> Stella
>     > <
>     >     > > > >     > cestella@gmail.com
>     >     > > > >     > > >
>     >     > > > >     > > > > >> wrote:
>     >     > > > >     > > > > >>>
>     >     > > > >     > > > > >>>> I think that it's a sensible thing to
> use
>     > MapDB for
>     >     > > the
>     >     > > > geo
>     >     > > > >     > > > > enrichment.
>     >     > > > >     > > > > >>>> Let me state my reasoning:
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>> - An HBase implementation would
> necessitate a
>     > HBase
>     >     > > scan
>     >     > > > >     > > possibly
>     >     > > > >     > > > > >>>> hitting HDFS, which is expensive
> per-message.
>     >     > > > >     > > > > >>>> - An HBase implementation would
> necessitate a
>     >     > network
>     >     > > > hop and
>     >     > > > >     > > MapDB
>     >     > > > >     > > > > >>>> would not.
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>> I also think this might be the
> beginning of a
>     > more
>     >     > > > general
>     >     > > > >     > purpose
>     >     > > > >     > > > > >> support
>     >     > > > >     > > > > >>>> in Stellar for locally shipped,
> read-only
>     > MapDB
>     >     > > > lookups, which
>     >     > > > >     > > might
>     >     > > > >     > > > > be
>     >     > > > >     > > > > >>>> interesting.
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>> In short, all quotes about premature
>     > optimization
>     >     > are
>     >     > > > sure to
>     >     > > > >     > > apply
>     >     > > > >     > > > to
>     >     > > > >     > > > > >> my
>     >     > > > >     > > > > >>>> reasoning, but I can't help but have my
> spidey
>     >     > senses
>     >     > > > tingle
>     >     > > > >     > when
>     >     > > > >     > > we
>     >     > > > >     > > > > >>>> introduce a scan-per-message
> architecture.
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>> Casey
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima
>     > Kovalyov <
>     >     > > > >     > > > > >> Dima.Kovalyov@sstech.us>
>     >     > > > >     > > > > >>>> wrote:
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>>>> Hello Justin,
>     >     > > > >     > > > > >>>>>
>     >     > > > >     > > > > >>>>> Considering that Metron uses hbase
> tables for
>     >     > storing
>     >     > > > >     > enrichment
>     >     > > > >     > > > and
>     >     > > > >     > > > > >>>>> threatintel feeds, can we use Hbase
> for geo
>     >     > > enrichment
>     >     > > > as well?
>     >     > > > >     > > > > >>>>> Or MapDB can be used for enrichment and
>     > threatintel
>     >     > > > feeds
>     >     > > > >     > instead
>     >     > > > >     > > > of
>     >     > > > >     > > > > >>>> hbase?
>     >     > > > >     > > > > >>>>>
>     >     > > > >     > > > > >>>>> - Dima
>     >     > > > >     > > > > >>>>>
>     >     > > > >     > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet
> wrote:
>     >     > > > >     > > > > >>>>>> Hi all,
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> As a bit of background, right now,
> GeoIP
>     > data is
>     >     > > > loaded into
>     >     > > > >     > and
>     >     > > > >     > > > > >>>> managed
>     >     > > > >     > > > > >>>>> by
>     >     > > > >     > > > > >>>>>> MySQL (the connectors are LGPL
> licensed and
>     > we
>     >     > need
>     >     > > > to sever
>     >     > > > >     > our
>     >     > > > >     > > > > Maven
>     >     > > > >     > > > > >>>>>> dependency on it before next
> release). We
>     >     > currently
>     >     > > > depend on
>     >     > > > >     > > and
>     >     > > > >     > > > > >>>> install
>     >     > > > >     > > > > >>>>>> an instance of MySQL (in each of the
>     > Management
>     >     > > Pack,
>     >     > > > Ansible,
>     >     > > > >     > > and
>     >     > > > >     > > > > >>>> Docker
>     >     > > > >     > > > > >>>>>> installs). In the topology, we use the
>     > JDBCAdapter
>     >     > > to
>     >     > > > connect
>     >     > > > >     > to
>     >     > > > >     > > > > MySQL
>     >     > > > >     > > > > >>>>> and
>     >     > > > >     > > > > >>>>>> query for a given IP. Additionally,
> it's a
>     > single
>     >     > > > point of
>     >     > > > >     > > > failure
>     >     > > > >     > > > > >> for
>     >     > > > >     > > > > >>>>>> that particular enrichment right now.
> If
>     > MySQL is
>     >     > > > down, geo
>     >     > > > >     > > > > >> enrichment
>     >     > > > >     > > > > >>>>>> can't occur.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> I'm proposing that we eliminate the
> use of
>     > MySQL
>     >     > > > entirely,
>     >     > > > >     > > through
>     >     > > > >     > > > > all
>     >     > > > >     > > > > >>>>>> installation paths (which, unless I
> missed
>     > some,
>     >     > > > includes
>     >     > > > >     > > Ansible,
>     >     > > > >     > > > > the
>     >     > > > >     > > > > >>>>>> Ambari Management Pack, and Docker).
> We'd
>     > do this
>     >     > by
>     >     > > > dropping
>     >     > > > >     > > all
>     >     > > > >     > > > > the
>     >     > > > >     > > > > >>>>>> various MySQL setup and management
> through
>     > the
>     >     > code,
>     >     > > > along
>     >     > > > >     > with
>     >     > > > >     > > > all
>     >     > > > >     > > > > >> the
>     >     > > > >     > > > > >>>>>> DDL, etc. The JDBCAdapter would stay,
> so
>     > that
>     >     > > anybody
>     >     > > > who
>     >     > > > >     > wants
>     >     > > > >     > > > to
>     >     > > > >     > > > > >>>> setup
>     >     > > > >     > > > > >>>>>> their own databases for enrichments
> and
>     > install
>     >     > > > connectors is
>     >     > > > >     > > able
>     >     > > > >     > > > > to
>     >     > > > >     > > > > >>>> do
>     >     > > > >     > > > > >>>>> so.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> In its place, I've looked at using
> MapDB,
>     > which
>     >     > is a
>     >     > > > really
>     >     > > > >     > easy
>     >     > > > >     > > > to
>     >     > > > >     > > > > >> use
>     >     > > > >     > > > > >>>>>> library for creating Java collections
>     > backed by a
>     >     > > > file (This
>     >     > > > >     > is
>     >     > > > >     > > > NOT
>     >     > > > >     > > > > a
>     >     > > > >     > > > > >>>>>> separate installation of anything,
> it's
>     > just a jar
>     >     > > > that
>     >     > > > >     > manages
>     >     > > > >     > > > > >>>>> interaction
>     >     > > > >     > > > > >>>>>> with the file system). Given the slow
> churn
>     > of the
>     >     > > > GeoIP
>     >     > > > >     > files
>     >     > > > >     > > (I
>     >     > > > >     > > > > >>>>> believe
>     >     > > > >     > > > > >>>>>> they get updated once a week), we can
> have a
>     >     > script
>     >     > > > that can
>     >     > > > >     > be
>     >     > > > >     > > > run
>     >     > > > >     > > > > >>>> when
>     >     > > > >     > > > > >>>>>> needed, downloads the MaxMind tar
> file,
>     > builds the
>     >     > > > MapDB file
>     >     > > > >     > > that
>     >     > > > >     > > > > >> will
>     >     > > > >     > > > > >>>>> be
>     >     > > > >     > > > > >>>>>> used by the bolts, and places it into
> HDFS.
>     >     > Finally,
>     >     > > > we
>     >     > > > >     > update
>     >     > > > >     > > a
>     >     > > > >     > > > > >>>> config
>     >     > > > >     > > > > >>>>> to
>     >     > > > >     > > > > >>>>>> point to the new file, the bolts get
> the
>     > updated
>     >     > > > config
>     >     > > > >     > callback
>     >     > > > >     > > > and
>     >     > > > >     > > > > >>>> can
>     >     > > > >     > > > > >>>>>> update their db files. Inside the
> code, we
>     > wrap
>     >     > the
>     >     > > > MapDB
>     >     > > > >     > > > portions
>     >     > > > >     > > > > to
>     >     > > > >     > > > > >>>>> make
>     >     > > > >     > > > > >>>>>> it transparent to downstream code.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> The particularly nice parts about
> using
>     > MapDB are
>     >     > > > that its
>     >     > > > >     > ease
>     >     > > > >     > > of
>     >     > > > >     > > > > use
>     >     > > > >     > > > > >>>>> plus
>     >     > > > >     > > > > >>>>>> it offers the utilities we need out
> of the
>     > box to
>     >     > be
>     >     > > > able to
>     >     > > > >     > > > support
>     >     > > > >     > > > > >>>> the
>     >     > > > >     > > > > >>>>>> operations we need on this (Keep in
> mind
>     > the GeoIP
>     >     > > > files use
>     >     > > > >     > IP
>     >     > > > >     > > > > ranges
>     >     > > > >     > > > > >>>>> and
>     >     > > > >     > > > > >>>>>> we need to be able to easily grab the
>     > appropriate
>     >     > > > range).
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> The main point of concern I have
> about this
>     > is
>     >     > that
>     >     > > > when we
>     >     > > > >     > grab
>     >     > > > >     > > > the
>     >     > > > >     > > > > >>>> HDFS
>     >     > > > >     > > > > >>>>>> file during an update, given that
> multiple
>     > JVMs
>     >     > can
>     >     > > be
>     >     > > > >     > running,
>     >     > > > >     > > we
>     >     > > > >     > > > > >>>> don't
>     >     > > > >     > > > > >>>>>> want them to clobber each other. I
> believe
>     > this
>     >     > can
>     >     > > > be avoided
>     >     > > > >     > > by
>     >     > > > >     > > > > >>>> simply
>     >     > > > >     > > > > >>>>>> using each worker's working directory
> to
>     > store the
>     >     > > > file (and
>     >     > > > >     > > > > >>>>> appropriately
>     >     > > > >     > > > > >>>>>> ensure threads on the same JVM manage
>     >     > > > multithreading). This
>     >     > > > >     > > > should
>     >     > > > >     > > > > >>>> keep
>     >     > > > >     > > > > >>>>>> the JVMs (and the underlying DB files)
>     > entirely
>     >     > > > independent.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> This script would get called by the
> various
>     >     > > > installations
>     >     > > > >     > during
>     >     > > > >     > > > > >>>> startup
>     >     > > > >     > > > > >>>>> to
>     >     > > > >     > > > > >>>>>> do the initial setup. After install,
> it can
>     > then
>     >     > be
>     >     > > > called on
>     >     > > > >     > > > > demand
>     >     > > > >     > > > > >>>> in
>     >     > > > >     > > > > >>>>>> order.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> At this point, we should be all set,
> with
>     >     > everything
>     >     > > > running
>     >     > > > >     > and
>     >     > > > >     > > > > >>>>> updatable.
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>> Justin
>     >     > > > >     > > > > >>>>>>
>     >     > > > >     > > > > >>>>>
>     >     > > > >     > > > > >>>>>
>     >     > > > >     > > > > >>>>
>     >     > > > >     > > > > >>
>     >     > > > >     > > > > >>
>     >     > > > >     > > > >
>     >     > > > >     > > > >
>     >     > > > >     > > >
>     >     > > > >     > >
>     >     > > > >     >
>     >     > > >
>     >     > > > -------------------
>     >     > > > Thank you,
>     >     > > >
>     >     > > > James Sirota
>     >     > > > PPMC- Apache Metron (Incubating)
>     >     > > > jsirota AT apache DOT org
>     >     > > >
>     >     > >
>     >     >
>     >     >
>     >     >
>     >     > --
>     >     > Nick Allen <nick@nickallen.org>
>     >     >
>     >
>     >
>     >
>     >
>
>
>
>
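
Several messages in the thread point out that GeoIP enrichment is a range lookup rather than a key lookup. As a rough illustration of that access pattern only (not the MaxMind client, MapDB, or anything in Metron), the sketch below keeps range start addresses in a sorted list and binary-searches for the last start at or below the queried IP. The ranges are RFC 5737 documentation networks, the labels are made up, and the sketch assumes the ranges do not overlap.

```python
import bisect
import ipaddress

# Illustrative data only: RFC 5737 documentation networks with made-up labels.
RANGES = [
    ("192.0.2.0/24", "site-a"),
    ("198.51.100.0/24", "site-b"),
    ("203.0.113.0/24", "site-c"),
]


def build_index(ranges):
    """Build a sorted index of (start, end, label) rows plus a list of starts."""
    rows = []
    for cidr, label in ranges:
        net = ipaddress.ip_network(cidr)
        rows.append((int(net.network_address), int(net.broadcast_address), label))
    rows.sort()
    starts = [row[0] for row in rows]
    return starts, rows


def lookup(index, ip):
    """Return the label of the range containing ip, or None if no range matches."""
    starts, rows = index
    addr = int(ipaddress.ip_address(ip))
    # Find the last range whose start is <= addr (a "floor" lookup).
    pos = bisect.bisect_right(starts, addr) - 1
    if pos < 0:
        return None
    _start, end, label = rows[pos]
    # The floor range only matches if addr also falls before its end.
    return label if addr <= end else None


INDEX = build_index(RANGES)
```

A plain hash map cannot answer this query without enumerating every address, which is why the thread talks about scans, BTrees, or a purpose-built file format instead of a simple Get.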

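The last few replies settle on storing the database location in the global config, rereading from HDFS only when that property changes, and versioning the files via subdirectories so a revert is just a config change. A minimal sketch of that idea follows, under stated assumptions: the config key `geo.hdfs.file`, the directory names, and the `loader` callable are all hypothetical, and the real HDFS read and MaxMind parsing are replaced by a stand-in function.

```python
class GeoDatabaseHolder:
    """Holds the currently loaded database and reloads only on a path change."""

    CONFIG_KEY = "geo.hdfs.file"  # hypothetical global-config property name

    def __init__(self, loader):
        self._loader = loader  # callable: path -> loaded database object
        self._path = None
        self._db = None
        self.load_count = 0

    @property
    def db(self):
        return self._db

    def on_config_update(self, config):
        """Config callback: reload only if the configured path differs."""
        path = config.get(self.CONFIG_KEY)
        if path is None or path == self._path:
            return False  # nothing changed, so no network hop
        new_db = self._loader(path)  # load fully before swapping references
        self._db, self._path = new_db, path
        self.load_count += 1
        return True


# Hypothetical versioned layout: one subdirectory per database drop.
V1 = "/apps/metron/geo/2017-01-10/GeoLite2-City.mmdb"
V2 = "/apps/metron/geo/2017-01-17/GeoLite2-City.mmdb"


def fake_loader(path):
    # Stand-in for reading the file from HDFS and opening it.
    return {"source": path}


holder = GeoDatabaseHolder(fake_loader)
holder.on_config_update({GeoDatabaseHolder.CONFIG_KEY: V1})  # initial load
holder.on_config_update({GeoDatabaseHolder.CONFIG_KEY: V1})  # unchanged, skipped
holder.on_config_update({GeoDatabaseHolder.CONFIG_KEY: V2})  # weekly update
holder.on_config_update({GeoDatabaseHolder.CONFIG_KEY: V1})  # revert to old version
```

Because old subdirectories are left in place, reverting is the same operation as updating. The sketch ignores the multithreading concern raised in the original proposal; a real bolt would presumably need to guard the swap.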