metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 23:45:40 GMT
I agree too.  I confirmed the GeoIP2 Java API is ASF2.0 licensed, as you all no doubt knew
already.

Just a couple comments and a question:

First note that storing data in HDFS, while it avoids the deployment question, also induces
a network hop to read it.
Presumably that only happens once per update per geo bolt instance, but how do you avoid re-reading
it frequently, to make sure you see updates?
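
One possible answer, sketched here under assumptions rather than taken from any Metron code: if each update is published under a new file name and the config is changed to point at it (as Justin's proposal describes), a bolt only needs to reload when the configured path differs from the one it already loaded. The class name `ReloadOnPathChange` and the `loader` function are hypothetical; in practice the loader would copy the file from HDFS and open it with the MaxMind reader.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/**
 * Hypothetical holder (not part of Metron): reloads the GeoIP database only
 * when the configured path changes, so the expensive read happens once per
 * update per instance instead of once per message.
 */
class ReloadOnPathChange<T> {
    /** Immutable pairing of a path with the database loaded from it. */
    private static final class Loaded<T> {
        final String path;
        final T db;
        Loaded(String path, T db) { this.path = path; this.db = db; }
    }

    // Assumed to do the actual fetch-and-open work for a given path.
    private final Function<String, T> loader;
    private final AtomicReference<Loaded<T>> current = new AtomicReference<>();

    ReloadOnPathChange(Function<String, T> loader) {
        this.loader = Objects.requireNonNull(loader);
    }

    /** Call from the config-update callback; returns true if a reload happened. */
    public synchronized boolean onConfigUpdate(String configuredPath) {
        Loaded<T> seen = current.get();
        if (seen != null && seen.path.equals(configuredPath)) {
            return false; // same file as before: skip the re-read
        }
        T fresh = loader.apply(configuredPath); // the one expensive read per update
        current.set(new Loaded<>(configuredPath, fresh)); // swap in the new database
        return true;
    }

    /** Per-message access: no I/O, just returns the currently loaded database. */
    public T get() {
        Loaded<T> seen = current.get();
        if (seen == null) {
            throw new IllegalStateException("no database loaded yet");
        }
        return seen.db;
    }
}
```

Note that this only detects updates that change the path. If the file were overwritten in place, something else (a version field in the config, or a modification-time check) would be needed.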

Second, I just want to comment that there is not a single point of failure for an enterprise
db that has been properly set up for HA.  Granted that’s neither here nor there if we don’t
need a db, but it isn’t a valid argument against using a db. :-)

Thanks,
--Matt

On 1/16/17, 1:36 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:

    I'm also in agreement on this.
    
    On Mon, Jan 16, 2017 at 2:11 PM, Nick Allen <nick@nickallen.org> wrote:
    
    > +1 to using the Java API with the MMDB file provided by Maxmind.  This is
    > what I had thought we were doing when we discussed this a few months back.
    > I'd rather use the Maxmind tools as-provided instead of engineering
    > something on top of it.
    >
    > On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmeyer0@gmail.com> wrote:
    >
    > > Matt, I agree with your points on why we shouldn't just get rid of the
    > > database just to get rid of a database. But IMO, I think we may be
    > > reinventing the wheel a little bit by even putting the maxmind data into
    > > MySQL. Right now we are already downloading a maxmind file. To me it
    > seems
    > > simpler to push the file to HDFS where we can pick it up and have the
    > > maxmind client use that instead of importing data into a DB and then
    > > running a query. Also, I believe the data gets updated weekly. So syncing
    > > may become easier too.
    > >
    > > James, I believe it works with the paid and free versions of geoip. I
    > know
    > > NiFi uses this client library in their Geo enrichment processor.
    > >
    > > Also, if it is decided that using a SQL database is still the best
    > > solution, I think there is a benefit to using their library. We would
    > just
    > > have to implement a `DatabaseProvider` that hits some SQL db instead of
    > > using their standard implementation.
    > >
    > > Thanks,
    > > JJ
    > >
    > > On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsirota@apache.org>
    > wrote:
    > >
    > > > Hi Guys, I just wanted to clarify one point that I think is lost in
    > this
    > > > thread.  Geo enrichment is NOT a key-value enrichment.  It requires a
    > > range
    > > > scan and a join (which is why it's implemented via mySql and not
    > Hbase).
    > > > To account for this access pattern via a key-value store you would
    > > > inevitably have to do something funky or in case of Hbase I don't think
    > > > there is a way to avoid doing a range scan.
    > > >
    > > > With respect to mapdb it only has support for Maps, Sets, Lists,
    > Queues.
    > > > Are we sure it provides enough functionality for us to do this
    > > enrichment?
    > > >
    > > > With respect to the Maxmind client, are we sure we can use it on the
    > > > mySql-backed version of their DB?  I thought the Maxmind database
    > itself
    > > is
    > > > proprietary and is something you have to pay for.  My understanding is
    > > that
    > > > the client is designed for that proprietary version.
    > > >
    > > > I somewhat agree with Matt's point.  If mySql is a problem because of
    > > > licensing, the path of least resistance to remove mySql dependencies
    > > would
    > > > be to simply switch to postgresql.  We will always have conventional
    > sql
    > > > databases in our stack because other big data tools use them. Why not
    > > take
    > > > advantage of them too?
    > > >
    > > > Thanks,
    > > > James
    > > >
    > > > 16.01.2017, 12:27, "Matt Foley" <mattf@apache.org>:
    > > > > Hi Justin, and team,
    > > > > Several components of the Hadoop Stack utilize a SQL database,
    > usually
    > > > for metadata of some sort. Ambari knows this and arranges for them to
    > > share
    > > > a single database installation (on or off the cluster), unless they
    > > > explicitly configure use of different databases (which is allowed for
    > > sites
    > > > that desire it). Ambari defaults to using PostgreSQL, although it’s happy
    > to
    > > > use MySQL, Oracle, or Microsoft, along with whatever each component
    > > > historically defined as their default (such as Derby).
    > > > >
    > > > > If we want to start with a replacement of current functionality, I
    > > would
    > > > suggest switching the default database to PostgreSQL. Replacing fast,
    > > > efficient, and proven db services with a file-based api library (but no
    > > > standard way to propagate the underlying storage files) seems to me to
    > be
    > > > taking a step backwards.
    > > > >
    > > > > Sticking with a SQL-based service will surely minimize the amount of
    > > > code changes needed. And making the SQL either dialect-independent or
    > > > capable of switching among dialects, then enables us to do what the
    > rest
    > > of
    > > > the Hadoop stack does: allow enterprise customers to substitute Oracle
    > or
    > > > Microsoft enterprise-class databases where they wish. Regarding the
    > > > drivers, we should study what the other Stack components do; I’m not an
    > > > expert in those areas.
    > > > >
    > > > > Using the same db as the rest of the stack also means administrators
    > > can
    > > > be confident they’ve set up adequate backup and recovery processes.
    > > > > All these are valuable reasons not to roll our own storage system for
    > > > this enrichment data. IMO, of course.
    > > > >
    > > > > Cheers,
    > > > > --Matt
    > > > >
    > > > > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichardson2@gmail.com>
    > > > wrote:
    > > > >
    > > > >     +1 Agree with David's order
    > > > >
    > > > >     -Kyle
    > > > >
    > > > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <
    > dlyle65535@gmail.com
    > > >
    > > > wrote:
    > > > >
    > > > >     > Def agree on the parity point.
    > > > >     >
    > > > >     > I'm a little worried about Supervisor relocations for non-HBase
    > > > solutions,
    > > > >     > but having much of the work done for us by MaxMind changes
my
    > > > preference to
    > > > >     > (in order)
    > > > >     >
    > > > >     > 1) MM API
    > > > >     > 2) HBase Enrichment
    > > > >     > 3) MapDB should the others prove not feasible
    > > > >     >
    > > > >     >
    > > > >     > -D...
    > > > >     >
    > > > >     >
    > > > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <
    > > > justinjleet@gmail.com>
    > > > >     > wrote:
    > > > >     >
    > > > >     > > I definitely agree on checking out the MaxMind API.
I'll
    > take a
    > > > look at
    > > > >     > > it, but at first glance it looks like it does include
    > > everything
    > > > we use.
    > > > >     > > Great find, JJ.
    > > > >     > >
    > > > >     > > More details on various people's points:
    > > > >     > >
    > > > >     > > As a note to anyone hopping in, Simon's point on the
range
    > > > lookup vs a
    > > > >     > key
    > > > >     > > lookup is why it becomes a Scan in HBase vs a Get. As
an
    > > > addendum to
    > > > >     > what
    > > > >     > > Simon mentioned, denormalizing is easy enough and turns
it
    > into
    > > > an easy
    > > > >     > > range lookup.
    > > > >     > >
    > > > >     > > To David's point, the MapDB does require a network hop,
but
    > > it's
    > > > once per
    > > > >     > > refresh of the data (Got a relevant callback? Grab new
data,
    > > > load it,
    > > > >     > swap
    > > > >     > > out) instead of (up to) once per message. I would expect
the
    > > > same to be
    > > > >     > > true of the MaxMind db files.
    > > > >     > >
    > > > >     > > I'd also argue MapDB is not really more complex than refreshing
    > > the
    > > > HBase
    > > > >     > > table, because we potentially have to start worrying
about
    > > > things like
    > > > >     > > hashing and/or indices and even just general data
    > > > >     > representation.
    > > > It's
    > > > >     > > definitely correct that the file processing has to occur
on
    > > > either path,
    > > > >     > so
    > > > >     > > it really boils down to handling the callback and reloading
    > the
    > > > file vs
    > > > >     > > handling some of the standard HBasey things. I don't
think
    > > > either is an
    > > > >     > > enormous amount of work (and both are almost certainly
more
    > > work
    > > > than
    > > > >     > > MaxMind's API)
    > > > >     > >
    > > > >     > > Regarding extensibility, I'd argue for parity with what
we
    > have
    > > > first,
    > > > >     > then
    > > > >     > > build what we need from there. Does anybody have any
    > > > disagreement with
    > > > >     > > that approach for right now?
    > > > >     > >
    > > > >     > > Justin
    > > > >     > >
    > > > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <
    > > > dlyle65535@gmail.com>
    > > > >     > wrote:
    > > > >     > >
    > > > >     > > > It is interesting- saves us a ton of effort, and has the
    > right
    > > > license.
    > > > >     > I
    > > > >     > > > think it's worth at least checking out.
    > > > >     > > >
    > > > >     > > > -D...
    > > > >     > > >
    > > > >     > > >
    > > > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston
Ball <
    > > > >     > > > simon@simonellistonball.com> wrote:
    > > > >     > > >
    > > > >     > > > > I like that approach even more. That way we
would only
    > have
    > > > to worry
    > > > >     > > > about
    > > > >     > > > > distributing the database file in binary format
to all
    > the
    > > > supervisor
    > > > >     > > > nodes
    > > > >     > > > > on update.
    > > > >     > > > >
    > > > >     > > > > It would also make it easier for people to
switch to the
    > > > enterprise
    > > > >     > DB
    > > > >     > > > > potentially if they had the license.
    > > > >     > > > >
    > > > >     > > > > One slight issue with this might be for people
who wanted
    > > to
    > > > extend
    > > > >     > the
    > > > >     > > > > database. For example, organisations may want
to add
    > > > geo-enrichment
    > > > >     > to
    > > > >     > > > > their own private network addresses based on
modified
    > versions
    > > > of the
    > > > >     > geo
    > > > >     > > > > database. Currently we don’t really allow
this, since we
    > > > hard-code
    > > > >     > > > ignoring
    > > > >     > > > > private network classes into the geo enrichment
adapter,
    > > but
    > > > I can
    > > > >     > see
    > > > >     > > a
    > > > >     > > > > case where a global org might want to add
their own
    > ranges
    > > > and
    > > > >     > > locations
    > > > >     > > > to
    > > > >     > > > > the data set. Does that make sense to anyone
else?
    > > > >     > > > >
    > > > >     > > > > Simon
    > > > >     > > > >
    > > > >     > > > >
    > > > >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmeyer0@gmail.com
    > >
    > > > wrote:
    > > > >     > > > > >
    > > > >     > > > > > Hello all,
    > > > >     > > > > >
    > > > >     > > > > > Can we leverage maxmind's Java client
(
    > > > >     > > > > > https://github.com/maxmind/
    > GeoIP2-java/tree/master/src/
    > > > >     > > > > main/java/com/maxmind/geoip2)
    > > > >     > > > > > in this case? I believe it can directly
read maxmind
    > > file.
    > > > Plus I
    > > > >     > > think
    > > > >     > > > > it
    > > > >     > > > > > also has some support for caching as
well.
    > > > >     > > > > >
    > > > >     > > > > > Thanks,
    > > > >     > > > > > JJ
    > > > >     > > > > >
    > > > >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon
Elliston Ball <
    > > > >     > > > > > simon@simonellistonball.com> wrote:
    > > > >     > > > > >
    > > > >     > > > > >> I like the idea of MapDB, since we
can essentially
    > pull
    > > an
    > > > >     > instance
    > > > >     > > > into
    > > > >     > > > > >> each supervisor, so it makes a lot
of sense for
    > > > relatively small
    > > > >     > > > scale,
    > > > >     > > > > >> relatively static enrichments in
general.
    > > > >     > > > > >>
    > > > >     > > > > >> Generally this feels like a caching
problem, and would
    > > be
    > > > for a
    > > > >     > > simple
    > > > >     > > > > >> key-value lookup. In that case I
would agree with
    > David
    > > > Lyle on
    > > > >     > > using
    > > > >     > > > > HBase
    > > > >     > > > > >> as a source of truth and relying
on caching.
    > > > >     > > > > >>
    > > > >     > > > > >> That said, GeoIP is a different lookup
pattern, since
    > > > it’s a range
    > > > >     > > > > lookup
    > > > >     > > > > >> then a key lookup (or if we denormalize
the MaxMind
    > > data,
    > > > just a
    > > > >     > > range
    > > > >     > > > > >> lookup) for that kind of thing MapDB
with something
    > like
    > > > the BTree
    > > > >     > > > > seems a
    > > > >     > > > > >> good fit.
    > > > >     > > > > >>
    > > > >     > > > > >> Simon
    > > > >     > > > > >>
    > > > >     > > > > >>
    > > > >     > > > > >>> On 16 Jan 2017, at 16:28, David
Lyle <
    > > > dlyle65535@gmail.com>
    > > > >     > wrote:
    > > > >     > > > > >>>
    > > > >     > > > > >>> I'm +1 on removing the MySQL
dependency, BUT - I'd
    > > > prefer to see
    > > > >     > it
    > > > >     > > > as
    > > > >     > > > > an
    > > > >     > > > > >>> HBase enrichment. If our current
caching isn't enough
    > > to
    > > > mitigate
    > > > >     > > the
    > > > >     > > > > >> above
    > > > >     > > > > >>> issues, we have a problem, don't
we? Or do we not
    > > > recommend HBase
    > > > >     > > > > >>> enrichment for per message enrichment
in general?
    > > > >     > > > > >>>
    > > > >     > > > > >>> Also- can you elaborate on how
MapDB would not
    > require
    > > a
    > > > network
    > > > >     > > hop?
    > > > >     > > > > >>> Doesn't this mean we would have
to sync the
    > enrichment
    > > > data to
    > > > >     > each
    > > > >     > > > > Storm
    > > > >     > > > > >>> supervisor? HDFS could (probably
would) have a
    > network
    > > > hop too,
    > > > >     > no?
    > > > >     > > > > >>>
    > > > >     > > > > >>> Fwiw -
    > > > >     > > > > >>> "In its place, I've looked at
using MapDB, which is a
    > > > really easy
    > > > >     > > to
    > > > >     > > > > use
    > > > >     > > > > >>> library for creating Java collections
backed by a
    > file
    > > > (This is
    > > > >     > > NOT a
    > > > >     > > > > >>> separate installation of anything,
it's just a jar
    > that
    > > > manages
    > > > >     > > > > >> interaction
    > > > >     > > > > >>> with the file system). Given
the slow churn of the
    > > GeoIP
    > > > files
    > > > >     > (I
    > > > >     > > > > >> believe
    > > > >     > > > > >>> they get updated once a week),
we can have a script
    > > that
    > > > can be
    > > > >     > run
    > > > >     > > > > when
    > > > >     > > > > >>> needed, downloads the MaxMind
tar file, builds the
    > > MapDB
    > > > file
    > > > >     > that
    > > > >     > > > will
    > > > >     > > > > >> be
    > > > >     > > > > >>> used by the bolts, and places
it into HDFS. Finally,
    > we
    > > > update a
    > > > >     > > > > config
    > > > >     > > > > >> to
    > > > >     > > > > >>> point to the new file, the bolts
get the updated
    > config
    > > > callback
    > > > >     > > and
    > > > >     > > > > can
    > > > >     > > > > >>> update their db files. Inside
the code, we wrap the
    > > MapDB
    > > > >     > portions
    > > > >     > > > to
    > > > >     > > > > >> make
    > > > >     > > > > >>> it transparent to downstream
code."
    > > > >     > > > > >>>
    > > > >     > > > > >>> Seems a bit more complex than
"refresh the hbase
    > > table".
    > > > Afaik,
    > > > >     > > > either
    > > > >     > > > > >>> approach would require some sort
of translation
    > between
    > > > GeoIP
    > > > >     > > source
    > > > >     > > > > >> format
    > > > >     > > > > >>> and target format, so that part
is a wash imo.
    > > > >     > > > > >>>
    > > > >     > > > > >>> So, I'd really like to see, at
least, an attempt to
    > > > leverage
    > > > >     > HBase
    > > > >     > > > > >>> enrichment.
    > > > >     > > > > >>>
    > > > >     > > > > >>> -D...
    > > > >     > > > > >>>
    > > > >     > > > > >>>
    > > > >     > > > > >>> On Mon, Jan 16, 2017 at 11:02
AM, Casey Stella <
    > > > >     > cestella@gmail.com
    > > > >     > > >
    > > > >     > > > > >> wrote:
    > > > >     > > > > >>>
    > > > >     > > > > >>>> I think that it's a sensible
thing to use MapDB for
    > > the
    > > > geo
    > > > >     > > > > enrichment.
    > > > >     > > > > >>>> Let me state my reasoning:
    > > > >     > > > > >>>>
    > > > >     > > > > >>>> - An HBase implementation
would necessitate a HBase
    > > scan
    > > > >     > > possibly
    > > > >     > > > > >>>> hitting HDFS, which is expensive
per-message.
    > > > >     > > > > >>>> - An HBase implementation
would necessitate a
    > network
    > > > hop and
    > > > >     > > MapDB
    > > > >     > > > > >>>> would not.
    > > > >     > > > > >>>>
    > > > >     > > > > >>>> I also think this might be
the beginning of a more
    > > > general
    > > > >     > purpose
    > > > >     > > > > >> support
    > > > >     > > > > >>>> in Stellar for locally shipped,
read-only MapDB
    > > > lookups, which
    > > > >     > > might
    > > > >     > > > > be
    > > > >     > > > > >>>> interesting.
    > > > >     > > > > >>>>
    > > > >     > > > > >>>> In short, all quotes about
premature optimization
    > are
    > > > sure to
    > > > >     > > apply
    > > > >     > > > to
    > > > >     > > > > >> my
    > > > >     > > > > >>>> reasoning, but I can't help
but have my spidey
    > senses
    > > > tingle
    > > > >     > when
    > > > >     > > we
    > > > >     > > > > >>>> introduce a scan-per-message
architecture.
    > > > >     > > > > >>>>
    > > > >     > > > > >>>> Casey
    > > > >     > > > > >>>>
    > > > >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53
AM, Dima Kovalyov <
    > > > >     > > > > >> Dima.Kovalyov@sstech.us>
    > > > >     > > > > >>>> wrote:
    > > > >     > > > > >>>>
    > > > >     > > > > >>>>> Hello Justin,
    > > > >     > > > > >>>>>
    > > > >     > > > > >>>>> Considering that Metron
uses hbase tables for
    > storing
    > > > >     > enrichment
    > > > >     > > > and
    > > > >     > > > > >>>>> threatintel feeds, can
we use Hbase for geo
    > > enrichment
    > > > as well?
    > > > >     > > > > >>>>> Or MapDB can be used
for enrichment and threatintel
    > > > feeds
    > > > >     > instead
    > > > >     > > > of
    > > > >     > > > > >>>> hbase?
    > > > >     > > > > >>>>>
    > > > >     > > > > >>>>> - Dima
    > > > >     > > > > >>>>>
    > > > >     > > > > >>>>> On 01/16/2017 04:17 PM,
Justin Leet wrote:
    > > > >     > > > > >>>>>> Hi all,
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> As a bit of background,
right now, GeoIP data is
    > > > loaded into
    > > > >     > and
    > > > >     > > > > >>>> managed
    > > > >     > > > > >>>>> by
    > > > >     > > > > >>>>>> MySQL (the connectors
are LGPL licensed and we
    > need
    > > > to sever
    > > > >     > our
    > > > >     > > > > Maven
    > > > >     > > > > >>>>>> dependency on it
before next release). We
    > currently
    > > > depend on
    > > > >     > > and
    > > > >     > > > > >>>> install
    > > > >     > > > > >>>>>> an instance of MySQL
(in each of the Management
    > > Pack,
    > > > Ansible,
    > > > >     > > and
    > > > >     > > > > >>>> Docker
    > > > >     > > > > >>>>>> installs). In the
topology, we use the JDBCAdapter
    > > to
    > > > connect
    > > > >     > to
    > > > >     > > > > MySQL
    > > > >     > > > > >>>>> and
    > > > >     > > > > >>>>>> query for a given
IP. Additionally, it's a single
    > > > point of
    > > > >     > > > failure
    > > > >     > > > > >> for
    > > > >     > > > > >>>>>> that particular enrichment
right now. If MySQL is
    > > > down, geo
    > > > >     > > > > >> enrichment
    > > > >     > > > > >>>>>> can't occur.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> I'm proposing that
we eliminate the use of MySQL
    > > > entirely,
    > > > >     > > through
    > > > >     > > > > all
    > > > >     > > > > >>>>>> installation paths
(which, unless I missed some,
    > > > includes
    > > > >     > > Ansible,
    > > > >     > > > > the
    > > > >     > > > > >>>>>> Ambari Management
Pack, and Docker). We'd do this
    > by
    > > > dropping
    > > > >     > > all
    > > > >     > > > > the
    > > > >     > > > > >>>>>> various MySQL setup
and management through the
    > code,
    > > > along
    > > > >     > with
    > > > >     > > > all
    > > > >     > > > > >> the
    > > > >     > > > > >>>>>> DDL, etc. The JDBCAdapter
would stay, so that
    > > anybody
    > > > who
    > > > >     > wants
    > > > >     > > > to
    > > > >     > > > > >>>> setup
    > > > >     > > > > >>>>>> their own databases
for enrichments and install
    > > > connectors is
    > > > >     > > able
    > > > >     > > > > to
    > > > >     > > > > >>>> do
    > > > >     > > > > >>>>> so.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> In its place, I've
looked at using MapDB, which
    > is a
    > > > really
    > > > >     > easy
    > > > >     > > > to
    > > > >     > > > > >> use
    > > > >     > > > > >>>>>> library for creating
Java collections backed by a
    > > > file (This
    > > > >     > is
    > > > >     > > > NOT
    > > > >     > > > > a
    > > > >     > > > > >>>>>> separate installation
of anything, it's just a jar
    > > > that
    > > > >     > manages
    > > > >     > > > > >>>>> interaction
    > > > >     > > > > >>>>>> with the file system).
Given the slow churn of the
    > > > GeoIP
    > > > >     > files
    > > > >     > > (I
    > > > >     > > > > >>>>> believe
    > > > >     > > > > >>>>>> they get updated
once a week), we can have a
    > script
    > > > that can
    > > > >     > be
    > > > >     > > > run
    > > > >     > > > > >>>> when
    > > > >     > > > > >>>>>> needed, downloads
the MaxMind tar file, builds the
    > > > MapDB file
    > > > >     > > that
    > > > >     > > > > >> will
    > > > >     > > > > >>>>> be
    > > > >     > > > > >>>>>> used by the bolts,
and places it into HDFS.
    > Finally,
    > > > we
    > > > >     > update
    > > > >     > > a
    > > > >     > > > > >>>> config
    > > > >     > > > > >>>>> to
    > > > >     > > > > >>>>>> point to the new
file, the bolts get the updated
    > > > config
    > > > >     > callback
    > > > >     > > > and
    > > > >     > > > > >>>> can
    > > > >     > > > > >>>>>> update their db files.
Inside the code, we wrap
    > the
    > > > MapDB
    > > > >     > > > portions
    > > > >     > > > > to
    > > > >     > > > > >>>>> make
    > > > >     > > > > >>>>>> it transparent to
downstream code.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> The particularly
nice parts about using MapDB are
    > > > that its
    > > > >     > ease
    > > > >     > > of
    > > > >     > > > > use
    > > > >     > > > > >>>>> plus
    > > > >     > > > > >>>>>> it offers the utilities
we need out of the box to
    > be
    > > > able to
    > > > >     > > > support
    > > > >     > > > > >>>> the
    > > > >     > > > > >>>>>> operations we need
on this (Keep in mind the GeoIP
    > > > files use
    > > > >     > IP
    > > > >     > > > > ranges
    > > > >     > > > > >>>>> and
    > > > >     > > > > >>>>>> we need to be able
to easily grab the appropriate
    > > > range).
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> The main point of
concern I have about this is
    > that
    > > > when we
    > > > >     > grab
    > > > >     > > > the
    > > > >     > > > > >>>> HDFS
    > > > >     > > > > >>>>>> file during an update,
given that multiple JVMs
    > can
    > > be
    > > > >     > running,
    > > > >     > > we
    > > > >     > > > > >>>> don't
    > > > >     > > > > >>>>>> want them to clobber
each other. I believe this
    > can
    > > > be avoided
    > > > >     > > by
    > > > >     > > > > >>>> simply
    > > > >     > > > > >>>>>> using each worker's
working directory to store the
    > > > file (and
    > > > >     > > > > >>>>> appropriately
    > > > >     > > > > >>>>>> ensure threads on
the same JVM manage
    > > > multithreading). This
    > > > >     > > > should
    > > > >     > > > > >>>> keep
    > > > >     > > > > >>>>>> the JVMs (and the
underlying DB files) entirely
    > > > independent.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> This script would
get called by the various
    > > > installations
    > > > >     > during
    > > > >     > > > > >>>> startup
    > > > >     > > > > >>>>> to
    > > > >     > > > > >>>>>> do the initial setup.
After install, it can then
    > be
    > > > called on
    > > > >     > > > > demand
    > > > >     > > > > >>>> in
    > > > >     > > > > >>>>>> order.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> At this point, we
should be all set, with
    > everything
    > > > running
    > > > >     > and
    > > > >     > > > > >>>>> updatable.
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>> Justin
    > > > >     > > > > >>>>>>
    > > > >     > > > > >>>>>
    > > > >     > > > > >>>>>
    > > > >     > > > > >>>>
    > > > >     > > > > >>
    > > > >     > > > > >>
    > > > >     > > > >
    > > > >     > > > >
    > > > >     > > >
    > > > >     > >
    > > > >     >
    > > >
    > > > -------------------
    > > > Thank you,
    > > >
    > > > James Sirota
    > > > PPMC- Apache Metron (Incubating)
    > > > jsirota AT apache DOT org
    > > >
    > >
    >
    >
    >
    > --
    > Nick Allen <nick@nickallen.org>
    >
    


