metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Batch Profiler Feature Branch
Date Fri, 21 Sep 2018 18:36:52 GMT
Here is a PR that adds the input time constraints to the Batch Profiler
(METRON-1787): https://github.com/apache/metron/pull/1209.

The consensus seems to be that this is probably the last feature we need
before merging the feature branch (FB) into master.  The other two
(METRON-1788 and METRON-1789) can wait until after the feature branch has
been merged.  Let me know if you disagree.
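
For anyone following along, here is a rough illustration of what an input
time constraint amounts to.  This is not the code from the PR, only a minimal
Python sketch.  It assumes each telemetry message carries a `timestamp` field
in epoch milliseconds and that both bounds are inclusive; the function names
`within_window` and `constrain` are made up for the example.

```python
def within_window(message, begin=None, end=None, field="timestamp"):
    """Return True if the message timestamp falls inside [begin, end].

    Assumption: timestamps are epoch milliseconds and both bounds are
    inclusive. A bound of None means "unbounded" on that side.
    """
    ts = message[field]
    if begin is not None and ts < begin:
        return False
    if end is not None and ts > end:
        return False
    return True


def constrain(messages, begin=None, end=None):
    """Keep only the messages whose timestamp is inside the window."""
    return [m for m in messages if within_window(m, begin, end)]


# Hypothetical telemetry: three messages, only the middle one is in range.
telemetry = [{"timestamp": 100}, {"timestamp": 200}, {"timestamp": 300}]
selected = constrain(telemetry, begin=150, end=250)
```

With no bounds given, everything passes through, which matches the current
behavior of running over all of the archived data.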

Thanks


On Thu, Sep 20, 2018 at 1:55 PM Nick Allen <nick@nickallen.org> wrote:

> Yeah, agreed.  Per use case 3, when deploying to production there really
> wouldn't be a huge overlap like 3 months of already profiled data.  It's day
> 1, the profile was just deployed around the same time as you are running
> the Batch Profiler, so the overlap is in minutes, maybe hours.  But I can
> definitely see the usefulness of the feature for re-runs, etc. as you have
> described.
>
> Based on this discussion, I created a few JIRAs.  Thanks all for the great
> feedback and keep it coming.
>
> [1] METRON-1787 - Input Time Constraints for Batch Profiler
> [2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
> [3] METRON-1789 - MPack Should Define Default Input Path for Batch
> Profiler
>
>
> --
> [1] https://issues.apache.org/jira/browse/METRON-1787
> [2] https://issues.apache.org/jira/browse/METRON-1788
> [3] https://issues.apache.org/jira/browse/METRON-1789
>
>
>
>
>
>
> On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <
> michael.miklavcic@gmail.com> wrote:
>
>> I think we might want to allow the flexibility to choose the date range
>> then. I don't yet feel like I have a good enough understanding of all the
>> ways in which users would want to seed to force them to run the batch job
>> over all the data. It might also make it easier to deal with remediation,
>> i.e. an error doesn't force you to re-run over the entire history. Same goes
>> for testing out the profile seeding batch job in the first place.
>>
>> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen <nick@nickallen.org> wrote:
>>
>> > Assuming you have 9 months of data archived, yes.
>> >
>> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <
>> > michael.miklavcic@gmail.com> wrote:
>> >
>> > > So in the case of 3 - if you had 6 months of data that hadn't been
>> > > profiled and another 3 that had been profiled (9 months total data), in
>> > > its current form the batch job runs over all 9 months?
>> > >
>> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <nick@nickallen.org> wrote:
>> > >
>> > > > > How do we establish "tm" from 1.1 above? Any concerns about overlap or
>> > > > > gaps after the seeding is performed?
>> > > >
>> > > > Good point.  Right now, if the Streaming and Batch Profiler overlap the
>> > > > last write wins.  And presumably the output of the Streaming and Batch
>> > > > Profiler are the same, so no worries, right? :)
>> > > >
>> > > > So it kind of works, but it is definitely not ideal for use case 3.  I
>> > > > could add --begin and --end args to constrain the time frame over which
>> > > > the Batch Profiler runs.  I do not have that in the feature branch.  It
>> > > > would be easy enough to add though.
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <
>> > > > michael.miklavcic@gmail.com> wrote:
>> > > >
>> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick.
>> > > > > Pulling at this thread just a bit more...
>> > > > >
>> > > > >    1. I have an existing system that's been up a while, and I have
>> > > > >    added k profiles - assume these are the first profiles I've created.
>> > > > >       1. I would have t0 - tm (where m is the time when the profiles
>> > > > >       were first installed) worth of data that has not been profiled yet.
>> > > > >       2. The batch profiler process would be to take that exact profile
>> > > > >       definition from ZK and run the batch loader with that from the CLI.
>> > > > >       3. Profiles are now up to date from t0 - tCurrent
>> > > > >    2. I've already done #1 above. Time goes by and now I want to add a
>> > > > >    new profile.
>> > > > >       1. Same first step above
>> > > > >       2. I would run the batch loader with *only* that new profile
>> > > > >       definition to seed?
>> > > > >
>> > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do
>> > > > > we establish "tm" from 1.1 above? Any concerns about overlap or gaps
>> > > > > after the seeding is performed?
>> > > > >
>> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <nick@nickallen.org> wrote:
>> > > > >
>> > > > > > I think more often than not, you would want to load your profile
>> > > > > > definition from a file.  This is why I considered the 'load from Zk'
>> > > > > > more of a nice-to-have.
>> > > > > >
>> > > > > >    - In use case 1 and 2, this would definitely be the case.  The
>> > > > > >    profiles I am working with are speculative and I am using the batch
>> > > > > >    profiler to determine if they are worth keeping.  In this case, my
>> > > > > >    speculative profiles would not be in Zk (yet).
>> > > > > >    - In use case 3, I could see it go either way.  It might be useful
>> > > > > >    to load from Zk, but it certainly isn't a blocker.
>> > > > > >
>> > > > > >
>> > > > > > > So if the config does not correctly match the profiler config held
>> > > > > > > in ZK and the user runs the batch seeding job, what happens?
>> > > > > >
>> > > > > > You would just get a profile that is slightly different over the
>> > > > > > entire time span.  This is not a new risk.  If the user changes their
>> > > > > > Profile definitions in Zk, the same thing would happen.
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <
>> > > > > > michael.miklavcic@gmail.com> wrote:
>> > > > > >
>> > > > > > > I think I'm torn on this, specifically because it's batch and would
>> > > > > > > generally be run as-needed. Justin, can you elaborate on your concerns
>> > > > > > > there? This feels functionally very similar to our flat file loaders,
>> > > > > > > which all have inputs for config from the CLI only. On the other hand,
>> > > > > > > our flat file loaders are not typically seeding an existing structure.
>> > > > > > > My concern of a local file profiler config stems from this stated goal:
>> > > > > > > > The goal would be to enable “profile seeding” which allows profiles
>> > > > > > > > to be populated from a time before the profile was created.
>> > > > > > > So if the config does not correctly match the profiler config held in
>> > > > > > > ZK and the user runs the batch seeding job, what happens?
>> > > > > > >
>> > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <justinjleet@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > The profile not being able to read from ZK feels like a fairly
>> > > > > > > > substantial, if subtle, set of potential problems.  I'd like to see
>> > > > > > > > that in either before merging or at least pretty soon after merging.
>> > > > > > > > Is it a lot of work to add that functionality based on where things
>> > > > > > > > are right now?
>> > > > > > > >
>> > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <nick@nickallen.org> wrote:
>> > > > > > > >
>> > > > > > > > > Here is another limitation that I just thought of.  It can only
>> > > > > > > > > read a profile definition from a file.  It probably also makes sense
>> > > > > > > > > to add an option that allows it to read the current Profiler
>> > > > > > > > > configuration from Zookeeper.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > > Is it worth setting up a default config that pulls from the main
>> > > > > > > > > > indexing output?
>> > > > > > > > >
>> > > > > > > > > Yes, I think that makes sense.  We want the Batch Profiler to point
>> > > > > > > > > to the right HDFS URL, no matter where/how Metron is deployed.  When
>> > > > > > > > > Metron gets spun-up on a cluster, I should be able to just run the
>> > > > > > > > > Batch Profiler without having to fuss with the input path.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <justinjleet@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Re:
>> > > > > > > > > >
>> > > > > > > > > > >  * You do not configure the Batch Profiler in Ambari.  It is
>> > > > > > > > > > > configured and executed completely from the command-line.
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > Is it worth setting up a default config that pulls from the main
>> > > > > > > > > > indexing output?  I'm a little on the fence about it, but it seems
>> > > > > > > > > > like making the most common case more or less built-in would be nice.
>> > > > > > > > > >
>> > > > > > > > > > Having said that, I do not consider that a requirement for merging
>> > > > > > > > > > the feature branch.
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <jsirota@apache.org> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I think what you have outlined above is a good initial stab at the
>> > > > > > > > > > > feature.  Manual install of spark is not a big deal.  Configuring
>> > > > > > > > > > > via command line while we mature this feature is ok as well.
>> > > > > > > > > > > Doesn't look like configuration steps are too hard.  I think you
>> > > > > > > > > > > should merge.
>> > > > > > > > > > >
>> > > > > > > > > > > James
>> > > > > > > > > > >
>> > > > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <nick@nickallen.org>:
>> > > > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch
>> > > > > > > > > > > > merged into master as part of METRON-1699 [1] Create Batch Profiler. All
>> > > > > > > > > > > > of the work that I had in mind for our first draft of the Batch Profiler
>> > > > > > > > > > > > has been completed. Please take a look through what I have and let me know
>> > > > > > > > > > > > if there are other features that you think are required *before* we merge.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Previous list discussions on this topic include [2] and [3].
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What can I do with the feature branch?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * With the Batch Profiler, you can backfill/seed profiles using archived
>> > > > > > > > > > > > telemetry. This enables the following types of use cases.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       1. As a Security Data Scientist, I want to understand the historical
>> > > > > > > > > > > > behaviors and trends of a profile that I have created so that I can
>> > > > > > > > > > > > determine if I have created a feature set that has predictive value for
>> > > > > > > > > > > > model building.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       2. As a Security Data Scientist, I want to understand the historical
>> > > > > > > > > > > > behaviors and trends of a profile that I have created so that I can
>> > > > > > > > > > > > determine if I have defined the profile correctly and created a feature set
>> > > > > > > > > > > > that matches reality.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       3. As a Security Platform Engineer, I want to generate a profile
>> > > > > > > > > > > > using archived telemetry when I deploy a new model to production so that
>> > > > > > > > > > > > models depending on that profile can function on day 1.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * METRON-1699 [1] includes a more detailed description of the feature.
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What work was completed?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Batch Profiler runs on Spark and was implemented in Java to remain
>> > > > > > > > > > > > consistent with our current Java-heavy code base.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Batch Profiler is executed from the command-line. It can be
>> > > > > > > > > > > > launched using a script or by calling `spark-submit`, which may be useful
>> > > > > > > > > > > > for advanced users.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Input telemetry can be consumed from multiple sources; for example HDFS
>> > > > > > > > > > > > or the local file system.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Input telemetry can be consumed in multiple formats; for example JSON
>> > > > > > > > > > > > or ORC.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The 'output' profile measurements are persisted in HBase and are
>> > > > > > > > > > > > consistent with the Storm Profiler.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * It can be run on any underlying engine supported by Spark. I have
>> > > > > > > > > > > > tested it both in 'local' mode and on a YARN cluster.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * It is installed automatically by the Metron MPack.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * A README was added that documents usage instructions.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The existing Profiler code was refactored so that as much code as
>> > > > > > > > > > > > possible is shared between the 3 Profiler ports; Storm, the Stellar REPL,
>> > > > > > > > > > > > and Spark. For example, the logic which determines the timestamp of a
>> > > > > > > > > > > > message was refactored so that it could be reused by all ports.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       * metron-profiler-common: The common Profiler code shared amongst
>> > > > > > > > > > > > each port.
>> > > > > > > > > > > >       * metron-profiler-storm: Profiler on Storm
>> > > > > > > > > > > >       * metron-profiler-spark: Profiler on Spark
>> > > > > > > > > > > >       * metron-profiler-repl: Profiler on the Stellar REPL
>> > > > > > > > > > > >       * metron-profiler-client: The client code for retrieving profile
>> > > > > > > > > > > > data; for example PROFILE_GET.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * There are 3 separate RPM and DEB packages now created for the Profiler.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       * metron-profiler-storm-*.rpm
>> > > > > > > > > > > >       * metron-profiler-spark-*.rpm
>> > > > > > > > > > > >       * metron-profiler-repl-*.rpm
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * The Profiler integration tests were enhanced to leverage the Profiler
>> > > > > > > > > > > > Client logic to validate the results.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * Review METRON-1699 [1] for a complete break-down of the tasks that have
>> > > > > > > > > > > > been completed on the feature branch.
>> > > > > > > > > > > >
>> > > > > > > > > > > > (Q) What limitations exist?
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * You must manually install Spark to use the Batch Profiler. The Metron
>> > > > > > > > > > > > MPack does not treat Spark as a Metron dependency and so does not install
>> > > > > > > > > > > > it automatically.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * You do not configure the Batch Profiler in Ambari. It is configured
>> > > > > > > > > > > > and executed completely from the command-line.
>> > > > > > > > > > > >
>> > > > > > > > > > > >   * To run the Batch Profiler in 'Full Dev', you have to take the following
>> > > > > > > > > > > > manual steps. Some of these are arguably limitations with how Ambari
>> > > > > > > > > > > > installs Spark 2 in the version of HDP that we run.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       1. Install Spark 2 using Ambari.
>> > > > > > > > > > > >
>> > > > > > > > > > > >       2. Tell Spark how to talk with HBase.
>> > > > > > > > > > > >
>> > > > > > > > > > > >         SPARK_HOME=/usr/hdp/current/spark2-client
>> > > > > > > > > > > >         cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
>> > > > > > > > > > > >
>> > > > > > > > > > > >       3. Create the Spark History directory in HDFS.
>> > > > > > > > > > > >
>> > > > > > > > > > > >         export HADOOP_USER_NAME=hdfs
>> > > > > > > > > > > >         hdfs dfs -mkdir /spark2-history
>> > > > > > > > > > > >
>> > > > > > > > > > > >       4. Change the default input path to `hdfs://localhost:8020/...` to
>> > > > > > > > > > > > match the port defined by HDP, instead of port 9000.
>> > > > > > > > > > > >
>> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
>> > > > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
>> > > > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
>> > > > > > > > > > >
>> > > > > > > > > > > -------------------
>> > > > > > > > > > > Thank you,
>> > > > > > > > > > >
>> > > > > > > > > > > James Sirota
>> > > > > > > > > > > PMC- Apache Metron
>> > > > > > > > > > > jsirota AT apache DOT org
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
