metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Generating and Interacting with serialized summary objects
Date Thu, 04 Jan 2018 16:59:37 GMT
It also occurs to me that even in this situation, hard coding "Bloom" is not
a sufficient generalization: this is a bloom filter of the output of all the
typosquatted domains for the domain in each row.  If we wanted to hard code,
we'd have to hard code specifically the bloom filter *for* the typosquatting
use-case.  Hard coding this would prevent things like bloom filters
containing malicious IPs from a reference source, for instance.
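
To make that concrete, here is a minimal sketch (plain Python, not Metron
code) of a generic bloom filter. The class, its parameters and the sample
values are all hypothetical; the point is only that the same structure can
hold typosquatted domains or malicious IPs, and what differs is what each
row contributes to it.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, bits, hashes):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray((bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.array[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Same structure, two different use-cases (sample values are made up).
typosquats = BloomFilter(bits=1024, hashes=4)
for candidate in ["gooogle", "googel", "gogle"]:
    typosquats.add(candidate)

malicious_ips = BloomFilter(bits=1024, hashes=4)
for ip in ["203.0.113.7", "198.51.100.23"]:
    malicious_ips.add(ip)
```

A hard-coded "typosquat bloom" utility would only ever produce the first of
these two filters.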

On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <cestella@gmail.com> wrote:

> So, there is value outside of just bloom usage.  The most specific example
> of this: in order to configure a bloom filter, we need to know at
> least an upper bound on the number of items that are going to be added to
> the bloom filter.  In order to do that, we need to count the number of
> typosquatted domains.  Specifically, at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter
> you can see how we use the CONSOLE writer with an extractor config to count
> the number of typosquatted domains in the Alexa top 10k dataset so we can
> size the filter appropriately.
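
As a rough illustration of why the count matters (a sketch, not taken from
the linked README): given an upper bound on the number of items and a target
false positive rate, the standard bloom filter formulas give the number of
bits and hash functions. The function name and the example numbers below are
assumptions for illustration only.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bits, hash_count) from the textbook bloom filter formulas."""
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, with at least one hash function
    hash_count = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_count


bits, hash_count = bloom_parameters(expected_items=1000, false_positive_rate=0.01)
```

Underestimating the item count pushes the real false positive rate above the
target, which is why an upper bound is needed before building the filter.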
>
> I'd argue that other types of probabilistic data structures could also
> make sense here, like statistical sketches. Consider, for instance,
> a cheap and dirty DGA indicator where we take the Alexa top 1M and look at
> the distribution of Shannon entropy in the domains.  If the Shannon entropy
> of a domain going across Metron is more than 5 std devs from the mean, that
> could be circumstantial evidence of a malicious attack.  This would yield a
> lot of false positives, but used in conjunction with other indicators it
> could be valuable.
>
> Computing that would be as follows:
>
> {
>   "config" : {
>     "columns" : {
>        "rank" : 0,
>        "domain" : 1
>     },
>     "value_transform" : {
>        "domain" : "DOMAIN_REMOVE_TLD(domain)"
>     },
>     "value_filter" : "LENGTH(domain) > 0",
>     "state_init" : "STATS_INIT()",
>     "state_update" : {
>        "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
>     },
>     "state_merge" : "STATS_MERGE(states)",
>     "separator" : ","
>   },
>   "extractor" : "CSV"
> }
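
For anyone who wants to see the check itself, here is a minimal sketch in
plain Python, outside of Stellar. It assumes base-2 Shannon entropy over
characters and a reference mean and standard deviation that have already
been computed; the function names and the sample reference values are
hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Base-2 Shannon entropy of the character distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_entropy_outlier(domain, reference_mean, reference_stddev, threshold=5.0):
    """True if the domain's entropy is more than `threshold` std devs from the mean."""
    if reference_stddev == 0:
        return False
    distance = abs(shannon_entropy(domain) - reference_mean)
    return distance > threshold * reference_stddev


# Made-up reference statistics, standing in for the summarized Alexa data.
suspicious = is_entropy_outlier("abcd", reference_mean=1.0, reference_stddev=0.1)
```

In the proposal above, the mean and standard deviation would come from the
serialized statistics object rather than being passed in by hand.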
>
> Also, for another example, imagine a situation where we have a SPARK_SQL
> engine rather than just LOCAL for summarizing.  We could create a general
> summary of URL lengths in bro data which could be used for determining if
> someone is trying to send in very large URLs maliciously (see Jon Zeolla's
> concerns in https://issues.apache.org/jira/browse/METRON-517 for a
> discussion of this).  In order to do that, we could simply execute:
>
> $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o /tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json -p 5 -om HDFS -m SPARK_SQL
>
> with uri_length_extractor.json containing:
>
> {
>   "config" : {
>     "value_filter" : "LENGTH(uri) > 0",
>     "state_init" : "STATS_INIT()",
>     "state_update" : {
>        "state" : "STATS_ADD(state, LENGTH(uri))"
>     },
>     "state_merge" : "STATS_MERGE(states)",
>     "separator" : ","
>   },
>   "extractor" : "SQL_ROW"
> }
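
To show what the init / update / merge hooks buy us, here is a sketch in
plain Python of a mergeable running-statistics object. This is not how
STATS_INIT, STATS_ADD or STATS_MERGE are implemented in Metron; the class
and the summarize helper are hypothetical, and the merge uses the usual
parallel variance combination.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, value):
        """Per-row update (Welford's online algorithm)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return self

    def merge(self, other):
        """Fold another worker's state into this one."""
        if other.count == 0:
            return self
        if self.count == 0:
            self.count, self.mean, self.m2 = other.count, other.mean, other.m2
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    @property
    def variance(self):
        return self.m2 / self.count if self.count else 0.0


def summarize(partitions):
    """One state per partition (init), one add per row (update), then merge."""
    states = []
    for rows in partitions:
        state = RunningStats()
        for value in rows:
            state.add(value)
        states.append(state)
    merged = RunningStats()
    for state in states:
        merged.merge(state)
    return merged


uri_lengths = summarize([[1, 2, 3], [4, 5]])
```

Because the merge step is exact, the result does not depend on how the rows
were split across threads or workers.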
>
>
> Regarding value filter, that's already around in the extractor config
> because of the need to transform data in the flatfile loader.  While I
> definitely see the desire to use unix tools to prep data, there are some
> things that aren't as easy to do.  For instance, here, removing the TLD of
> a domain is not a trivial task in a shell script and we have existing
> functions for that in Stellar.  I would see people using both.
>
> To address the issue of a more targeted experience to bloom, I think that
> sort of specialization should best exist in the UI layer.  Having a more
> complete and expressive backend reused across specific UIs seems to be the
> best of all worlds.  It allows power users to drop down and do more complex
> things and still provides a (mostly) code-free and targeted experience for
> users.  It seems to me that limiting the expressibility in the backend
> isn't the right way to go since this work just fits in with our existing
> engine.
>
>
> On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsirota@apache.org> wrote:
>
>> I just went through these pull requests as well and also agree this is
>> good work.  I think it's a good first pass.  I would be careful with trying
>> to boil the ocean here.  I think for the initial use case I would only
>> support loading the bloom filters from HDFS.  If people want to pre-process
>> the CSV file of domains using awk or sed this should be out of scope of
>> this work.  It's easy enough to do out of band and I would not include any
>> of these functions at all.   I also think that the config could be
>> considerably simplified.  I think value_filter should be removed (since I
>> believe that preprocessing should be done by the user outside of this
>> process).  I also have a question about the init, update, and merge
>> configurations.  Would I ever initialize to anything but an empty bloom
>> filter?  For the state update would I ever do anything other than add to
>> the bloom filter?  For the state merge would I ever do anything other than
>> merge the states?  If the answer to these is 'no', then this should simply
>> be hard coded and not externalized into config values.
>>
>> 03.01.2018, 14:20, "Michael Miklavcic" <michael.miklavcic@gmail.com>:
>> > I just finished stepping through the typosquatting use case README in your
>> > merge branch. This is really, really good work Casey. I see most of our
>> > previous documentation issues addressed up front, e.g. special variables
>> > are cited, all new fields explained, side effects documented. The use case
>> > doc brings it all together soup-to-nuts and I think all the pieces make
>> > sense in a mostly self-contained way. I can't think of anything I had to
>> > sit and think about for more than a few seconds. I'll be making my way
>> > through your individual PR's in more detail, but my first impressions are
>> > that this is excellent.
>> >
>> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <michael.miklavcic@gmail.com> wrote:
>> >
>> >>  I'm liking this design and growth strategy, Casey. I also think Nick and
>> >>  Otto have some valid points. I always find there's a natural tension
>> >>  between too little, just enough, and boiling the ocean and these discuss
>> >>  threads really help drive what the short and long term visions should
>> >>  look like.
>> >>
>> >>  On the subject of repositories and strategies, I agree that pluggable
>> >>  repos and strategies for modifying them would be useful. For the first
>> >>  pass, I'd really like to see HDFS with the proposed set of Stellar
>> >>  functions. This gives us a lot of bang for our buck - we can capitalize
>> >>  on a set of powerful features around existence checking earlier without
>> >>  having to worry about later interface changes impacting users. With the
>> >>  primary interface coming through the JSON config, we are building a nice
>> >>  facade that protects users from later implementation abstractions and
>> >>  improvements, all while providing a stable enough interface on which we
>> >>  can develop UI features as desired. I'd be interested to hear more about
>> >>  what features could be provided by a repository as time goes by.
>> >>  Federation, permissions, governance, metadata management, perhaps?
>> >>
>> >>  I also had some concern over duplicating existing Unix features. I think
>> >>  where I'm at has been largely addressed by Casey's comments on 1) scaling,
>> >>  2) multiple variables, and 3) portability to Hadoop. Providing 2
>> >>  approaches - 1 which is config-based and the other a composable set of
>> >>  functions - gives us the ability to provide a core set of features that
>> >>  can later be easily expanded by users as the need arises. Here again I
>> >>  think the prescribed approach provides a strong first pass that we can
>> >>  then expand on without concern of future improvements becoming a hassle
>> >>  for end users.
>> >>
>> >>  Best,
>> >>  Mike
>> >>
>> >>  On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <simon@simonellistonball.com> wrote:
>> >>
>> >>>  There is some really cool stuff happening here, if only I’d been allowed
>> >>>  to see the lists over Christmas... :)
>> >>>
>> >>>  A few thoughts...
>> >>>
>> >>>  I like Otto’s generalisation of the problem to include specific local
>> >>>  stellar objects in a cache loaded from a store (HDFS seems a natural, but
>> >>>  not the only, place; maybe even a web service / local microservicey object
>> >>>  provider!?) That said, I suspect that’s a good platform optimisation
>> >>>  approach. Should we look at this as a separate piece of work given it
>> >>>  extends beyond the scope of the summarisation concept and ultimately use
>> >>>  it as a back-end to feed the summarising engine proposed here for the
>> >>>  enrichment loader?
>> >>>
>> >>>  On the more specific use case, one thing I would comment on is the
>> >>>  configuration approach. The iteration loop (state_{init|update|merge})
>> >>>  should be consistent with the way we handle things like the profiler
>> >>>  config, since it’s the same approach to data handling.
>> >>>
>> >>>  The other thing that seems to have crept in here is the interface to
>> >>>  something like Spark, which again, I am really very very keen on seeing
>> >>>  happen. That said, not sure how that would happen in this context, unless
>> >>>  you’re talking about pushing to something like Livy, for example
>> >>>  (eminently sensible for things like cross instance caching and faster
>> >>>  RPC-ish access to an existing Spark context, which seems to be what Casey
>> >>>  is driving at with the Spark piece).
>> >>>
>> >>>  To address the question of text manipulation in Stellar / metron
>> >>>  enrichment ingest etc, we already have this outside of the context of the
>> >>>  issues here. I would argue that yes, we don’t want too many paths for
>> >>>  this, and that maybe our parser approach might be heavily related to
>> >>>  text-based ingest. I would say the scope worth dealing with here though is
>> >>>  not really text manipulation, but summarisation, which is not well served
>> >>>  by existing CLI tools like awk / sed and friends.
>> >>>
>> >>>  Simon
>> >>>
>> >>>  > On 3 Jan 2018, at 15:48, Nick Allen <nick@nickallen.org> wrote:
>> >>>  >
>> >>>  >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I
>> >>>  >> think this will impact performance
>> >>>  >
>> >>>  > What exactly takes an hour? Adding 1M entries to a bloom filter? That
>> >>>  > seems really high, unless I am not understanding something.
>> >>>  >
>> >>>  > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <cestella@gmail.com> wrote:
>> >>>  >
>> >>>  >> Thanks for the feedback, Nick.
>> >>>  >>
>> >>>  >> Regarding "IMHO, I'd rather not reinvent the wheel for text
>> >>>  >> manipulation."
>> >>>  >>
>> >>>  >> I would argue that we are not reinventing the wheel for text
>> >>>  >> manipulation as the extractor config exists already and we are doing a
>> >>>  >> similar thing in the flatfile loader (in fact, the code is reused and
>> >>>  >> merely extended). Transformation operations are already supported in our
>> >>>  >> codebase in the extractor config, this PR has just added some hooks for
>> >>>  >> stateful operations.
>> >>>  >>
>> >>>  >> Furthermore, we will need a configuration object to pass to the REST
>> >>>  >> call if we are ever to create a UI around importing data into hbase or
>> >>>  >> creating these summary objects.
>> >>>  >>
>> >>>  >> Regarding your example:
>> >>>  >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> >>>  >> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> >>>  >>
>> >>>  >> I'm very sympathetic to this type of extension, but it has some issues:
>> >>>  >>
>> >>>  >> 1. This implies a single-threaded addition to the bloom filter.
>> >>>  >>    1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>> >>>  >>       think this will impact performance
>> >>>  >>    2. There's not a way to specify how to merge across threads if we do
>> >>>  >>       make a multithread command line option
>> >>>  >> 2. This restricts these kinds of operations to roles with heavy unix CLI
>> >>>  >>    knowledge, which isn't often the types of people who would be doing
>> >>>  >>    this type of operation
>> >>>  >> 3. What if we need two variables passed to stellar?
>> >>>  >> 4. This approach will be harder to move to Hadoop. Eventually we will
>> >>>  >>    want to support data on HDFS being processed by Hadoop (similar to
>> >>>  >>    flatfile loader), so instead of -m LOCAL being passed for the flatfile
>> >>>  >>    summarizer you'd pass -m SPARK and the processing would happen on the
>> >>>  >>    cluster
>> >>>  >>    1. This is particularly relevant in this case as it's an
>> >>>  >>       embarrassingly parallel problem in general
>> >>>  >>
>> >>>  >> In summary, while a CLI approach is attractive, I prefer the extractor
>> >>>  >> config solution because it is the solution with the smallest iteration
>> >>>  >> that:
>> >>>  >>
>> >>>  >> 1. Reuses existing metron extraction infrastructure
>> >>>  >> 2. Provides the most solid base for the extensions that will be sorely
>> >>>  >>    needed soon (and will keep it in parity with the flatfile loader)
>> >>>  >> 3. Provides the most solid base for a future UI extension in the
>> >>>  >>    management UI to support both summarization and loading
>> >>>  >>
>> >>>  >>
>> >>>  >>
>> >>>  >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <nick@nickallen.org> wrote:
>> >>>  >>
>> >>>  >>> First off, I really do like the typosquatting use case and a lot of
>> >>>  >>> what you have described.
>> >>>  >>>
>> >>>  >>>> We need a way to generate the summary sketches from flat data for
>> >>>  >>>> this to work.
>> >>>  >>>> ..
>> >>>  >>>>
>> >>>  >>>
>> >>>  >>> I took this quote directly from your use case. Above is the point that
>> >>>  >>> I'd like to discuss and what your proposed solutions center on. This is
>> >>>  >>> what I think you are trying to do, at least with PR #879
>> >>>  >>> <https://github.com/apache/metron/pull/879>...
>> >>>  >>>
>> >>>  >>> (Q) Can we repurpose Stellar functions so that they can operate on text
>> >>>  >>> stored in a file system?
>> >>>  >>>
>> >>>  >>>
>> >>>  >>> Whether we use the (1) Configuration or the (2) Function-based approach
>> >>>  >>> that you described, fundamentally we are introducing new ways to perform
>> >>>  >>> text manipulation inside of Stellar.
>> >>>  >>>
>> >>>  >>> IMHO, I'd rather not reinvent the wheel for text manipulation. It would
>> >>>  >>> be painful to implement and maintain a bunch of Stellar functions for
>> >>>  >>> text manipulation. People already have a large number of tools available
>> >>>  >>> to do this and everyone has their favorites. People are resistant to
>> >>>  >>> learning something new when they already are familiar with another way
>> >>>  >>> to do the same thing.
>> >>>  >>>
>> >>>  >>> So then the question is, how else can we do this? My suggestion is that
>> >>>  >>> rather than introducing text manipulation tools inside of Stellar, we
>> >>>  >>> allow people to use the text manipulation tools they already know, but
>> >>>  >>> with the Stellar functions that we already have. And the obvious way to
>> >>>  >>> tie those two things together is the Unix pipeline.
>> >>>  >>>
>> >>>  >>> A quick, albeit horribly incomplete, example to flesh this out a bit
>> >>>  >>> more based on the example you have in PR #879
>> >>>  >>> <https://github.com/apache/metron/pull/879>. This would allow me to
>> >>>  >>> integrate Stellar with whatever external tools that I want.
>> >>>  >>>
>> >>>  >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> >>>  >>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>>
>> >>>  >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <cestella@gmail.com> wrote:
>> >>>  >>>
>> >>>  >>>> I'll start this discussion off with my idea around a 2nd step that is
>> >>>  >>>> more adaptable. I propose the following set of stellar functions backed
>> >>>  >>>> by Spark in the metron-management project:
>> >>>  >>>>
>> >>>  >>>> - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
>> >>>  >>>>   Dataframe for reading the flatfile
>> >>>  >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
>> >>>  >>>>   dataframe
>> >>>  >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the
>> >>>  >>>>   dataframe using the lambda functions:
>> >>>  >>>>   - state_init - executed once per worker to initialize the state
>> >>>  >>>>   - state_update - executed once per row
>> >>>  >>>>   - state_merge - Merge the worker states into one worker state
>> >>>  >>>> - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
>> >>>  >>>>   output_path on HDFS.
>> >>>  >>>>
>> >>>  >>>> This would enable more flexibility and composability than the
>> >>>  >>>> configuration-based approach that we have in the flatfile loader.
>> >>>  >>>> My concern with this approach, and the reason I didn't do it initially,
>> >>>  >>>> was that I think that users will want at least 2 ways to summarize data
>> >>>  >>>> (or load data):
>> >>>  >>>>
>> >>>  >>>> - A configuration based approach, which enables a UI
>> >>>  >>>> - A set of stellar functions via the scriptable REPL
>> >>>  >>>>
>> >>>  >>>> I would argue that both have a place and I started with the
>> >>>  >>>> configuration based approach as it was a more natural extension of what
>> >>>  >>>> we already had. I'd love to hear thoughts about this idea too.
>> >>>  >>>>
>> >>>  >>>>
>> >>>  >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <cestella@gmail.com> wrote:
>> >>>  >>>>
>> >>>  >>>>> Hi all,
>> >>>  >>>>>
>> >>>  >>>>> I wanted to get some feedback on a sensible plan for something. It
>> >>>  >>>>> occurred to me the other day when considering the use-case of detecting
>> >>>  >>>>> typosquatted domains, that one approach was to generate the set of
>> >>>  >>>>> typosquatted domains for some set of reference domains and compare
>> >>>  >>>>> domains as they flow through.
>> >>>  >>>>>
>> >>>  >>>>> One way we could do this would be to generate this data and import the
>> >>>  >>>>> typosquatted domains into HBase. I thought, however, that another
>> >>>  >>>>> approach, which may trade off accuracy to remove the network hop and
>> >>>  >>>>> potential disk seek, would be to construct a bloom filter that includes
>> >>>  >>>>> the set of typosquatted domains.
>> >>>  >>>>>
>> >>>  >>>>> The challenge was that we don't have a way to do this currently. We do,
>> >>>  >>>>> however, have a loading infrastructure (e.g. the flatfile_loader) and
>> >>>  >>>>> configuration (see
>> >>>  >>>>> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
>> >>>  >>>>> which handles:
>> >>>  >>>>>
>> >>>  >>>>> - parsing flat files
>> >>>  >>>>> - transforming the rows
>> >>>  >>>>> - filtering the rows
>> >>>  >>>>>
>> >>>  >>>>> To enable the new use-case of generating a summary object (e.g. a bloom
>> >>>  >>>>> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
>> >>>  >>>>> propose that we create a new utility that uses the same extractor config
>> >>>  >>>>> and adds the ability to:
>> >>>  >>>>>
>> >>>  >>>>> - initialize a state object
>> >>>  >>>>> - update the object for every row
>> >>>  >>>>> - merge the state objects (in the case of multiple threads; in the case
>> >>>  >>>>>   of one thread it's not needed).
>> >>>  >>>>>
>> >>>  >>>>> I think this is a sensible decision because:
>> >>>  >>>>>
>> >>>  >>>>> - It's a minimal movement from the flat file loader
>> >>>  >>>>> - Uses the same configs
>> >>>  >>>>> - Abstracts and reuses the existing infrastructure
>> >>>  >>>>> - Having one extractor config means that it should be easier to
>> >>>  >>>>>   generate a UI around this to simplify the experience
>> >>>  >>>>>
>> >>>  >>>>> All that being said, our extractor config is..shall we say...daunting :).
>> >>>  >>>>> I am sensitive to the fact that this adds to an existing difficult
>> >>>  >>>>> config. I propose that this is an initial step forward to support the
>> >>>  >>>>> use-case and we can enable something more composable going forward. My
>> >>>  >>>>> concern in considering this as the first step was that it felt that the
>> >>>  >>>>> composable units for data transformation and manipulation suddenly take
>> >>>  >>>>> us into a place where Stellar starts to look like Pig or Spark RDD API.
>> >>>  >>>>> I wasn't ready for that without a lot more discussion.
>> >>>  >>>>>
>> >>>  >>>>> To summarize, what I'd like to get from the community is, after
>> >>>  >>>>> reviewing the entire use-case at
>> >>>  >>>>> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>> >>>  >>>>>
>> >>>  >>>>> - Is this so confusing that it does not belong in Metron even as a
>> >>>  >>>>>   first-step?
>> >>>  >>>>> - Is there a way to extend the extractor config in a less confusing
>> >>>  >>>>>   way to enable this?
>> >>>  >>>>>
>> >>>  >>>>> I apologize for making the discuss thread *after* the JIRAs, but I felt
>> >>>  >>>>> this one might bear having some working code to consider.
>> >>>  >>>>>
>> >>>  >>>>
>> >>>  >>>
>> >>>  >>
>>
>> -------------------
>> Thank you,
>>
>> James Sirota
>> PMC- Apache Metron
>> jsirota AT apache DOT org
>>
>
>
