metron-dev mailing list archives

From Otto Fowler <ottobackwa...@gmail.com>
Subject Re: [DISCUSS] Generating and Interacting with serialized summary objects
Date Fri, 05 Jan 2018 21:08:23 GMT
Yes, abstracted.

We have an api of stellar functions that just load things from the store,
they don’t need to bleed through what the store is.
We have a ‘store’, which may be hdfs or hbase or whatever.
We have an api for adding to the store ( add<TYPE> etc ) that doesn’t
presume the store either.
Then we can have whatever long- or short-term, hard-to-configure thing to
push to the store that we can imagine.




On January 5, 2018 at 14:16:52, Michael Miklavcic (
michael.miklavcic@gmail.com) wrote:

I'm not sure I follow what you're saying as it pertains to summary objects.
Repository is a loaded term, and I'm very apprehensive of pushing for
something potentially very complex where a simpler solution would suffice
in the short term. To wit, the items I'm seeing in this use case doc -
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection
- don't preclude the 4 capabilities you've enumerated. Am I missing
something, or can you provide more context? My best guess is that rather
than referring to a specific HDFS path for a serialized object, you're
suggesting we provide a more abstract method for serializing/deserializing
objects to/from a variety of sources. Am I in the ballpark? I'd be in favor
of expanding functionality for such a thing provided a sensible default (ie
HDFS) is provided in the short-term.

On Fri, Jan 5, 2018 at 8:26 AM, Otto Fowler <ottobackwards@gmail.com>
wrote:

> If we separate the concerns as I have state previously :
>
> 1. Stellar can load objects into ‘caches’ from some repository and refer to them.
> 2. The repositories
> 3. Some number of strategies to populate and possibly update the
> repository, from spark,
> to MR jobs to whatever you would classify the flat file stuff as.
> 4. Let the Stellar API for everything but LOAD() follow after we get usage
>
> Then the particulars of ‘3’ are less important.
>
>
>
> On January 5, 2018 at 09:02:41, Justin Leet (justinjleet@gmail.com) wrote:
>
> I agree with the general sentiment that we can tailor specific use cases
> via UI, and I'm worried that the use case specific solution (particularly
> in light of the note that it's not even general to the class of bloom
> filter problems, let alone an actually general problem) becomes more work
> than this as soon as about 2 more uses cases actually get realized.
> Pushing that to the UI lets people solve a variety of problems if they
> really want to dig in, while still giving flexibility to provide a more
> tailored experience for what we discover the 80% cases are in practice.
>
> Keeping in mind I am mostly unfamiliar with the extractor config itself, I
> am wondering if it makes sense to split up the config a bit. While a lot
> of implementation details are shared, maybe the extractor config itself
> should be refactored into a couple parts analogous to ETL (as a follow on
> task, I think if this is true, it predates Casey's proposed change). It
> doesn't necessarily make it less complex, but it might make it more easily
> digestible if it's split up by idea (parsing, transformation, etc.).
>
> Re: Mike's point, I don't think we want the actual processing broken up as
> ETL, but the representation to the user in terms of configuration could be
> similar (Since we're already doing parsing and transformation). We don't
> have to implement it as an ETL pipeline, but it does potentially offer the
> user a way to quickly grasp what the JSON blob is actually specifying.
> Making it easy to understand, even if it's not the ideal way to interact, is
> potentially still a win.
>
> On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic <
> michael.miklavcic@gmail.com> wrote:
>
> > I mentioned this earlier, but I'll reiterate that I think this approach
> > gives us the ability to make specific use cases via a UI, or other
> > interface should we choose to add one, while keeping the core adaptable and
> > flexible. This is ideal for middle tier as I think this effectively gives
> > us the ability to pivot to other use cases very easily while not being so
> > generic as to be useless. The fact that you were able to create this as
> > quickly as you did seems to me directly related to the fact we made the
> > decision to keep the loader somewhat flexible rather than very specific.
> > The operation ordering and state carry from one phase of processing to the
> > next would simply have been inscrutable, if not impossible, with a CLI
> > option-only approach. Sure, it's not as simple as "put infile.txt
> > outfile.txt", but the alternatives are not that clear either. One might
> > argue we could split up the processing pieces as in traditional Hadoop, eg
> > ETL: Sqoop ingest -> HDFS -> mapreduce, pig, hive, or spark transform. But
> > quite frankly that's going in the *opposite* direction I think we want
> > here. That's more complex in terms of moving parts. The config approach
> > with pluggable Stellar insulates users from specific implementations, but
> > also gives you the ability to pass lower level constructs, eg Spark SQL or
> > HiveQL, should the need arise.
> >
> > In summary, my impressions are that at this point the features and level of
> > abstraction feel appropriate to me. I think it buys us 1) learning from a
> > starting typosquatting use case, 2) flexibility to change and adapt it
> > without affecting users, and 3) enough concrete capability to make more
> > specific use cases easy to deliver with a UI.
> >
> > Cheers,
> > Mike
> >
> > On Jan 4, 2018 9:59 AM, "Casey Stella" <cestella@gmail.com> wrote:
> >
> > > It also occurs to me that even in this situation, it's not a sufficient
> > > generalization for just Bloom, but this is a bloom filter of the output of
> > > all the typosquatted domains for the domain in each row. If we wanted
> > > to hard code, we'd have to hard code specifically the bloom filter *for*
> > > the typosquatting use-case. Hard coding this would prevent things like
> > > bloom filters containing malicious IPs from a reference source, for
> > > instance.
> > >
> > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <cestella@gmail.com>
> > wrote:
> > >
> > > > So, there is value outside of just bloom usage. The most specific example
> > > > of this would be in order to configure a bloom filter, we need to know at
> > > > least an upper bound of the number of items that are going to be added to
> > > > the bloom filter. In order to do that, we need to count the number of
> > > > typosquatted domains. Specifically at
> > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter
> > > > you can see how we use the CONSOLE writer with an extractor config to
> > > > count the number of typosquatted domains in the alexa top 10k dataset so
> > > > we can size the filter appropriately.
> > > >
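The sizing step above (count the items first, then size the filter) follows the standard Bloom filter formulas. A minimal Python sketch of that arithmetic; `bloom_parameters` is a hypothetical helper for illustration, not a Metron API:

```python
import math

def bloom_parameters(n, p):
    """Given an expected item count n and a target false-positive rate p,
    return (m, k): the number of bits and the number of hash functions
    for a classic Bloom filter."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round((m / n) * math.log(2)))              # hash functions
    return m, k

# Sizing for ~10k reference domains at a 1% false-positive rate.
m, k = bloom_parameters(10_000, 0.01)
print(m, k)
```

The point of the email stands out in the formula: both outputs depend on n, which is why the count has to happen before the filter is built.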
> > > > I'd argue that other types of probabilistic data structures could also
> > > > make sense here as well, like statistical sketches. Consider, for instance,
> > > > a cheap and dirty DGA indicator where we take the Alexa top 1M and look at
> > > > the distribution of Shannon entropy in the domains. If the Shannon entropy
> > > > of a domain going across Metron is more than 5 std devs from the mean, that
> > > > could be circumstantial evidence of a malicious attack. This would yield a
> > > > lot of false positives, but used in conjunction with other indicators it
> > > > could be valuable.
> > > >
> > > > Computing that would be as follows:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "columns" : {
> > > >       "rank" : 0,
> > > >       "domain" : 1
> > > >     },
> > > >     "value_transform" : {
> > > >       "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > >     },
> > > >     "value_filter" : "LENGTH(domain) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "CSV"
> > > > }
> > > >
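The entropy summarization in the config above can be sketched in plain Python. `string_entropy` mirrors the spirit of Stellar's STRING_ENTROPY, and `is_outlier` is the 5-std-devs rule of thumb from the email; both names are hypothetical stand-ins, not Metron functions:

```python
import math
from collections import Counter

def string_entropy(s):
    """Shannon entropy (bits per character) of a string."""
    counts = Counter(s)
    n = len(s)
    # Sum of -p * log2(p) over the character distribution.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def is_outlier(value, mean, stddev, sigmas=5.0):
    """Flag values more than `sigmas` standard deviations from the mean."""
    return stddev > 0 and abs(value - mean) > sigmas * stddev

print(string_entropy("aaaa"))  # → 0.0 (a repeated character carries no surprise)
print(string_entropy("abcd"))  # → 2.0 (four equiprobable characters)
```

In the config, STATS_INIT/STATS_ADD/STATS_MERGE would accumulate the mean and std dev of these entropies across the Alexa rows; the outlier check then runs against live traffic.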
> > > > Also, for another example, imagine a situation where we have a SPARK_SQL
> > > > engine rather than just LOCAL for summarizing. We could create a general
> > > > summary of URL lengths in bro data which could be used for determining if
> > > > someone is trying to send in very large URLs maliciously (see Jon Zeolla's
> > > > concerns in https://issues.apache.org/jira/browse/METRON-517 for a
> > > > discussion of this). In order to do that, we could simply execute:
> > > >
> > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o /tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json -p 5 -om HDFS -m SPARK_SQL
> > > >
> > > > with uri_length_extractor.json containing:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "value_filter" : "LENGTH(uri) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, LENGTH(uri))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "SQL_ROW"
> > > > }
> > > >
> > > >
> > > > Regarding value filter, that's already around in the extractor config
> > > > because of the need to transform data in the flatfile loader. While I
> > > > definitely see the desire to use unix tools to prep data, there are some
> > > > things that aren't as easy to do. For instance, here, removing the TLD of
> > > > a domain is not a trivial task in a shell script and we have existing
> > > > functions for that in Stellar. I would see people using both.
> > > >
> > > > To address the issue of a more targeted experience to bloom, I think that
> > > > sort of specialization should best exist in the UI layer. Having a more
> > > > complete and expressive backend reused across specific UIs seems to be the
> > > > best of all worlds. It allows power users to drop down and do more complex
> > > > things and still provides a (mostly) code-free and targeted experience for
> > > > users. It seems to me that limiting the expressibility in the backend
> > > > isn't the right way to go since this work just fits in with our existing
> > > > engine.
> > > >
> > > >
> > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsirota@apache.org>
> > wrote:
> > > >
> > > >> I just went through these pull requests as well and also agree this is
> > > >> good work. I think it's a good first pass. I would be careful with trying
> > > >> to boil the ocean here. I think for the initial use case I would only
> > > >> support loading the bloom filters from HDFS. If people want to pre-process
> > > >> the CSV file of domains using awk or sed this should be out of scope of
> > > >> this work. It's easy enough to do out of band and I would not include any
> > > >> of these functions at all. I also think that the config could be
> > > >> considerably simplified. I think value_filter should be removed (since I
> > > >> believe that preprocessing should be done by the user outside of this
> > > >> process). I also have a question about the init, update, and merge
> > > >> configurations. Would I ever initialize to anything but an empty bloom
> > > >> filter? For the state update would I ever do anything other than add to
> > > >> the bloom filter? For the state merge would I ever do anything other than
> > > >> merge the states? If the answer to these is 'no', then this should simply
> > > >> be hard coded and not externalized into config values.
> > > >>
> > > >> 03.01.2018, 14:20, "Michael Miklavcic" <michael.miklavcic@gmail.com>:
> > > >>
> > > >> > I just finished stepping through the typosquatting use case README in
> > > >> > your merge branch. This is really, really good work Casey. I see most of
> > > >> > our previous documentation issues addressed up front, e.g. special
> > > >> > variables are cited, all new fields explained, side effects documented.
> > > >> > The use case doc brings it all together soup-to-nuts and I think all the
> > > >> > pieces make sense in a mostly self-contained way. I can't think of
> > > >> > anything I had to sit and think about for more than a few seconds. I'll
> > > >> > be making my way through your individual PR's in more detail, but my
> > > >> > first impressions are that this is excellent.
> > > >> >
> > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <
> > > >> > michael.miklavcic@gmail.com> wrote:
> > > >> >
> > > >> >> I'm liking this design and growth strategy, Casey. I also think Nick
> > > >> >> and Otto have some valid points. I always find there's a natural
> > > >> >> tension between too little, just enough, and boiling the ocean and
> > > >> >> these discuss threads really help drive what the short and long term
> > > >> >> visions should look like.
> > > >> >>
> > > >> >> On the subject of repositories and strategies, I agree that pluggable
> > > >> >> repos and strategies for modifying them would be useful. For the first
> > > >> >> pass, I'd really like to see HDFS with the proposed set of Stellar
> > > >> >> functions. This gives us a lot of bang for our buck - we can capitalize
> > > >> >> on a set of powerful features around existence checking earlier without
> > > >> >> having to worry about later interface changes impacting users. With the
> > > >> >> primary interface coming through the JSON config, we are building a
> > > >> >> nice facade that protects users from later implementation abstractions
> > > >> >> and improvements, all while providing a stable enough interface on
> > > >> >> which we can develop UI features as desired. I'd be interested to hear
> > > >> >> more about what features could be provided by a repository as time goes
> > > >> >> by. Federation, permissions, governance, metadata management, perhaps?
> > > >> >>
> > > >> >> I also had some concern over duplicating existing Unix features. I
> > > >> >> think where I'm at has been largely addressed by Casey's comments on 1)
> > > >> >> scaling, 2) multiple variables, and 3) portability to Hadoop. Providing
> > > >> >> 2 approaches - 1 which is config-based and the other a composable set
> > > >> >> of functions gives us the ability to provide a core set of features
> > > >> >> that can later be easily expanded by users as the need arises. Here
> > > >> >> again I think the prescribed approach provides a strong first pass that
> > > >> >> we can then expand on without concern of future improvements becoming a
> > > >> >> hassle for end users.
> > > >> >>
> > > >> >> Best,
> > > >> >> Mike
> > > >> >>
> > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <
> > > >> >> simon@simonellistonball.com> wrote:
> > > >> >>
> > > >> >>> There is some really cool stuff happening here, if only I’d been
> > > >> >>> allowed to see the lists over Christmas... :)
> > > >> >>>
> > > >> >>> A few thoughts...
> > > >> >>>
> > > >> >>> I like Otto’s generalisation of the problem to include specific
> > > >> >>> local stellar objects in a cache loaded from a store (HDFS seems a
> > > >> >>> natural, but not only place, maybe even a web service / local
> > > >> >>> microservicey object provider!?) That said, I suspect that’s a good
> > > >> >>> platform optimisation approach. Should we look at this as a separate
> > > >> >>> piece of work given it extends beyond the scope of the summarisation
> > > >> >>> concept and ultimately use it as a back-end to feed the summarising
> > > >> >>> engine proposed here for the enrichment loader?
> > > >> >>>
> > > >> >>> On the more specific use case, one thing I would comment on is the
> > > >> >>> configuration approach. The iteration loop (state_{init|update|merge})
> > > >> >>> should be consistent with the way we handle things like the profiler
> > > >> >>> config, since it’s the same approach to data handling.
> > > >> >>>
> > > >> >>> The other thing that seems to have crept in here is the interface to
> > > >> >>> something like Spark, which again, I am really very very keen on
> > > >> >>> seeing happen. That said, not sure how that would happen in this
> > > >> >>> context, unless you’re talking about pushing to something like livy
> > > >> >>> for example (eminently sensible for things like cross instance caching
> > > >> >>> and faster RPC-ish access to an existing spark context), which seems
> > > >> >>> to be what Casey is driving at with the spark piece.
> > > >> >>>
> > > >> >>> To address the question of text manipulation in Stellar / metron
> > > >> >>> enrichment ingest etc, we already have this outside of the context of
> > > >> >>> the issues here. I would argue that yes, we don’t want too many paths
> > > >> >>> for this, and that maybe our parser approach might be heavily related
> > > >> >>> to text-based ingest. I would say the scope worth dealing with here
> > > >> >>> though is not really text manipulation, but summarisation, which is
> > > >> >>> not well served by existing CLI tools like awk / sed and friends.
> > > >> >>>
> > > >> >>> Simon
> > > >> >>>
> > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <nick@nickallen.org> wrote:
> > > >> >>> >
> > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I
> > > >> >>> >> think this will impact performance
> > > >> >>> >
> > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom filter?
> > > >> >>> > That seems really high, unless I am not understanding something.
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <cestella@gmail.com>
> > > >> >>> > wrote:
> > > >> >>> >
> > > >> >>> >> Thanks for the feedback, Nick.
> > > >> >>> >>
> > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for text
> > > >> >>> >> manipulation."
> > > >> >>> >>
> > > >> >>> >> I would argue that we are not reinventing the wheel for text
> > > >> >>> >> manipulation as the extractor config exists already and we are
> > > >> >>> >> doing a similar thing in the flatfile loader (in fact, the code is
> > > >> >>> >> reused and merely extended). Transformation operations are already
> > > >> >>> >> supported in our codebase in the extractor config, this PR has just
> > > >> >>> >> added some hooks for stateful operations.
> > > >> >>> >>
> > > >> >>> >> Furthermore, we will need a configuration object to pass to the
> > > >> >>> >> REST call if we are ever to create a UI around importing data into
> > > >> >>> >> hbase or creating these summary objects.
> > > >> >>> >>
> > > >> >>> >> Regarding your example:
> > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar
> > > >> >>> >> -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>
> > > >> >>> >> I'm very sympathetic to this type of extension, but it has some
> > > >> >>> >> issues:
> > > >> >>> >>
> > > >> >>> >> 1. This implies a single-threaded addition to the bloom filter.
> > > >> >>> >>    1. Even with 5 threads, it takes an hour for the full alexa 1m,
> > > >> >>> >>       so I think this will impact performance
> > > >> >>> >>    2. There's not a way to specify how to merge across threads if
> > > >> >>> >>       we do make a multithread command line option
> > > >> >>> >> 2. This restricts these kinds of operations to roles with heavy
> > > >> >>> >>    unix CLI knowledge, which isn't often the types of people who
> > > >> >>> >>    would be doing this type of operation
> > > >> >>> >> 3. What if we need two variables passed to stellar?
> > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually we
> > > >> >>> >>    will want to support data on HDFS being processed by Hadoop
> > > >> >>> >>    (similar to the flatfile loader), so instead of -m LOCAL being
> > > >> >>> >>    passed for the flatfile summarizer you'd pass -m SPARK and the
> > > >> >>> >>    processing would happen on the cluster
> > > >> >>> >>    1. This is particularly relevant in this case as it's an
> > > >> >>> >>       embarrassingly parallel problem in general
> > > >> >>> >>
> > > >> >>> >> In summary, while a CLI approach is attractive, I prefer the
> > > >> >>> >> extractor config solution because it is the solution with the
> > > >> >>> >> smallest iteration that:
> > > >> >>> >>
> > > >> >>> >> 1. Reuses existing metron extraction infrastructure
> > > >> >>> >> 2. Provides the most solid base for the extensions that will be
> > > >> >>> >>    sorely needed soon (and will keep it in parity with the flatfile
> > > >> >>> >>    loader)
> > > >> >>> >> 3. Provides the most solid base for a future UI extension in the
> > > >> >>> >>    management UI to support both summarization and loading
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <nick@nickallen.org>
> > > >> >>> >> wrote:
> > > >> >>> >>
> > > >> >>> >>> First off, I really do like the typosquatting use case and a lot
> > > >> >>> >>> of what you have described.
> > > >> >>> >>>
> > > >> >>> >>>> We need a way to generate the summary sketches from flat data
> > > >> >>> >>>> for this to work. ..
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>> I took this quote directly from your use case. Above is the
> > > >> >>> >>> point that I'd like to discuss and what your proposed solutions
> > > >> >>> >>> center on. This is what I think you are trying to do, at least
> > > >> >>> >>> with PR #879 <https://github.com/apache/metron/pull/879>...
> > > >> >>> >>>
> > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can operate
> > > >> >>> >>> on text stored in a file system?
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> Whether we use the (1) Configuration or the (2) Function-based
> > > >> >>> >>> approach that you described, fundamentally we are introducing new
> > > >> >>> >>> ways to perform text manipulation inside of Stellar.
> > > >> >>> >>>
> > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text manipulation.
> > > >> >>> >>> It would be painful to implement and maintain a bunch of Stellar
> > > >> >>> >>> functions for text manipulation. People already have a large
> > > >> >>> >>> number of tools available to do this and everyone has their
> > > >> >>> >>> favorites. People are resistant to learning something new when
> > > >> >>> >>> they already are familiar with another way to do the same thing.
> > > >> >>> >>>
> > > >> >>> >>> So then the question is, how else can we do this? My suggestion
> > > >> >>> >>> is that rather than introducing text manipulation tools inside of
> > > >> >>> >>> Stellar, we allow people to use the text manipulation tools they
> > > >> >>> >>> already know, but with the Stellar functions that we already have.
> > > >> >>> >>> And the obvious way to tie those two things together is the Unix
> > > >> >>> >>> pipeline.
> > > >> >>> >>>
> > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh this out a
> > > >> >>> >>> bit more based on the example you have in PR #879
> > > >> >>> >>> <https://github.com/apache/metron/pull/879>. This would allow me
> > > >> >>> >>> to integrate Stellar with whatever external tools that I want.
> > > >> >>> >>>
> > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar
> > > >> >>> >>> -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <cestella@gmail.com>
> > > >> >>> >>> wrote:
> > > >> >>> >>>
> > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd step
> > > >> >>> >>>> that is more adaptable. I propose the following set of stellar
> > > >> >>> >>>> functions backed by Spark in the metron-management project:
> > > >> >>> >>>>
> > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> > > >> >>> >>>>   Dataframe for reading the flatfile
> > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > > >> >>> >>>>   dataframe
> > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the
> > > >> >>> >>>>   dataframe using the lambda functions:
> > > >> >>> >>>>   - state_init - executed once per worker to initialize the state
> > > >> >>> >>>>   - state_update - executed once per row
> > > >> >>> >>>>   - state_merge - Merge the worker states into one worker state
> > > >> >>> >>>> - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
> > > >> >>> >>>>   output_path on HDFS.
> > > >> >>> >>>>
> > > >> >>> >>>> This would enable more flexibility and composability than the
> > > >> >>> >>>> configuration-based approach that we have in the flatfile loader.
> > > >> >>> >>>> My concern with this approach, and the reason I didn't do it
> > > >> >>> >>>> initially, was that I think that users will want at least 2 ways
> > > >> >>> >>>> to summarize data (or load data):
> > > >> >>> >>>>
> > > >> >>> >>>> - A configuration based approach, which enables a UI
> > > >> >>> >>>> - A set of stellar functions via the scriptable REPL
> > > >> >>> >>>>
> > > >> >>> >>>> I would argue that both have a place and I started with the
> > > >> >>> >>>> configuration based approach as it was a more natural extension
> > > >> >>> >>>> of what we already had. I'd love to hear thoughts about this idea
> > > >> >>> >>>> too.
> > > >> >>> >>>>
> > > >> >>> >>>>
> > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <cestella@gmail.com>
> > > >> >>> >>>> wrote:
> > > >> >>> >>>>
> > > >> >>> >>>>> Hi all,
> > > >> >>> >>>>>
> > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for
> > > >> >>> >>>>> something. It occurred to me the other day when considering the
> > > >> >>> >>>>> use-case of detecting typosquatted domains, that one approach
> > > >> >>> >>>>> was to generate the set of typosquatted domains for some set of
> > > >> >>> >>>>> reference domains and compare domains as they flow through.
> > > >> >>> >>>>>
> > > >> >>> >>>>> One way we could do this would be to generate this data and
> > > >> >>> >>>>> import the typosquatted domains into HBase. I thought, however,
> > > >> >>> >>>>> of another approach which may trade off accuracy to remove the
> > > >> >>> >>>>> network hop and potential disk seek: constructing a bloom
> > > >> >>> >>>>> filter that includes the set of typosquatted domains.
> > > >> >>> >>>>>
> > > >> >>> >>>>> The challenge was that we don't have a way to do this
> > > >> >>> >>>>> currently. We do, however, have a loading infrastructure (e.g.
> > > >> >>> >>>>> the flatfile_loader) and configuration (see
> > > >> >>> >>>>> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > >> >>> >>>>> which handles:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - parsing flat files
> > > >> >>> >>>>> - transforming the rows
> > > >> >>> >>>>> - filtering the rows
> > > >> >>> >>>>>
> > > >> >>> >>>>> To enable the new use-case of generating a summary object
> > > >> >>> >>>>> (e.g. a bloom filter), in METRON-1378
> > > >> >>> >>>>> (https://github.com/apache/metron/pull/879) I propose that we
> > > >> >>> >>>>> create a new utility that uses the same extractor config and
> > > >> >>> >>>>> adds the ability to:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - initialize a state object
> > > >> >>> >>>>> - update the object for every row
> > > >> >>> >>>>> - merge the state objects (in the case of multiple threads; in
> > > >> >>> >>>>>   the case of one thread it's not needed).
> > > >> >>> >>>>>
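The init/update/merge contract described above amounts to a parallel fold. A minimal Python sketch under assumed semantics; the three hook functions here are hypothetical stand-ins for what an extractor config would supply, not the actual flatfile utility:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three extractor hooks: state_init runs once
# per worker, state_update once per row, state_merge folds worker states.
def state_init():
    return set()                     # e.g. an empty summary object

def state_update(state, row):
    state.add(row.lower())           # e.g. BLOOM_ADD-style accumulation
    return state

def state_merge(states):
    merged = state_init()
    for s in states:
        merged |= s                  # union the per-worker summaries
    return merged

def summarize(rows, workers=4):
    """Partition rows across workers, fold each partition, merge the results."""
    chunks = [rows[i::workers] for i in range(workers)]

    def fold(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=workers) as ex:
        return state_merge(ex.map(fold, chunks))

print(sorted(summarize(["A.com", "b.com", "a.com"])))  # → ['a.com', 'b.com']
```

With one worker the merge step degenerates to returning the single state, which is why the email notes it is unnecessary in the single-threaded case.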
> > > >> >>> >>>>> I think this is a sensible decision because:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - It's a minimal movement from the flat file loader
> > > >> >>> >>>>> - Uses the same configs
> > > >> >>> >>>>> - Abstracts and reuses the existing infrastructure
> > > >> >>> >>>>> - Having one extractor config means that it should be easier to
> > > >> >>> >>>>>   generate a UI around this to simplify the experience
> > > >> >>> >>>>>
> > > >> >>> >>>>> All that being said, our extractor config is... shall we
> > > >> >>> >>>>> say... daunting :). I am sensitive to the fact that this adds to
> > > >> >>> >>>>> an existing difficult config. I propose that this is an initial
> > > >> >>> >>>>> step forward to support the use-case and we can enable something
> > > >> >>> >>>>> more composable going forward. My concern in considering this as
> > > >> >>> >>>>> the first step was that it felt that the composable units for
> > > >> >>> >>>>> data transformation and manipulation suddenly take us into a
> > > >> >>> >>>>> place where Stellar starts to look like Pig or the Spark RDD
> > > >> >>> >>>>> API. I wasn't ready for that without a lot more discussion.
> > > >> >>> >>>>>
> > > >> >>> >>>>> To summarize, what I'd like to get from the community is,
> > > >> >>> >>>>> after reviewing the entire use-case at
> > > >> >>> >>>>> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - Is this so confusing that it does not belong in Metron even
> > > >> >>> >>>>>   as a first-step?
> > > >> >>> >>>>> - Is there a way to extend the extractor config in a less
> > > >> >>> >>>>>   confusing way to enable this?
> > > >> >>> >>>>>
> > > >> >>> >>>>> I apologize for making the discuss thread *after* the JIRAs,
> > > >> >>> >>>>> but I felt this one might bear having some working code to
> > > >> >>> >>>>> consider.
> > > >> >>> >>>>>
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>
> > > >>
> > > >> -------------------
> > > >> Thank you,
> > > >>
> > > >> James Sirota
> > > >> PMC- Apache Metron
> > > >> jsirota AT apache DOT org
> > > >>
> > > >
> > > >
> > >
> >
>
