metron-dev mailing list archives

From Otto Fowler <ottobackwa...@gmail.com>
Subject Re: [DISCUSS] Generating and Interacting with serialized summary objects
Date Fri, 05 Jan 2018 17:04:53 GMT
I would imagine the ‘stellar-object-repo’ would be part of the global
configuration or configuration passed to the command.
Why specify it in the function itself?




On January 5, 2018 at 11:22:32, Casey Stella (cestella@gmail.com) wrote:

I like that, specifically the repositories abstraction. Perhaps we can
construct some longer term JIRAs for extensions.
For the current state of affairs (wrt the OBJECT_GET call), I was
imagining the simple default HDFS solution as a first cut, followed by
adding a repository name (e.g. OBJECT_GET(path, repo_name)),
with repo_name being optional and defaulting to HDFS
for backwards compatibility.

In effect, this would be the next step that I'm proposing: OBJECT_GET(paths,
repo_name, repo_config), which would be backwards compatible.

- paths - a single path or a list of paths (if a list, then a list of
objects returned)
- repo_name - optional name for repo, defaulted to HDFS if we don't
specify
- repo_config - optional config map


This would open things like:

- OBJECT_GET('key', 'HBASE', { 'hbase.table' : 'table', 'hbase.cf' :
'cf'} ) -- pulling from HBase
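
To make the proposed OBJECT_GET(paths, repo_name, repo_config) contract concrete, here is a minimal Python sketch of the dispatch it implies. It is only an illustration under assumptions: the registry, `register_repo`, and the stand-in loaders are hypothetical names invented here, and nothing below is Metron's actual implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Union

# Hypothetical loader signature: (path, repo_config) -> loaded object.
Loader = Callable[[str, Dict[str, Any]], Any]

# Hypothetical registry of repositories, keyed by upper-cased repo name.
REPOSITORIES: Dict[str, Loader] = {}


def register_repo(name: str, loader: Loader) -> None:
    """Register a loader under a repository name such as 'HDFS' or 'HBASE'."""
    REPOSITORIES[name.upper()] = loader


def object_get(paths: Union[str, List[str]],
               repo_name: str = "HDFS",
               repo_config: Optional[Dict[str, Any]] = None) -> Any:
    """Load one object, or a list of objects when given a list of paths.

    repo_name defaults to HDFS so that an existing single-argument call
    keeps its current meaning.
    """
    loader = REPOSITORIES[repo_name.upper()]
    config = repo_config or {}
    if isinstance(paths, list):
        return [loader(path, config) for path in paths]
    return loader(paths, config)


# Stand-in loaders that only describe what they would fetch.
register_repo("HDFS", lambda path, cfg: f"hdfs:{path}")
register_repo(
    "HBASE",
    lambda key, cfg: f"hbase:{cfg['hbase.table']}:{cfg['hbase.cf']}:{key}",
)
```

With this shape, `object_get('/some/path')` behaves as today, and `object_get('key', 'HBASE', {'hbase.table': 'table', 'hbase.cf': 'cf'})` mirrors the HBase example above.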

Eventually we might also be able to fold ENRICHMENT_GET as just a special
repo instance.

On Fri, Jan 5, 2018 at 10:26 AM, Otto Fowler <ottobackwards@gmail.com> wrote:

> If we separate the concerns as I have stated previously:
>
> 1. Stellar can load objects into ‘caches’ from some repository and refer
> to them.
> 2. The repositories
> 3. Some number of strategies to populate and possibly update the
> repository, from Spark, to MR jobs, to whatever you would classify the
> flat file stuff as.
> 4. Let the Stellar API for everything but LOAD() follow after we get
> usage
>
> Then the particulars of ‘3’ are less important.
>
>
>
> On January 5, 2018 at 09:02:41, Justin Leet (justinjleet@gmail.com) wrote:
>
> I agree with the general sentiment that we can tailor specific use cases
> via UI, and I'm worried that the use case specific solution (particularly
> in light of the note that it's not even general to the class of bloom
> filter problems, let alone an actually general problem) becomes more work
> than this as soon as about 2 more use cases actually get realized.
> Pushing that to the UI lets people solve a variety of problems if they
> really want to dig in, while still giving flexibility to provide a more
> tailored experience for what we discover the 80% cases are in practice.
>
> Keeping in mind I am mostly unfamiliar with the extractor config itself, I
> am wondering if it makes sense to split up the config a bit. While a lot
> of implementation details are shared, maybe the extractor config itself
> should be refactored into a couple parts analogous to ETL (as a follow on
> task, I think if this is true, it predates Casey's proposed change). It
> doesn't necessarily make it less complex, but it might make it more easily
> digestible if it's split up by idea (parsing, transformation, etc.).
>
> Re: Mike's point, I don't think we want the actual processing broken up as
> ETL, but the representation to the user in terms of configuration could be
> similar (Since we're already doing parsing and transformation). We don't
> have to implement it as an ETL pipeline, but it does potentially offer the
> user a way to quickly grasp what the JSON blob is actually specifying.
> Making it easy to understand, even if it's not the ideal way to interact is
> potentially still a win.
>
> On Thu, Jan 4, 2018 at 1:28 PM, Michael Miklavcic <michael.miklavcic@gmail.com> wrote:
>
> > I mentioned this earlier, but I'll reiterate that I think this approach
> > gives us the ability to make specific use cases via a UI, or other
> > interface should we choose to add one, while keeping the core adaptable and
> > flexible. This is ideal for middle tier as I think this effectively gives
> > us the ability to pivot to other use cases very easily while not being so
> > generic as to be useless. The fact that you were able to create this as
> > quickly as you did seems to me directly related to the fact we made the
> > decision to keep the loader somewhat flexible rather than very specific.
> > The operation ordering and state carry from one phase of processing to the
> > next would simply have been inscrutable, if not impossible, with a CLI
> > option-only approach. Sure, it's not as simple as "put infile.txt
> > outfile.txt", but the alternatives are not that clear either. One might
> > argue we could split up the processing pieces as in traditional Hadoop, eg
> > ETL: Sqoop ingest -> HDFS -> mapreduce, pig, hive, or spark transform. But
> > quite frankly that's going in the *opposite* direction I think we want
> > here. That's more complex in terms of moving parts. The config approach
> > with pluggable Stellar insulates users from specific implementations, but
> > also gives you the ability to pass lower level constructs, eg Spark SQL or
> > HiveQL, should the need arise.
> >
> > In summary, my impressions are that at this point the features and level of
> > abstraction feel appropriate to me. I think it buys us 1) learning from a
> > starting typosquatting use case, 2) flexibility to change and adapt it
> > without affecting users, and 3) enough concrete capability to make more
> > specific use cases easy to deliver with a UI.
> >
> > Cheers,
> > Mike
> >
> > On Jan 4, 2018 9:59 AM, "Casey Stella" <cestella@gmail.com> wrote:
> >
> > > It also occurs to me that even in this situation, it's not a sufficient
> > > generalization for just Bloom, but this is a bloom filter of the output of
> > > all the typosquatted domains for the domain in each row. If we wanted
> > > to hard code, we'd have to hard code specifically the bloom filter *for*
> > > the typosquatting use-case. Hard coding this would prevent things like
> > > bloom filters containing malicious IPs from a reference source, for
> > > instance.
> > >
> > > On Thu, Jan 4, 2018 at 10:46 AM, Casey Stella <cestella@gmail.com> wrote:
> > >
> > > > So, there is value outside of just bloom usage. The most specific example
> > > > of this would be that, in order to configure a bloom filter, we need to
> > > > know at least an upper bound of the number of items that are going to be
> > > > added to the bloom filter. In order to do that, we need to count the
> > > > number of typosquatted domains. Specifically at
> > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#configure-the-bloom-filter
> > > > you can see how we use the CONSOLE writer with an extractor config to count
> > > > the number of typosquatted domains in the alexa top 10k dataset so we can
> > > > size the filter appropriately.
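
As a rough illustration of why that upper bound matters, the sketch below applies the standard textbook bloom filter sizing formulas (bits m = -n ln p / (ln 2)^2, hash count k = (m / n) ln 2). This is an assumption-laden example rather than Metron's code: the function name `bloom_parameters` is hypothetical, and the thread does not say which false positive rate or sizing rule the Metron bloom filter uses.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float):
    """Return (bits, hash_count) for a classic bloom filter.

    expected_items is the upper bound on insertions (the count discussed
    above); false_positive_rate is the acceptable error probability.
    """
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2, rounded up to whole bits
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, with at least one hash function
    hash_count = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_count
```

Underestimating the item count makes the real false positive rate climb above the target, which is why counting the typosquatted domains first is worthwhile.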
> > > >
> > > > I'd argue that other types of probabilistic data structures could also
> > > > make sense here, like statistical sketches. Consider, for instance,
> > > > a cheap and dirty DGA indicator where we take the Alexa top 1M and look at
> > > > the distribution of shannon entropy in the domains. If the shannon entropy
> > > > of a domain going across metron is more than 5 std devs from the mean, that
> > > > could be circumstantial evidence of a malicious attack. This would yield a
> > > > lot of false positives, but used in conjunction with other indicators it
> > > > could be valuable.
> > > >
> > > > Computing that would be as follows:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "columns" : {
> > > >       "rank" : 0,
> > > >       "domain" : 1
> > > >     },
> > > >     "value_transform" : {
> > > >       "domain" : "DOMAIN_REMOVE_TLD(domain)"
> > > >     },
> > > >     "value_filter" : "LENGTH(domain) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, STRING_ENTROPY(domain))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "CSV"
> > > > }
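
For readers who want to see what the init/update/merge steps in this config compute, here is a small Python sketch. It is not how Stellar's STATS_* or STRING_ENTROPY functions are implemented: the `Stats` class, `string_entropy`, and `is_outlier` are hypothetical stand-ins, the entropy is assumed to be base-2 Shannon entropy over characters, and the running statistics use Welford's update with the pairwise merge of Chan et al. rather than whatever sketch Metron uses internally.

```python
import math
from collections import Counter
from dataclasses import dataclass


def string_entropy(text: str) -> float:
    """Base-2 Shannon entropy of the character distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


@dataclass
class Stats:
    """Mergeable running mean/variance (stand-in for a STATS_INIT() state)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, value: float) -> "Stats":
        # Welford's single-pass update (the per-row state_update step).
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return self

    def merge(self, other: "Stats") -> "Stats":
        # Pairwise merge of two partial states (the state_merge step).
        if other.count == 0:
            return Stats(self.count, self.mean, self.m2)
        if self.count == 0:
            return Stats(other.count, other.mean, other.m2)
        total = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return Stats(total, mean, m2)

    def stddev(self) -> float:
        # Population standard deviation of everything added so far.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def is_outlier(stats: Stats, domain: str, threshold: float = 5.0) -> bool:
    """True when the domain's entropy is more than threshold std devs from the mean."""
    deviation = stats.stddev()
    if deviation == 0.0:
        return False
    return abs(string_entropy(domain) - stats.mean) > threshold * deviation
```

The merge step is what lets each thread (or, later, each Spark worker) build its own state independently and combine them at the end.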
> > > >
> > > > Also, for another example, imagine a situation where we have a SPARK_SQL
> > > > engine rather than just LOCAL for summarizing. We could create a general
> > > > summary of URL lengths in bro data which could be used for determining if
> > > > someone is trying to send in very large URLs maliciously (see Jon Zeolla's
> > > > concerns in https://issues.apache.org/jira/browse/METRON-517 for a
> > > > discussion of this). In order to do that, we could simply execute:
> > > >
> > > > $METRON_HOME/bin/flatfile_summarizer.sh -i "select uri from bro" -o /tmp/reference/bro_uri_distribution.ser -e ~/uri_length_extractor.json -p 5 -om HDFS -m SPARK_SQL
> > > >
> > > > with uri_length_extractor.json containing:
> > > >
> > > > {
> > > >   "config" : {
> > > >     "value_filter" : "LENGTH(uri) > 0",
> > > >     "state_init" : "STATS_INIT()",
> > > >     "state_update" : {
> > > >       "state" : "STATS_ADD(state, LENGTH(uri))"
> > > >     },
> > > >     "state_merge" : "STATS_MERGE(states)",
> > > >     "separator" : ","
> > > >   },
> > > >   "extractor" : "SQL_ROW"
> > > > }
> > > >
> > > >
> > > > Regarding value filter, that's already around in the extractor config
> > > > because of the need to transform data in the flatfile loader. While I
> > > > definitely see the desire to use unix tools to prep data, there are some
> > > > things that aren't as easy to do. For instance, here, removing the TLD of
> > > > a domain is not a trivial task in a shell script and we have existing
> > > > functions for that in Stellar. I would see people using both.
> > > >
> > > > To address the issue of a more targeted experience to bloom, I think that
> > > > sort of specialization should best exist in the UI layer. Having a more
> > > > complete and expressive backend reused across specific UIs seems to be the
> > > > best of all worlds. It allows power users to drop down and do more complex
> > > > things and still provides a (mostly) code-free and targeted experience for
> > > > users. It seems to me that limiting the expressibility in the backend
> > > > isn't the right way to go since this work just fits in with our existing
> > > > engine.
> > > >
> > > >
> > > > On Thu, Jan 4, 2018 at 1:40 AM, James Sirota <jsirota@apache.org> wrote:
> > > >
> > > >> I just went through these pull requests as well and also agree this is
> > > >> good work. I think it's a good first pass. I would be careful with trying
> > > >> to boil the ocean here. I think for the initial use case I would only
> > > >> support loading the bloom filters from HDFS. If people want to pre-process
> > > >> the CSV file of domains using awk or sed this should be out of scope of
> > > >> this work. It's easy enough to do out of band and I would not include any
> > > >> of these functions at all. I also think that the config could be
> > > >> considerably simplified. I think value_filter should be removed (since I
> > > >> believe that preprocessing should be done by the user outside of this
> > > >> process). I also have a question about the init, update, and merge
> > > >> configurations. Would I ever initialize to anything but an empty bloom
> > > >> filter? For the state update would I ever do anything other than add to
> > > >> the bloom filter? For the state merge would I ever do anything other than
> > > >> merge the states? If the answer to these is 'no', then this should simply
> > > >> be hard coded and not externalized into config values.
> > > >>
> > > >> 03.01.2018, 14:20, "Michael Miklavcic" <michael.miklavcic@gmail.com>:
>
> > > >> > I just finished stepping through the typosquatting use case README in your
> > > >> > merge branch. This is really, really good work Casey. I see most of our
> > > >> > previous documentation issues addressed up front, e.g. special variables
> > > >> > are cited, all new fields explained, side effects documented. The use case
> > > >> > doc brings it all together soup-to-nuts and I think all the pieces make
> > > >> > sense in a mostly self-contained way. I can't think of anything I had to
> > > >> > sit and think about for more than a few seconds. I'll be making my way
> > > >> > through your individual PR's in more detail, but my first impressions are
> > > >> > that this is excellent.
> > > >> >
> > > >> > On Wed, Jan 3, 2018 at 12:43 PM, Michael Miklavcic <michael.miklavcic@gmail.com> wrote:
> > > >> >
> > > >> >> I'm liking this design and growth strategy, Casey. I also think Nick and
> > > >> >> Otto have some valid points. I always find there's a natural tension
> > > >> >> between too little, just enough, and boiling the ocean and these discuss
> > > >> >> threads really help drive what the short and long term visions should look
> > > >> >> like.
> > > >> >>
> > > >> >> On the subject of repositories and strategies, I agree that pluggable
> > > >> >> repos and strategies for modifying them would be useful. For the first
> > > >> >> pass, I'd really like to see HDFS with the proposed set of Stellar
> > > >> >> functions. This gives us a lot of bang for our buck - we can capitalize on
> > > >> >> a set of powerful features around existence checking earlier without having
> > > >> >> to worry about later interface changes impacting users. With the primary
> > > >> >> interface coming through the JSON config, we are building a nice facade
> > > >> >> that protects users from later implementation abstractions and
> > > >> >> improvements, all while providing a stable enough interface on which we can
> > > >> >> develop UI features as desired. I'd be interested to hear more about what
> > > >> >> features could be provided by a repository as time goes by. Federation,
> > > >> >> permissions, governance, metadata management, perhaps?
> > > >> >>
> > > >> >> I also had some concern over duplicating existing Unix features. I think
> > > >> >> where I'm at has been largely addressed by Casey's comments on 1) scaling,
> > > >> >> 2) multiple variables, and 3) portability to Hadoop. Providing 2 approaches
> > > >> >> - 1 which is config-based and the other a composable set of functions -
> > > >> >> gives us the ability to provide a core set of features that can later be
> > > >> >> easily expanded by users as the need arises. Here again I think the
> > > >> >> prescribed approach provides a strong first pass that we can then expand on
> > > >> >> without concern of future improvements becoming a hassle for end users.
> > > >> >>
> > > >> >> Best,
> > > >> >> Mike
> > > >> >>
> > > >> >> On Wed, Jan 3, 2018 at 10:25 AM, Simon Elliston Ball <simon@simonellistonball.com> wrote:
> > > >> >>
> > > >> >>> There is some really cool stuff happening here, if only I’d been allowed
> > > >> >>> to see the lists over Christmas... :)
> > > >> >>>
> > > >> >>> A few thoughts...
> > > >> >>>
> > > >> >>> I like Otto’s generalisation of the problem to include specific local
> > > >> >>> stellar objects in a cache loaded from a store (HDFS seems a natural, but
> > > >> >>> not only place, maybe even a web service / local microservicey object
> > > >> >>> provider!?) That said, I suspect that’s a good platform optimisation
> > > >> >>> approach. Should we look at this as a separate piece of work given it
> > > >> >>> extends beyond the scope of the summarisation concept and ultimately use it
> > > >> >>> as a back-end to feed the summarising engine proposed here for the
> > > >> >>> enrichment loader?
> > > >> >>>
> > > >> >>> On the more specific use case, one thing I would comment on is the
> > > >> >>> configuration approach. The iteration loop (state_{init|update|merge})
> > > >> >>> should be consistent with the way we handle things like the profiler
> > > >> >>> config, since it’s the same approach to data handling.
> > > >> >>>
> > > >> >>> The other thing that seems to have crept in here is the interface to
> > > >> >>> something like Spark, which again, I am really very very keen on seeing
> > > >> >>> happen. That said, not sure how that would happen in this context, unless
> > > >> >>> you’re talking about pushing to something like livy for example (eminently
> > > >> >>> sensible for things like cross instance caching and faster RPC-ish access
> > > >> >>> to an existing spark context, which seem to be what Casey is driving at
> > > >> >>> with the spark piece).
> > > >> >>>
> > > >> >>> To address the question of text manipulation in Stellar / metron
> > > >> >>> enrichment ingest etc, we already have this outside of the context of the
> > > >> >>> issues here. I would argue that yes, we don’t want too many paths for this,
> > > >> >>> and that maybe our parser approach might be heavily related to text-based
> > > >> >>> ingest. I would say the scope worth dealing with here though is not really
> > > >> >>> text manipulation, but summarisation, which is not well served by existing
> > > >> >>> CLI tools like awk / sed and friends.
> > > >> >>>
> > > >> >>> Simon
> > > >> >>>
> > > >> >>> > On 3 Jan 2018, at 15:48, Nick Allen <nick@nickallen.org> wrote:
> > > >> >>> >
> > > >> >>> >> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
> > > >> >>> >> this will impact performance
> > > >> >>> >
> > > >> >>> > What exactly takes an hour? Adding 1M entries to a bloom filter? That
> > > >> >>> > seems really high, unless I am not understanding something.
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <cestella@gmail.com> wrote:
> > > >> >>> >
> > > >> >>> >> Thanks for the feedback, Nick.
> > > >> >>> >>
> > > >> >>> >> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
> > > >> >>> >>
> > > >> >>> >> I would argue that we are not reinventing the wheel for text manipulation
> > > >> >>> >> as the extractor config exists already and we are doing a similar thing in
> > > >> >>> >> the flatfile loader (in fact, the code is reused and merely extended).
> > > >> >>> >> Transformation operations are already supported in our codebase in the
> > > >> >>> >> extractor config, this PR has just added some hooks for stateful
> > > >> >>> >> operations.
> > > >> >>> >>
> > > >> >>> >> Furthermore, we will need a configuration object to pass to the REST call
> > > >> >>> >> if we are ever to create a UI around importing data into hbase or creating
> > > >> >>> >> these summary objects.
> > > >> >>> >>
> > > >> >>> >> Regarding your example:
> > > >> >>> >> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>
> > > >> >>> >> I'm very sympathetic to this type of extension, but it has some issues:
> > > >> >>> >> 1. This implies a single-threaded addition to the bloom filter.
> > > >> >>> >>    1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
> > > >> >>> >>       think this will impact performance
> > > >> >>> >>    2. There's not a way to specify how to merge across threads if we do
> > > >> >>> >>       make a multithread command line option
> > > >> >>> >> 2. This restricts these kinds of operations to roles with heavy unix CLI
> > > >> >>> >>    knowledge, which isn't often the types of people who would be doing this
> > > >> >>> >>    type of operation
> > > >> >>> >> 3. What if we need two variables passed to stellar?
> > > >> >>> >> 4. This approach will be harder to move to Hadoop. Eventually we will
> > > >> >>> >>    want to support data on HDFS being processed by Hadoop (similar to
> > > >> >>> >>    flatfile loader), so instead of -m LOCAL being passed for the flatfile
> > > >> >>> >>    summarizer you'd pass -m SPARK and the processing would happen on the
> > > >> >>> >>    cluster
> > > >> >>> >>    1. This is particularly relevant in this case as it's an
> > > >> >>> >>       embarrassingly parallel problem in general
> > > >> >>> >> In summary, while this CLI approach is attractive, I prefer the extractor
> > > >> >>> >> config solution because it is the solution with the smallest iteration
> > > >> >>> >> that:
> > > >> >>> >>
> > > >> >>> >> 1. Reuses existing metron extraction infrastructure
> > > >> >>> >> 2. Provides the most solid base for the extensions that will be sorely
> > > >> >>> >>    needed soon (and will keep it in parity with the flatfile loader)
> > > >> >>> >> 3. Provides the most solid base for a future UI extension in the
> > > >> >>> >>    management UI to support both summarization and loading
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >>
> > > >> >>> >> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <nick@nickallen.org> wrote:
> > > >> >>> >>
> > > >> >>> >>> First off, I really do like the typosquatting use case and a lot of what
> > > >> >>> >>> you have described.
> > > >> >>> >>>
> > > >> >>> >>>> We need a way to generate the summary sketches from flat data for this to
> > > >> >>> >>>> work.
> > > >> >>> >>>> ..
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>> I took this quote directly from your use case. Above is the point that I'd
> > > >> >>> >>> like to discuss and what your proposed solutions center on. This is what I
> > > >> >>> >>> think you are trying to do, at least with PR #879
> > > >> >>> >>> <https://github.com/apache/metron/pull/879>...
> > > >> >>> >>>
> > > >> >>> >>> (Q) Can we repurpose Stellar functions so that they can operate on text
> > > >> >>> >>> stored in a file system?
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> Whether we use the (1) Configuration or the (2) Function-based approach
> > > >> >>> >>> that you described, fundamentally we are introducing new ways to perform
> > > >> >>> >>> text manipulation inside of Stellar.
> > > >> >>> >>>
> > > >> >>> >>> IMHO, I'd rather not reinvent the wheel for text manipulation. It would be
> > > >> >>> >>> painful to implement and maintain a bunch of Stellar functions for text
> > > >> >>> >>> manipulation. People already have a large number of tools available to do
> > > >> >>> >>> this and everyone has their favorites. People are resistant to learning
> > > >> >>> >>> something new when they already are familiar with another way to do the
> > > >> >>> >>> same thing.
> > > >> >>> >>>
> > > >> >>> >>> So then the question is, how else can we do this? My suggestion is that
> > > >> >>> >>> rather than introducing text manipulation tools inside of Stellar, we allow
> > > >> >>> >>> people to use the text manipulation tools they already know, but with the
> > > >> >>> >>> Stellar functions that we already have. And the obvious way to tie those
> > > >> >>> >>> two things together is the Unix pipeline.
> > > >> >>> >>>
> > > >> >>> >>> A quick, albeit horribly incomplete, example to flesh this out a bit more
> > > >> >>> >>> based on the example you have in PR #879
> > > >> >>> >>> <https://github.com/apache/metron/pull/879>. This would allow me to
> > > >> >>> >>> integrate Stellar with whatever external tools that I want.
> > > >> >>> >>>
> > > >> >>> >>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>>
> > > >> >>> >>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <cestella@gmail.com> wrote:
> > > >> >>> >>>
> > > >> >>> >>>> I'll start this discussion off with my idea around a 2nd step that is more
> > > >> >>> >>>> adaptable. I propose the following set of stellar functions backed by
> > > >> >>> >>>> Spark in the metron-management project:
> > > >> >>> >>>>
> > > >> >>> >>>> - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> > > >> >>> >>>>   Dataframe for reading the flatfile
> > > >> >>> >>>> - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > > >> >>> >>>>   dataframe
> > > >> >>> >>>> - SUMMARIZE(state_init, state_update, state_merge): Summarize the
> > > >> >>> >>>>   dataframe using the lambda functions:
> > > >> >>> >>>>   - state_init - executed once per worker to initialize the state
> > > >> >>> >>>>   - state_update - executed once per row
> > > >> >>> >>>>   - state_merge - Merge the worker states into one worker state
> > > >> >>> >>>> - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
> > > >> >>> >>>>   output_path on HDFS.
> > > >> >>> >>>>
> > > >> >>> >>>> This would enable more flexibility and composability than the
> > > >> >>> >>>> configuration-based approach that we have in the flatfile loader.
> > > >> >>> >>>> My concern with this approach, and the reason I didn't do it initially, was
> > > >> >>> >>>> that I think that users will want at least 2 ways to summarize data (or
> > > >> >>> >>>> load data):
> > > >> >>> >>>>
> > > >> >>> >>>> - A configuration based approach, which enables a UI
> > > >> >>> >>>> - A set of stellar functions via the scriptable REPL
> > > >> >>> >>>>
> > > >> >>> >>>> I would argue that both have a place and I started with the configuration
> > > >> >>> >>>> based approach as it was a more natural extension of what we already had.
> > > >> >>> >>>> I'd love to hear thoughts about this idea too.
> > > >> >>> >>>>
> > > >> >>> >>>>
> > > >> >>> >>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <cestella@gmail.com> wrote:
> > > >> >>> >>>>
> > > >> >>> >>>>> Hi all,
> > > >> >>> >>>>>
> > > >> >>> >>>>> I wanted to get some feedback on a sensible plan for something. It
> > > >> >>> >>>>> occurred to me the other day when considering the use-case of detecting
> > > >> >>> >>>>> typosquatted domains, that one approach was to generate the set of
> > > >> >>> >>>>> typosquatted domains for some set of reference domains and compare domains
> > > >> >>> >>>>> as they flow through.
> > > >> >>> >>>>>
> > > >> >>> >>>>> One way we could do this would be to generate this data and import the
> > > >> >>> >>>>> typosquatted domains into HBase. I thought, however, of another approach,
> > > >> >>> >>>>> which may trade off accuracy to remove the network hop and potential disk
> > > >> >>> >>>>> seek: constructing a bloom filter that includes the set of typosquatted
> > > >> >>> >>>>> domains.
> > > >> >>> >>>>>
> > > >> >>> >>>>> The challenge was that we don't have a way to do this currently. We do,
> > > >> >>> >>>>> however, have a loading infrastructure (e.g. the flatfile_loader) and
> > > >> >>> >>>>> configuration (see
> > > >> >>> >>>>> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > >> >>> >>>>> which handles:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - parsing flat files
> > > >> >>> >>>>> - transforming the rows
> > > >> >>> >>>>> - filtering the rows
> > > >> >>> >>>>>
> > > >> >>> >>>>> To enable the new use-case of generating a summary object (e.g. a bloom
> > > >> >>> >>>>> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> > > >> >>> >>>>> propose that we create a new utility that uses the same extractor config
> > > >> >>> >>>>> and adds the ability to:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - initialize a state object
> > > >> >>> >>>>> - update the object for every row
> > > >> >>> >>>>> - merge the state objects (in the case of multiple threads, in the
> > > >> >>> >>>>> case of one thread it's not needed).
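
The three hooks listed above amount to a small map-and-merge driver. The sketch below shows that shape in Python under stated assumptions: `summarize` is a hypothetical name, rows are split round-robin across a thread pool purely for illustration, and a plain `set` stands in for the bloom filter. It does not reflect how the Metron summarizer is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def summarize(rows: Iterable[Any],
              state_init: Callable[[], Any],
              state_update: Callable[[Any, Any], Any],
              state_merge: Callable[[List[Any]], Any],
              num_threads: int = 5) -> Any:
    """Build one summary object from rows using init/update/merge hooks."""
    rows = list(rows)
    # Round-robin split so each worker gets its own slice of the rows.
    chunks = [rows[i::num_threads] for i in range(num_threads)]

    def run(chunk: List[Any]) -> Any:
        state = state_init()                   # once per worker
        for row in chunk:
            state = state_update(state, row)   # once per row
        return state

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        states = list(pool.map(run, chunks))
    # With a single worker this merges a one-element list.
    return state_merge(states)


def summarize_domains(domains: Iterable[str], num_threads: int = 5) -> set:
    """Example wiring: a set stands in for the bloom filter."""
    return summarize(
        domains,
        state_init=set,
        state_update=lambda state, domain: state | {domain},
        state_merge=lambda states: set().union(*states),
        num_threads=num_threads,
    )
```

Because the merge hook receives every worker's state at once, the same three expressions can in principle be reused whether the work runs on one thread, several threads, or a cluster engine.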
> > > >> >>> >>>>>
> > > >> >>> >>>>> I think this is a sensible decision because:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - It's a minimal movement from the flat file loader
> > > >> >>> >>>>> - Uses the same configs
> > > >> >>> >>>>> - Abstracts and reuses the existing infrastructure
> > > >> >>> >>>>> - Having one extractor config means that it should be easier to
> > > >> >>> >>>>> generate a UI around this to simplify the experience
> > > >> >>> >>>>>
> > > >> >>> >>>>> All that being said, our extractor config is..shall we say...daunting :).
> > > >> >>> >>>>> I am sensitive to the fact that this adds to an existing difficult config.
> > > >> >>> >>>>> I propose that this is an initial step forward to support the use-case and
> > > >> >>> >>>>> we can enable something more composable going forward. My concern in
> > > >> >>> >>>>> considering this as the first step was that it felt that the composable
> > > >> >>> >>>>> units for data transformation and manipulation suddenly take us into a
> > > >> >>> >>>>> place where Stellar starts to look like Pig or Spark RDD API. I wasn't
> > > >> >>> >>>>> ready for that without a lot more discussion.
> > > >> >>> >>>>>
> > > >> >>> >>>>> To summarize, what I'd like to get from the community is, after reviewing
> > > >> >>> >>>>> the entire use-case at
> > > >> >>> >>>>> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > >> >>> >>>>>
> > > >> >>> >>>>> - Is this so confusing that it does not belong in Metron even as a
> > > >> >>> >>>>> first-step?
> > > >> >>> >>>>> - Is there a way to extend the extractor config in a less confusing
> > > >> >>> >>>>> way to enable this?
> > > >> >>> >>>>>
> > > >> >>> >>>>> I apologize for making the discuss thread *after* the JIRAs, but I felt
> > > >> >>> >>>>> this one might bear having some working code to consider.
> > > >> >>> >>>>>
> > > >> >>> >>>>
> > > >> >>> >>>
> > > >> >>> >>
> > > >>
> > > >> -------------------
> > > >> Thank you,
> > > >>
> > > >> James Sirota
> > > >> PMC- Apache Metron
> > > >> jsirota AT apache DOT org
> > > >>
> > > >
> > > >
> > >
> >
>
