metron-dev mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: [DISCUSS] Generating and Interacting with serialized summary objects
Date Wed, 03 Jan 2018 17:25:27 GMT
There is some really cool stuff happening here, if only I’d been allowed to see the lists
over Christmas... :)

A few thoughts...

I like Otto’s generalisation of the problem to include specific local Stellar objects in
a cache loaded from a store (HDFS seems a natural place, but not the only one; maybe even a
web service / local microservicey object provider!?). That said, I suspect that’s a good
platform optimisation approach. Should we look at this as a separate piece of work, given it
extends beyond the scope of the summarisation concept, and ultimately use it as a back-end
to feed the summarising engine proposed here for the enrichment loader?
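
As a rough illustration of that cache idea, here is a minimal Python sketch. It is not Metron code: the ObjectCache class and the loader callable are hypothetical names, and a real object provider would read from HDFS or call a web service rather than use the in-memory stand-in shown here.

```python
class ObjectCache:
    """Caches deserialized summary objects by location, loading each at most once."""

    def __init__(self, loader):
        # loader is a hypothetical callable: it takes a location string and
        # returns the deserialized object (from HDFS, a web service, etc.).
        self._loader = loader
        self._objects = {}

    def get(self, location):
        # Only hit the backing store on the first request for a location.
        if location not in self._objects:
            self._objects[location] = self._loader(location)
        return self._objects[location]


# Stand-in loader that records how often the backing store is hit.
load_calls = []


def fake_loader(location):
    load_calls.append(location)
    return {"loaded_from": location}


cache = ObjectCache(fake_loader)
first = cache.get("/some/summary/object")
second = cache.get("/some/summary/object")
```

Because the store is hidden behind a single callable, swapping HDFS for another provider would not change the callers of the cache.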

On the more specific use case, one thing I would comment on is the configuration approach.
The iteration loop (state_{init|update|merge}) should be consistent with the way we handle
things like the profiler config, since it’s the same approach to data handling.
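
To make the init / update / merge loop concrete, here is a minimal sketch in Python rather than Stellar. Only the three function names come from this thread; the filter size, the hashing scheme and the summarize helper are assumptions made for illustration, and none of this is taken from PR #879 or from Metron's own bloom filter functions.

```python
import hashlib
from functools import reduce

NUM_BITS = 1 << 16   # illustrative filter size (assumption)
NUM_HASHES = 3       # illustrative number of hash functions (assumption)


def _positions(item):
    """Derive NUM_HASHES bit positions for an item from salted SHA-256 digests."""
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def state_init():
    """Run once per worker: an empty Bloom filter, stored as an int bitmask."""
    return 0


def state_update(state, row):
    """Run once per row: set the bits for this row's value."""
    for pos in _positions(row):
        state |= 1 << pos
    return state


def state_merge(left, right):
    """Combine two worker states: the union of two Bloom filters is a bitwise OR."""
    return left | right


def might_contain(state, item):
    """True if every bit for the item is set (false positives are possible)."""
    return all((state >> pos) & 1 for pos in _positions(item))


def summarize(partitions):
    """Hypothetical driver: one state per partition, then merge them all."""
    worker_states = []
    for rows in partitions:
        state = state_init()
        for row in rows:
            state = state_update(state, row)
        worker_states.append(state)
    return reduce(state_merge, worker_states, state_init())


summary = summarize([["google", "amazon"], ["apache"]])
```

The merge step is what lets the same three functions run on one thread, several threads, or a cluster: how the rows are partitioned does not change the resulting filter.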

The other thing that seems to have crept in here is the interface to something like Spark,
which, again, I am very keen on seeing happen. That said, I am not sure how that would
happen in this context, unless you’re talking about pushing to something like Livy, for example
(eminently sensible for things like cross-instance caching and faster RPC-ish access to an
existing Spark context), which seems to be what Casey is driving at with the Spark piece.

To address the question of text manipulation in Stellar / Metron enrichment ingest etc., we
already have this outside of the context of the issues here. I would argue that yes, we don’t
want too many paths for this, and that maybe our parser approach might be heavily related
to text-based ingest. I would say the scope worth dealing with here, though, is not really text
manipulation but summarisation, which is not well served by existing CLI tools like awk /
sed and friends.

Simon

> On 3 Jan 2018, at 15:48, Nick Allen <nick@nickallen.org> wrote:
> 
>> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
>> this will impact performance
> 
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
> 
> 
> 
> 
> 
> 
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <cestella@gmail.com> wrote:
> 
>> Thanks for the feedback, Nick.
>> 
>> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>> 
>> I would argue that we are not reinventing the wheel for text manipulation
>> as the extractor config exists already and we are doing a similar thing in
>> the flatfile loader (in fact, the code is reused and merely extended).
>> Transformation operations are already supported in our codebase in the
>> extractor config, this PR has just added some hooks for stateful
>> operations.
>> 
>> Furthermore, we will need a configuration object to pass to the REST call
>> if we are ever to create a UI around importing data into hbase or creating
>> these summary objects.
>> 
>> Regarding your example:
>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> 
>> I'm very sympathetic to this type of extension, but it has some issues:
>> 
>>   1. This implies a single-threaded addition to the bloom filter.
>>      1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>>      think this will impact performance
>>      2. There's not a way to specify how to merge across threads if we do
>>      make a multithread command line option
>>   2. This restricts these kinds of operations to roles with heavy Unix CLI
>>   knowledge, which often isn't the kind of person who would be doing this
>>   type of operation
>>   3. What if we need two variables passed to stellar?
>>   4. This approach will be harder to move to Hadoop.  Eventually we will
>>   want to support data on HDFS being processed by Hadoop (similar to the
>>   flatfile loader), so instead of -m LOCAL being passed for the flatfile
>>   summarizer you'd pass -m SPARK and the processing would happen on the
>>   cluster
>>      1. This is particularly relevant in this case as it's an
>>      embarrassingly parallel problem in general
>> 
>> In summary, while this CLI approach is attractive, I prefer the extractor
>> config solution because it is the solution with the smallest iteration
>> that:
>> 
>>   1. Reuses existing metron extraction infrastructure
>>   2. Provides the most solid base for the extensions that will be sorely
>>   needed soon (and will keep it in parity with the flatfile loader)
>>   3. Provides the most solid base for a future UI extension in the
>>   management UI to support both summarization and loading
>> 
>> 
>> 
>> 
>> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <nick@nickallen.org> wrote:
>> 
>>> First off, I really do like the typosquatting use case and a lot of what
>>> you have described.
>>> 
>>>> We need a way to generate the summary sketches from flat data for
>>>> this to work.
>>>> ...
>>>> 
>>> 
>>> I took this quote directly from your use case.  Above is the point that
>>> I'd like to discuss and what your proposed solutions center on.  This is
>>> what I think you are trying to do, at least with PR #879
>>> <https://github.com/apache/metron/pull/879>...
>>> 
>>> (Q) Can we repurpose Stellar functions so that they can operate on text
>>> stored in a file system?
>>> 
>>> 
>>> Whether we use the (1) Configuration or the (2) Function-based approach
>>> that you described, fundamentally we are introducing new ways to perform
>>> text manipulation inside of Stellar.
>>> 
>>> IMHO, I'd rather not reinvent the wheel for text manipulation.  It would
>>> be painful to implement and maintain a bunch of Stellar functions for
>>> text manipulation.  People already have a large number of tools available
>>> to do this and everyone has their favorites.  People are resistant to
>>> learning something new when they already are familiar with another way to
>>> do the same thing.
>>> 
>>> So then the question is, how else can we do this?  My suggestion is that
>>> rather than introducing text manipulation tools inside of Stellar, we
>>> allow people to use the text manipulation tools they already know, but
>>> with the Stellar functions that we already have.  And the obvious way to
>>> tie those two things together is the Unix pipeline.
>>> 
>>> A quick, albeit horribly incomplete, example to flesh this out a bit more
>>> based on the example you have in PR #879
>>> <https://github.com/apache/metron/pull/879>.  This would allow me to
>>> integrate Stellar with whatever external tools that I want.
>>> 
>>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <cestella@gmail.com> wrote:
>>> 
>>>> I'll start this discussion off with my idea around a 2nd step that is
>>>> more adaptable.  I propose the following set of Stellar functions backed
>>>> by Spark in the metron-management project:
>>>> 
>>>>   - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
>>>>   Dataframe for reading the flatfile
>>>>   - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
>>>>   dataframe
>>>>   - SUMMARIZE(state_init, state_update, state_merge): Summarize the
>>>>   dataframe using the lambda functions:
>>>>      - state_init - executed once per worker to initialize the state
>>>>      - state_update - executed once per row
>>>>      - state_merge - Merge the worker states into one worker state
>>>>   - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
>>>>   output_path on HDFS.
>>>> 
>>>> This would enable more flexibility and composability than the
>>>> configuration-based approach that we have in the flatfile loader.
>>>> My concern with this approach, and the reason I didn't do it initially,
>>>> was that I think that users will want at least 2 ways to summarize data
>>>> (or load data):
>>>> 
>>>>   - A configuration based approach, which enables a UI
>>>>   - A set of stellar functions via the scriptable REPL
>>>> 
>>>> I would argue that both have a place and I started with the
>>>> configuration based approach as it was a more natural extension of what
>>>> we already had.
>>>> I'd love to hear thoughts about this idea too.
>>>> 
>>>> 
>>>> On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <cestella@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I wanted to get some feedback on a sensible plan for something.  It
>>>>> occurred to me the other day when considering the use-case of
>>>>> detecting typosquatted domains, that one approach was to generate the
>>>>> set of typosquatted domains for some set of reference domains and
>>>>> compare domains as they flow through.
>>>>> 
>>>>> One way we could do this would be to generate this data and import the
>>>>> typosquatted domains into HBase.  I thought, however, that another
>>>>> approach, which may trade off accuracy to remove the network hop and
>>>>> potential disk seek, would be to construct a bloom filter that
>>>>> includes the set of typosquatted domains.
>>>>> 
>>>>> The challenge was that we don't have a way to do this currently.  We
>>>>> do, however, have a loading infrastructure (e.g. the flatfile_loader)
>>>>> and configuration (see
>>>>> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
>>>>> which handles:
>>>>> 
>>>>>   - parsing flat files
>>>>>   - transforming the rows
>>>>>   - filtering the rows
>>>>> 
>>>>> To enable the new use-case of generating a summary object (e.g. a
>>>>> bloom filter), in METRON-1378 (https://github.com/apache/metron/pull/879)
>>>>> I propose that we create a new utility that uses the same extractor
>>>>> config and adds the ability to:
>>>>> 
>>>>>   - initialize a state object
>>>>>   - update the object for every row
>>>>>   - merge the state objects (needed only with multiple threads; with
>>>>>   one thread there is nothing to merge).
>>>>> 
>>>>> I think this is a sensible decision because:
>>>>> 
>>>>>   - It's a minimal movement from the flat file loader
>>>>>      - Uses the same configs
>>>>>      - Abstracts and reuses the existing infrastructure
>>>>>   - Having one extractor config means that it should be easier to
>>>>>   generate a UI around this to simplify the experience
>>>>> 
>>>>> All that being said, our extractor config is..shall we say...daunting :).
>>>>> I am sensitive to the fact that this adds to an existing difficult
>>>>> config.  I propose that this is an initial step forward to support the
>>>>> use-case and we can enable something more composable going forward.  My
>>>>> concern in considering this as the first step was that it felt that the
>>>>> composable units for data transformation and manipulation suddenly take
>>>>> us into a place where Stellar starts to look like Pig or Spark RDD API.
>>>>> I wasn't ready for that without a lot more discussion.
>>>>> 
>>>>> To summarize, what I'd like to get from the community is, after
>>>>> reviewing the entire use-case at
>>>>> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>>>>> 
>>>>>   - Is this so confusing that it does not belong in Metron even as a
>>>>>   first-step?
>>>>>   - Is there a way to extend the extractor config in a less confusing
>>>>>   way to enable this?
>>>>> 
>>>>> I apologize for making the discuss thread *after* the JIRAs, but I
>>>>> felt this one might bear having some working code to consider.
>>>>> 
>>>> 
>>> 
>> 

