metron-dev mailing list archives

From: Matt Foley <ma...@apache.org>
Subject: Re: [DISCUSS] Search Concerns
Date: Mon, 31 Oct 2016 22:26:29 GMT

Hi Jon, interesting topic.  A couple questions come to mind:

What is the reason we need to store very long not_analyzed values in the Index (which evidently
is not its design sweet spot)?  Is it because:
a. It is valuable to be able to Search against the whole string even when that string is >
32K?
b. If HDFS is not also being used, the Index is the only source of historical truth for these
records?

If the answer is (a), the follow-up question is:  Isn’t it safe to assume that there’s
a timestamp near the beginning of the raw string?  So if we provided a 32K prefix string match,
including the timestamp, wouldn’t that be pretty much as good?  Or are there lots of cases
where the first 32K+, including the timestamp, are truly the same, yet they differ after 32K
of text?  If 32K prefix string match is sufficient for 99% of cases, then a fixed-length truncation
limit slightly less than 32K, on both Index and Search, will suffice.  This is essentially
your approach #2 – and it’s simple.
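
A minimal sketch of that fixed-length truncation, applied identically at index time and at query time.  Lucene documents the term limit in bytes of the UTF-8 encoded term, so the sketch cuts on bytes rather than characters.  The 32000-byte margin and the function names are assumptions for illustration only; nothing here is existing Metron code.

```python
# Lucene's hard limit on a single term, in UTF-8 bytes (the 32766 figure from this thread).
MAX_TERM_BYTES = 32766
# Hypothetical safety margin: "slightly less than 32K" is not pinned down in the thread.
TRUNCATE_BYTES = 32000


def truncate_utf8(value: str, limit: int = TRUNCATE_BYTES) -> str:
    """Return the longest prefix of value whose UTF-8 encoding fits in limit bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops a multi-byte character that was cut in half at the boundary.
    return encoded[:limit].decode("utf-8", errors="ignore")


def index_value(raw: str) -> str:
    """Value to write into the not_analyzed field (hypothetical helper)."""
    return truncate_utf8(raw)


def query_value(user_input: str) -> str:
    """Value to use in an exact-match query; must use the same limit as index_value."""
    return truncate_utf8(user_input)
```

Because both sides use the same limit, a pasted >32K string matches the stored prefix exactly.  The cost is that two strings which differ only after the limit become indistinguishable in Search.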

If the answer is (b), then we can be satisfied with any approach that splits the very long
strings up somehow and stores them all in a way that allows reassembling them in the correct
sequence.  It does require a reformulator for querying in the UI, as you note in your “Concerns
- Thoughts #1” below.
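
One way the split-and-reassemble idea could look, as a sketch only.  The sub-field naming ("raw", "raw1", "raw2", ...), the 32000-byte chunk size, and both function names are assumptions, not anything that exists in Metron today.

```python
CHUNK_BYTES = 32000  # hypothetical per-chunk limit, below Lucene's 32766-byte term limit


def split_value(field: str, value: str, chunk_bytes: int = CHUNK_BYTES) -> dict:
    """Split value into ordered sub-fields (field, field1, field2, ...) that each
    fit in chunk_bytes once UTF-8 encoded."""
    if chunk_bytes < 4:
        raise ValueError("chunk_bytes must hold at least one UTF-8 character (4 bytes)")
    encoded = value.encode("utf-8")
    chunks = {}
    start = 0
    index = 0
    while start < len(encoded) or index == 0:
        end = min(start + chunk_bytes, len(encoded))
        # Back up if the cut lands on a UTF-8 continuation byte (0b10xxxxxx),
        # so no character is split across two chunks.
        while end < len(encoded) and (encoded[end] & 0xC0) == 0x80:
            end -= 1
        key = field if index == 0 else f"{field}{index}"
        chunks[key] = encoded[start:end].decode("utf-8")
        start = end
        index += 1
    return chunks


def reassemble(field: str, doc: dict) -> str:
    """Concatenate field, field1, field2, ... from doc in sequence order."""
    parts = []
    index = 0
    while True:
        key = field if index == 0 else f"{field}{index}"
        if key not in doc:
            break
        parts.append(doc[key])
        index += 1
    return "".join(parts)
```

A query reformulator in the UI would have to apply the same split to the user's input and match each chunk against the corresponding sub-field.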

I’m probably misunderstanding, but I don’t see how multi-field helps.  According to the
elastic.co doc you reference, multi-field allows storing and searching both the analyzed and
not_analyzed sub-fields without doubling the storage size (which is clearly very useful),
but the not_analyzed sub-field should still have the 32K limit.  Is this not so?  Or are you
proposing that a multi-field mapping could encapsulate the several sub-strings needed to contain
a >32K string? E.g., as “raw”, “raw1”, “raw2”, etc., where each is <32K?

A sub-case of approach #2, relating to your second “Other Thoughts”, would be:
Always truncate the indexed string to slightly less than 32K, but store the full value of
any such string in HDFS, and include in the Index a reference (file URI with offset) that
allows retrieving it.  This solution can be limited to just the >32K strings, so other
records will simply lack a URI field.  And it doesn’t have to be federated into Search as
you suggest:  The 32K prefix string Search should be quite adequate as suggested above, and
then the whole string can be read from HDFS if needed for historical reasons.
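
A rough sketch of this sub-case, with an in-memory byte stream standing in for the HDFS file.  The field suffixes (_hdfs_uri, _hdfs_offset, _hdfs_length), the 32000-byte limit, and the function names are all hypothetical; a real version would write through an HDFS client instead of io.BytesIO.

```python
import io

TRUNCATE_BYTES = 32000  # hypothetical limit, below Lucene's 32766-byte term limit


def index_with_reference(doc, field, store, store_uri, limit=TRUNCATE_BYTES):
    """Return the document to index. If doc[field] is too long, append the full
    value to store and index a truncated prefix plus a reference to the full copy."""
    encoded = doc[field].encode("utf-8")
    if len(encoded) <= limit:
        return dict(doc)  # short records simply lack the reference fields
    offset = store.seek(0, io.SEEK_END)  # append at the current end of the file
    store.write(encoded)
    indexed = dict(doc)
    indexed[field] = encoded[:limit].decode("utf-8", errors="ignore")
    indexed[f"{field}_hdfs_uri"] = store_uri
    indexed[f"{field}_hdfs_offset"] = offset
    indexed[f"{field}_hdfs_length"] = len(encoded)
    return indexed


def fetch_full_value(indexed, field, store):
    """Read the full value back, following the reference if one is present."""
    if f"{field}_hdfs_uri" not in indexed:
        return indexed[field]
    store.seek(indexed[f"{field}_hdfs_offset"])
    return store.read(indexed[f"{field}_hdfs_length"]).decode("utf-8")
```

Search only ever sees the truncated prefix; the reference is followed on demand when someone needs the whole string.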

Cheers,
--Matt

On 10/31/16, 1:38 PM, "Zeolla@GMail.com" <zeolla@gmail.com> wrote:

    Hi All,
    
    I've been doing a bit of bug hunting with bro logs within the search tier
    of Metron over the last couple of weeks.  During this (in this thread I'm
    primarily referring to METRON-517
    <https://issues.apache.org/jira/browse/METRON-517>) I've found a couple of
    issues that I'd like to discuss with the wider population.  The primary
    issue here is due to a limitation in Lucene itself, meaning we will
    encounter the same problem with either Elasticsearch or Solr as far as I
    can tell.
    
    *Problems*
    
    1. Lucene, including the latest version (6.2.1), appears to have a
    hard-coded maximum term length of 32766 (reference
    <https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH>
    here for
    <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766>
    details <https://github.com/apache/lucene-solr/searc>).  If the
    indexingBolt attempts to input a not_analyzed
    <https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2>
    tuple (set via these files <https://github.co>) which exceeds that limit,
    the entire message is rejected.
     - If you simply analyze the field it reduces the size for any individual
    term, but it also throws a wrench in your queries, when you are searching
    for a match of that entire field.
    
    2. From what I can tell, failures are only logged via
    /var/log/elasticsearch/metron.log and in the Storm UI under the Bolt Id's
    "Last error" column.
     - It looks like this is already partially documented as METRON-307
    <https://issues.apache.org/jira/browse/METRON-307>.
    
    
    From here on out I'm going to focus on Problem 1.
    
    
    *Thoughts*
    
    1. We could use multifield
    <https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html>
    mappings to be able to do both a full and partial search for fields that
    exceed 32766 length.
    
    2. Truncate fields in the indexingBolt to keep non-analyzed values below
    the 32766 limit.
    
    3. Ensure that any field with the ability to grow beyond the 32766 limit is
    analyzed, and that no single term surpasses the max term limit.
    
    There are some other ways to fix the problem, such as to not store the field
    <https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>,
    not index the field
    <https://www.elastic.co/guide/en/elasticsearch/refe>, ignore
    fields larger than a set value
    <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>,
    etc. but I personally see these as confusing (to the end user) and not very
    helpful.  Others have brought up dynamic templates
    <https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html>
    as well, but I haven't looked into them yet.
    
    
    *Concerns*
    
    Thought #1
    
    - My current favourite (this is also what logstash does), but it requires
    that we analyze the whole message and store a truncated version of the
    whole message as a single large term.  If truncation occurs we would need
    to:
    
        - Add key-value pairs to the tuple that indicate that it was
    truncated, which field(s) were truncated, the pre-truncation size of the
    field(s), a hash of the pre-truncation field, and a timestamp of the
    truncation (to help detect data tampering).
    
        - Provide UI elements that clearly show that a specific message was
    truncated.
    
    - May need to abstract querying in the UI.  If so, this requires a sub-task
    to METRON-195 and looking into an interim solution with Kibana.
    
    - See "Other thoughts".
    
    
    Thought #2
    
    - If we go this path we’d need to address how to do a full string match
    (i.e. abstract a copy/paste of a > 32766 length URI to use as a query in
    the UI).  This may or may not be possible with Kibana – if not, this needs
    to be a subtask in METRON-195.
    
    - Add key-value pairs to the tuple that indicate that it was truncated,
    which field(s) were truncated, the pre-truncation size of the field(s),
    a hash of the pre-truncation field, and a timestamp of the truncation
    (to help detect data tampering).
    
    - Provide UI elements that clearly show that a specific message was
    truncated.
    
    - See "Other thoughts".
    
    
    Thought #3
    
    - Not a huge fan of this solution because of how it affects whole string
    matching.
    
    - May need a custom analyzer to cut up the URI properly.  Here are some
    relevant materials if we go that path:
    <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>
    <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis->
    <http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>
    <http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>
    <https://tools.ietf.org/html/rfc398>
    <https://tools.ietf.org/html/rfc3986#section-3>
    <http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>
    <http://www.regexplanet.com/advanced/java/index.html>
    <http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>
    
    
    *Other thoughts*
    
    - Maybe we can use the profiler to monitor truncated=True and watch for
    people messing with this on purpose.
    
    - We could add a persistent UUID to every tuple and map HDFS against
    Elasticsearch data.  This could be used by a UI/frontend to query across
    both datastores.  It would be very useful in the case of truncation:
    provide a configurable setting that is false by default, but if set to
    true it will query HDFS for the full version of any record that has
    truncated:true in the indexed store.
    
    
    I have more thoughts but this has gotten more than long enough already and
    I wanted to send it off today.  Thoughts?
    Jon
    -- 
    
    Jon
    




