metron-dev mailing list archives

From "Zeolla@GMail.com" <zeo...@gmail.com>
Subject [DISCUSS] Search Concerns
Date Mon, 31 Oct 2016 20:38:44 GMT
Hi All,

I've been doing a bit of bug hunting with Bro logs within the search tier
of Metron over the last couple of weeks.  During this (in this thread I'm
primarily referring to METRON-517
<https://issues.apache.org/jira/browse/METRON-517>) I've found a couple of
issues that I'd like to discuss with the wider population.  The primary
issue stems from a limitation in Lucene itself, so as far as I can tell we
will encounter the same problem whether we use Elasticsearch or Solr.

*Problems*

1. Lucene, including the latest version (6.2.1), appears to have a hard-coded
maximum term length of 32766 bytes (reference
<https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH>;
see <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766>
for details).  If the indexingBolt attempts to index a message with a not_analyzed
<https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2>
field (set via these files <https://github.co>) whose value exceeds that limit,
the entire message is rejected.
 - Simply analyzing the field reduces the size of any individual term, but
it also throws a wrench into your queries when you are searching for a
match against the entire field.

2. From what I can tell, failures are only logged in
/var/log/elasticsearch/metron.log and in the Storm UI under the bolt's
"Last error" column.
 - It looks like this is already partially documented as METRON-307
<https://issues.apache.org/jira/browse/METRON-307>.


From here on out I'm going to focus on Problem 1.
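
To make the failure mode concrete, here is a minimal sketch (Python, purely
for illustration; Metron's bolts are Java, and the "uri" field name is just
an example) of the check that effectively happens on the Lucene side: any
single term whose UTF-8 encoding exceeds 32766 bytes causes the rejection,
and for a not_analyzed field the whole value is one term.

    # Rough illustration of why a message gets dropped: Lucene refuses any
    # single term longer than 32766 UTF-8 bytes, and a not_analyzed field
    # is indexed as exactly one term.
    MAX_TERM_LENGTH = 32766  # bytes, per IndexWriter.MAX_TERM_LENGTH

    def oversized_fields(message):
        """Return the field names whose UTF-8 size exceeds the Lucene limit."""
        return [k for k, v in message.items()
                if isinstance(v, str) and len(v.encode("utf-8")) > MAX_TERM_LENGTH]

    # Example: a bro log with an enormous URI (field name is hypothetical)
    msg = {"uri": "http://example.com/" + "a" * 40000, "ip_src_addr": "10.0.0.1"}
    print(oversized_fields(msg))  # ['uri'] -> the whole message would be rejected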


*Thoughts*

1. We could use multifield
<https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html>
mappings to support both full and partial searches of fields that can
exceed the 32766-byte limit (a rough mapping sketch follows this list).

2. Truncate fields in the indexingBolt to keep not_analyzed values below
the 32766-byte limit.

3. Ensure that any field that can grow beyond the 32766-byte limit is
analyzed, and that no single term surpasses that limit.
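
For Thought #1, a minimal sketch of what such a mapping might look like
(ES 2.x-style syntax written as a Python dict just for readability; the
"uri" field, the "raw" sub-field name, and the cutoff value are examples,
not what our templates actually contain): the main field is analyzed so no
single term blows the limit, while a raw sub-field keeps a not_analyzed
copy for whole-value matches.

    # Hypothetical multi-field mapping sketch (ES 2.x syntax).  Exact-match
    # queries would target "uri.raw", partial matches "uri".  The raw copy
    # still has to stay under 32766 bytes, either by truncating it in the
    # bolt or (less transparently) via something like ignore_above.
    uri_mapping = {
        "uri": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "raw": {
                    "type": "string",
                    "index": "not_analyzed",
                }
            },
        }
    }

Something of this shape could be dropped into the properties section of the
bro index template and pushed to the cluster with a normal template PUT.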

There are some other ways to fix the problem, such as not storing the field
<https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>,
not indexing the field
<https://www.elastic.co/guide/en/elasticsearch/refe>, ignoring
fields larger than a set value
<https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>,
etc., but I personally see these as confusing (to the end user) and not very
helpful.  Others have brought up dynamic templates
<https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html>
as well, but I haven't looked into them yet.
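
For reference, here is roughly what those alternatives look like (again
ES 2.x-style syntax as Python dicts; the field names and the 10000 cutoff
are made up).  Each one quietly changes what a search can return, which is
part of why I find them confusing for the end user.

    # Sketches of the alternatives mentioned above (hypothetical names).
    alternatives = {
        # Don't index the field at all: it stays in _source but is never searchable.
        "uri_not_indexed": {"type": "string", "index": "no"},

        # Silently skip indexing any value longer than N characters.
        "uri_ignore_above": {"type": "string", "index": "not_analyzed",
                             "ignore_above": 10000},
    }

    # A dynamic template could apply one of those choices to every string
    # field without listing them individually.
    dynamic_template_example = {
        "dynamic_templates": [{
            "strings": {
                "match_mapping_type": "string",
                "mapping": {"type": "string", "index": "not_analyzed",
                            "ignore_above": 10000},
            }
        }]
    }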


*Concerns*

Thought #1

- My current favourite (this is also what Logstash does), but it requires
that we analyze the whole field and store a truncated version of it as a
single large term.  If truncation occurs we would need to:

    - Add key-value pairs to the tuple that indicate that it was
truncated, which field(s) were truncated, the pre-truncation size of the
field(s), a hash of each pre-truncation field, and a timestamp of the
truncation (i.e. so data tampering can be detected) - see the sketch after
this list.

    - Provide UI elements that clearly show that a specific message was
truncated.

- May need to abstract querying in the UI.  If so, this requires a sub-task
to METRON-195 and looking into an interim solution with Kibana.

- See "Other thoughts".
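
To be concrete about that bookkeeping, here is a minimal sketch (Python for
brevity; the indexingBolt is Java, and key names like "truncated_fields"
and "_original_sha256" are made up rather than a proposed schema) of what
the bolt could attach when it truncates a field:

    import hashlib
    import time

    MAX_TERM_BYTES = 32766  # Lucene's per-term limit

    def truncate_field(message, field, limit=MAX_TERM_BYTES):
        """Truncate message[field] if its UTF-8 encoding exceeds `limit` bytes,
        recording enough metadata to spot the truncation (or tampering) later."""
        value = message.get(field)
        if not isinstance(value, str):
            return message
        encoded = value.encode("utf-8")
        if len(encoded) <= limit:
            return message
        # Cut at the byte limit; drop any partially-cut character at the end.
        message[field] = encoded[:limit].decode("utf-8", errors="ignore")
        # Bookkeeping so a UI (or the profiler) can see exactly what happened.
        message["truncated"] = True
        message.setdefault("truncated_fields", []).append(field)
        message[field + "_original_length"] = len(encoded)
        message[field + "_original_sha256"] = hashlib.sha256(encoded).hexdigest()
        message["truncation_timestamp"] = int(time.time() * 1000)
        return message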


Thought #2

- If we go this route we'd need to address how to do a full string match
(e.g. the UI would have to handle a copy/pasted URI longer than 32766 bytes
being used as a query).  This may or may not be possible with Kibana; if
not, this needs to be a subtask in METRON-195.

- Add key-value pairs to the tuple that indicate that it was truncated,
which field(s) were truncated, the pre-truncation size of the field(s), a
hash of each pre-truncation field, and a timestamp of the truncation (i.e.
so data tampering can be detected).

- Provide UI elements that clearly show that a specific message was
truncated.

- See "Other thoughts".


Thought #3

- Not a huge fan of this solution because of how it affects whole string
matching.

- May need a custom analyzer to cut up the URI properly.  Here are some
relevant materials if we go that path:
<https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>
<http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>
<http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>
<https://tools.ietf.org/html/rfc3986#section-3>
<http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>
<http://www.regexplanet.com/advanced/java/index.html>
<http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>
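
If we did go the custom-analyzer route, one possible shape for it (ES 2.x
index settings sketched as a Python dict; the "uri_tokenizer" and
"uri_analyzer" names are made up) is a pattern tokenizer that splits a URI
on its RFC 3986 delimiters so that the scheme, host, path segments, and
query parameters each become separate, reasonably sized terms:

    # One possible custom analyzer for URIs (hypothetical names).
    uri_analysis_settings = {
        "analysis": {
            "tokenizer": {
                "uri_tokenizer": {
                    "type": "pattern",
                    # RFC 3986 gen-delims and sub-delims
                    "pattern": "[:/?#\\[\\]@!$&'()*+,;=]+",
                }
            },
            "analyzer": {
                "uri_analyzer": {
                    "type": "custom",
                    "tokenizer": "uri_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }

A pathological single segment could still exceed 32766 bytes, so this would
probably still want truncation as a backstop.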


*Other thoughts*

- Maybe we can use the profiler to monitor truncated=True and watch for
people messing with this on purpose.

- We could add a persistent UUID to every tuple and map the HDFS data
against the Elasticsearch data.  This could be used by a UI/frontend to
query across both datastores.  It would be very useful in the case of
truncation - provide a configurable setting that is false by default but,
if set to true, queries HDFS for the full original data whenever a message
has truncated:true in the indexed store.
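
A tiny sketch of the UUID idea (Python, illustrative only; the
"message_guid" field name is hypothetical): stamp every message once, early
in the pipeline, so the HDFS copy and the Elasticsearch copy carry the same
identifier and a frontend can join them, e.g. to pull the untruncated value
out of HDFS when truncated:true.

    import uuid

    def stamp_guid(message, field="message_guid"):
        """Attach a persistent random identifier to a message exactly once,
        so the copies written to HDFS and to Elasticsearch can be correlated."""
        message.setdefault(field, str(uuid.uuid4()))
        return message

    msg = stamp_guid({"ip_src_addr": "10.0.0.1", "truncated": True})
    # When truncated is true and the operator opts in, a UI could fetch the
    # full record from HDFS by this same guid.
    print(msg["message_guid"])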


I have more thoughts but this has gotten more than long enough already and
I wanted to send it off today.  Thoughts?

Jon
-- 

Jon
