lucene-java-user mailing list archives

From Gergely Nagy <foge...@gmail.com>
Subject Re: Indexing and searching a DateTime range
Date Mon, 09 Feb 2015 08:55:39 GMT
Thank you for the great answer Uwe!

Sadly, my department rejected the suggested combination of Logstash +
Elasticsearch. In their experience, Elasticsearch works fine on about
3 days of log data, but slows down terribly with something on the order
of 3 months of data.

But I will take a look at Logstash anyway. After skimming through the
Logstash documentation, I can see that there are so-called Logstash "outputs":
http://logstash.net/docs/1.4.2/tutorials/getting-started-with-logstash

What do you think, is it possible to use Logstash as a preprocessor which
outputs the filtered logs and feeds them into my Lucene app?

Or, if that's not a good idea, could you elaborate on how I can do the
preprocessing you are referring to? Do you mean implementing an Analyzer
like these?
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html

Thank you,
Gergely Nagy

2015-02-09 17:10 GMT+09:00 Uwe Schindler <uwe@thetaphi.de>:

> Hi,
>
> > I am in the beginning of implementing a Lucene application which would
> > supposedly search through some log files.
> >
> > One of the requirements is to return results between a time range. Let's
> > say these are two lines in a series of log files:
> > 2015-02-08 00:02:06.852Z INFO...
> > ...
> > 2015-02-08 18:02:04.012Z INFO...
> >
> > Now I need to search for these lines and return all the text in-between.
> > I was using this demo application to build an index:
> > http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> >
> > After that my first thought was using a term range query like this:
> >     TermRangeQuery query = TermRangeQuery.newStringRange("contents",
> >         "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> >
> > But for some reason this didn't return any results.
>
> Lucene tokenizes the text, so you can search for terms ("words"). Those
> dates are split into several terms. In general, this is not the way to
> search on numeric / date ranges:
> - it is horribly slow, because there are many terms in that "contents"
> field.
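
To check my understanding of the point above, here is a minimal Java sketch.
It does not use Lucene at all: tokenize() is a hypothetical, crude stand-in
for an analyzer (lowercase, split on non-alphanumeric characters), and real
Lucene analyzers differ in detail. It only illustrates that once a log line
is split into terms, no single term falls inside the string range I passed to
the TermRangeQuery, which would explain the empty result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenRangeDemo {

    // ASSUMPTION: crude approximation of tokenization, not a Lucene analyzer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Inclusive lexicographic range check on a single term.
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Returns the terms of the line that fall inside [lower, upper].
    public static List<String> matchingTerms(String line, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String term : tokenize(line)) {
            if (inRange(term, lower, upper)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String lower = "2015-02-08 00:02:06.852Z";
        String upper = "2015-02-08 18:02:04.012Z";
        String line = "2015-02-08 00:02:06.852Z INFO...";

        System.out.println("terms:    " + tokenize(line));
        System.out.println("in range: " + matchingTerms(line, lower, upper));
    }
}
```

The whole timestamp string would be inside the range, but it never exists as
a single term once the text is tokenized, so the range has nothing to match.
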
>
> > Then I was Googling for a while how to solve this problem, but all the
> > datetime examples I found are searching based on a much simpler field.
> > Those examples usually use a field like this:
> > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
>
> That is the way to do it. Log files are "structured", so you need to do
> preprocessing. You have to put the different information into different
> fields (like the "modified" field, as mentioned in your example). You can
> still fill the "contents" field as you did above with all information to do
> plain fulltext search (like finding a log line based on some message
> contents), but in addition, you use other fields for more specific searches
> like ranges. In Lucene you generally fill several fields with the redundant
> information (like dates in fulltext field and some extra timestamp field).
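
If I read this correctly, the preprocessing could look roughly like the
sketch below. It is plain Java with no Lucene calls, and it rests on
assumptions of mine: every line starts with a 24-character UTC timestamp in
the format shown in my example, followed by a level and a message. The class
and field names (LogLinePreprocessor, Entry, timestampMillis, level, message)
are hypothetical. The resulting long is the kind of value that would go into
a LongField such as "modified" in the example I quoted, so that a numeric
range can be used instead of a term range on "contents".

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class LogLinePreprocessor {

    // ASSUMPTION: timestamps look like "2015-02-08 00:02:06.852Z" (UTC).
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS'Z'");
    private static final int TIMESTAMP_LENGTH = 24;

    // One parsed log line, split into separately indexable pieces.
    public static final class Entry {
        public final long timestampMillis; // candidate for a numeric field
        public final String level;         // candidate for an exact-match field
        public final String message;       // candidate for the full-text field

        Entry(long timestampMillis, String level, String message) {
            this.timestampMillis = timestampMillis;
            this.level = level;
            this.message = message;
        }
    }

    // Splits "<timestamp> <level> <message>" into its parts.
    public static Entry parse(String line) {
        String ts = line.substring(0, TIMESTAMP_LENGTH);
        long millis = LocalDateTime.parse(ts, TIMESTAMP_FORMAT)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();

        String rest = line.substring(TIMESTAMP_LENGTH).trim();
        int space = rest.indexOf(' ');
        String level = space < 0 ? rest : rest.substring(0, space);
        String message = space < 0 ? "" : rest.substring(space + 1);
        return new Entry(millis, level, message);
    }

    // Plain-Java analogue of an inclusive numeric range check on the timestamp.
    public static boolean inRange(Entry entry, long fromMillis, long toMillis) {
        return entry.timestampMillis >= fromMillis && entry.timestampMillis <= toMillis;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; only the timestamp format comes from my logs.
        Entry e = parse("2015-02-08 00:02:06.852Z INFO service started");
        System.out.println(e.timestampMillis + " | " + e.level + " | " + e.message);
    }
}
```

Lines that do not start with a timestamp (stack traces, for example) would
make parse() throw, so a real version would need to handle them somehow.
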
>
> The information you return to the user can be put into a "stored" only
> field. This one is returned with search results.
>
> > So I was wondering, how can I index these log files to make a range query
> > work on them? Any ideas? Maybe my approach is completely wrong. I am
> > still new to Lucene so any help is appreciated.
>
> The first approach is wrong, the second approach is right. You just have to
> make your field definitions correct.
>
> An alternative would be to use Logstash in combination with Elasticsearch,
> which is based on Lucene. This has everything you want to do already
> implemented for log files.
>
> Uwe
