lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Indexing and searching a DateTime range
Date Tue, 10 Feb 2015 09:35:47 GMT
Hi,

> OK. I found the Alfresco code on GitHub. So it's open source it seems.
> 
> And I found the DateTimeAnalyser, so I will just take that code as a starting
> point:
> https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/source/java/org/alfresco/repo/search/impl/lucene/analysis

This won't help you:
a) it's outdated code from very early Lucene versions
b) it would be slow, because it does not use Lucene's numeric features, so searching for
date ranges would perform very badly

Basically, I don't really understand your problem:
If you use Lucene directly, you are responsible for processing the text before it goes into
the index. If you want to create one Lucene Document per line, it is up to you to do that; Lucene
has no functionality to split documents. You have to process your input and bring it into
the format Lucene expects: "Documents" consisting of "Key/Value" pairs. Analyzers are only
there to process one specific field and tokenize its input (so the index contains words
and not the whole field as one term). Analyzers have nothing to do with analyzing the structure
of log lines, because each one only works on a single field, which does not help for structured
queries such as date ranges.

So basically your indexing workflow is:

- Open the log file
- Read the log file line by line
- Create a Lucene Document instance
- Extract "interesting" key/value pairs from the line, e.g. by using regular expressions
(like Logstash does). Basically this would, for example, "detect" the date, the class name from
Log4J files, or whatever else
- Add those key/value pairs as fields (numeric, text, ...) to the Lucene Document: one
field for the date, one field for the message content, one field for the class name, ... (those fields
don't need to be stored unless you want to display them in search results, see below)
- In addition, it is wise to add an extra Lucene TextField instance (stored AND indexed,
with a good Analyzer) that contains the whole line (redundantly). By storing it,
you are able to return the whole log line in your search results
- Index the document
- Process the next line
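The extraction step above can be sketched with the stdlib alone. The line layout, regex, class name and field roles here are illustrative assumptions, not from the thread; the three extracted values are what would then go into the Lucene Document as a numeric date field, a keyword field and an analyzed text field:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extraction step: pull the timestamp (as epoch millis), the level and the
// message out of one log line, ready to be added as separate Lucene fields.
public class LogLineExtractor {
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}Z)\\s+(\\S+)\\s+(.*)$");

    public static long parseMillis(String timestamp) throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS'Z'");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.parse(timestamp).getTime();
    }

    // Returns {epochMillis, level, message}, or null if the line doesn't match.
    public static String[] extract(String line) throws ParseException {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[] {
                Long.toString(parseMillis(m.group(1))), // -> numeric date field
                m.group(2),                             // -> keyword field
                m.group(3)                              // -> analyzed text field
        };
    }

    public static void main(String[] args) throws ParseException {
        String[] kv = extract("2015-02-08 00:02:06.852Z INFO Starting service");
        System.out.println(kv[0] + " | " + kv[1] + " | " + kv[2]);
    }
}
```

In a real indexer, each returned value would be wrapped in the appropriate Lucene field type (numeric for the millis, text for the message), plus the stored whole-line field described above.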

If you don't want to write this code on your own, use Logstash and Elasticsearch (or write
a separate plugin for Logstash that indexes to Lucene). But your comment is strange: you say
Elasticsearch and Logstash are too slow for many log lines. How would Lucene then be faster?
Elasticsearch also uses Lucene under the hood. If it is slow, the main problem is in most cases
incorrect data types while indexing (like using a text field for dates and doing range queries on it).
It is the same as indexing a number in a relational database as a String and then doing "like"
queries instead of real numeric comparisons - just wrong and slow.

Uwe

> Thank you to everybody for taking the time to respond.
> 
> 2015-02-10 9:55 GMT+09:00 Gergely Nagy <fogetti@gmail.com>:
> 
> > Thank you Barry, I really appreciate you taking the time to respond.
> >
> > Let me clarify this a little bit more. I think it was not clear.
> >
> > I know how to parse dates, this is not the question here. (See my
> > previous
> > email: "how can I pipe my converter logic into the indexing process?")
> >
> > All of your solutions would work fine if I wanted to index
> > per document, which I do NOT want to do. What I would like to do is
> > index per log line.
> >
> > I need to do a full text search, but with the additional requirement
> > to filter those search hits by DateTime range.
> >
> > I hope this makes it clearer. So any suggestions how to do that?
> >
> > Sidenote: I saw that Alfresco implemented this analyzer, called
> > DateTimeAnalyzer, but Alfresco is not open source. So I was wondering
> > how to implement the same. Actually, after wondering for 2 days, I
> > became convinced that writing an Analyzer should be the way to go. I
> > will post my solution later once I have working code.
> >
> > 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b.coughlan2@gmail.com>:
> >
> >> Hi Gergely,
> >>
> >> Writing an analyzer would work, but it is unnecessarily complicated.
> >> You could just parse the date from the string in your input code and
> >> index it in a LongField, like this:
> >>
> >> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
> >> format.setTimeZone(TimeZone.getTimeZone("UTC"));
> >> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
> >>
> >> Barry
> >>
> >> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <fogetti@gmail.com>
> wrote:
> >>
> >> > Thank you for taking your time to respond, Karthik.
> >> >
> >> > Can you show me an example of how to convert DateTime to milliseconds?
> >> > I mean, how can I pipe my converter logic into the indexing process?
> >> >
> >> > I suspect I need to write my own Analyzer/Tokenizer to achieve this.
> >> > Is this correct?
> >> >
> >> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <nskarthik.k@gmail.com>:
> >> >
> >> > > Hi
> >> > >
> >> > > A long time ago, I used to store datetime in milliseconds.
> >> > >
> >> > > TermRangeQuery used to work perfectly.
> >> > >
> >> > > Convert all datetimes to milliseconds and index them.
> >> > >
> >> > > At search time, again convert the datetime to milliseconds and use
> >> > > TermRangeQuery.
> >> > >
> >> > > With regards
> >> > > Karthik
> >> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <fogetti@gmail.com> wrote:
> >> > >
> >> > > > Hi Lucene users,
> >> > > >
> >> > > > I am in the beginning of implementing a Lucene application which
> >> > > > would supposedly search through some log files.
> >> > > >
> >> > > > One of the requirements is to return results between a time range.
> >> > Let's
> >> > > > say these are two lines in a series of log files:
> >> > > > 2015-02-08 00:02:06.852Z INFO...
> >> > > > ...
> >> > > > 2015-02-08 18:02:04.012Z INFO...
> >> > > >
> >> > > > Now I need to search for these lines and return all the text
> >> > > > in between. I was using this demo application to build an index:
> >> > > >
> >> > > > http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> >> > > >
> >> > > > After that my first thought was using a term range query like this:
> >> > > >
> >> > > > TermRangeQuery query = TermRangeQuery.newStringRange("contents",
> >> > > >     "2015-02-08 00:02:06.852Z", "2015-02-08 18:02:04.012Z", true, true);
> >> > > >
> >> > > > But for some reason this didn't return any results.
> >> > > >
> >> > > > Then I was Googling for a while how to solve this problem, but all
> >> > > > the datetime examples I found are searching based on a much simpler
> >> > > > field. Those examples usually use a field like this:
> >> > > >
> >> > > > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
> >> > > >
> >> > > > So I was wondering, how can I index these log files to make a range
> >> > > > query work on them? Any ideas? Maybe my approach is completely wrong.
> >> > > > I am still new to Lucene so any help is appreciated.
> >> > > >
> >> > > > Thank you.
> >> > > >
> >> > > > Gergely Nagy
> >> > > >
> >> > >
> >> >
> >>
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

