lucene-solr-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: Solr -indexing from csv file having 28 cols taking lot of time ..plz help i m new to solr
Date Fri, 03 Apr 2015 19:23:36 GMT
avinash09 <avinash.it09@gmail.com> wrote:
> regex="^(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),
> (.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)$"

A better solution seems to have been presented, but for the record I would like to note that
the regexp above is quite an effective performance bomb: for each added group, the evaluation
time roughly doubles. That is not a problem for 10 groups, but you have 28.

I made a little test: matching a single sample line with 20 groups took 120 ms/match, 24
groups took 2 seconds and 28 groups took 30 seconds on my machine. Extrapolating that doubling,
a single match with 50 groups would take about 4 years.
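
The 4-year figure is simply the doubling carried forward from the 28-group measurement. A
minimal sketch of that arithmetic, assuming an exact doubling per added group (the class and
method names are made up for illustration):

```java
class RegexExtrapolation {
    // Estimated seconds per match, assuming the time doubles with every added
    // group, starting from the 30 seconds measured at 28 groups.
    static double estimatedSeconds(int groups) {
        return 30.0 * Math.pow(2, groups - 28);
    }

    public static void main(String[] args) {
        double secondsPerYear = 365.25 * 24 * 3600;
        double years = estimatedSeconds(50) / secondsPerYear;
        System.out.printf("50 groups: roughly %.1f years per match%n", years);
    }
}
```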

The explanation is that quantifiers such as * are greedy by default in Java regexps: Every one
of your (.*) groups starts by matching to the end of the line, then a comma is reached in the
regexp and it has to backtrack. The solution is fortunately both simple and applicable to many
other regexps: Make your matches terminate as soon as possible.

In this case, replace the (.*) groups with ([^,]*), which means that each group matches any
run of characters that are not commas. The combined regexp then looks like this:
regex="^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),...([^,]*)$"

The match speed for 28 groups with that regexp was about 0.002 ms (average over 1000 matches).
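
As a rough sketch of the fix in Java (this is not the test code behind the numbers above; the
class name, the helper names and the sample line are made up for illustration), the pattern
can be built programmatically instead of typing out 28 groups by hand:

```java
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexSketch {
    // Build "^G,G,...,G$" with `groups` copies of the given group expression.
    static Pattern build(String group, int groups) {
        return Pattern.compile("^" + String.join(",", Collections.nCopies(groups, group)) + "$");
    }

    // Return the captured fields, or null if the line does not match.
    static String[] fields(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1); // capture groups are numbered from 1
        }
        return out;
    }

    public static void main(String[] args) {
        int groups = 28;
        // Hypothetical sample line with exactly 28 fields: f1,f2,...,f28
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= groups; i++) {
            if (i > 1) sb.append(',');
            sb.append('f').append(i);
        }
        String line = sb.toString();

        // Each group stops at the next comma, so there is nothing to backtrack over.
        Pattern fast = build("([^,]*)", groups);
        long start = System.nanoTime();
        String[] f = fields(fast, line);
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println(f.length + " fields, last = " + f[groups - 1]
                + ", took " + micros + " us");
    }
}
```

Passing "(.*)" as the group expression to the same helper reproduces the slow pattern, so it is
best tried with a small group count first.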

- Toke Eskildsen
