db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Doran <matt.do...@papercut.com>
Subject Advice on *very* badly performing query (with reproduction recipe)
Date Mon, 03 Dec 2007 04:52:01 GMT
Hi there,

We use Apache Derby in our commercial application, PaperCut NG 
<http://www.papercut.com/>.  It's proven to be very reliable, however we 
occasionally get reports of very bad performance in some areas.  We 
haven't had the time to investigate them fully previously (usually 
upgrading to an external DB like Postgres or SQL Server fixes the 
issue).  This time we had a look in more detail with a recent report, 
and we've found some very strange performance characteristics ... and 
would love some advice and assistance.

We have a query that is doing inner joins to 5 tables.  It's quite a 
simple query, but the core table has about 300,000 rows, and where 
limiting the results based on a date in that table that is indexed.  
Here's a summary of my situation/findings:

    * Using the latest Derby release, with a Java 1.5 VM on
    * We only have a single WHERE clause, which is on the indexed date
      field is restricting the data such that no data is returned.  e.g.
      log_date > (latest log date).  So derby should quickly detect
      there is litte/no data to return.
    * Running the original query takes 22 minutes running 100% CPU.
    * Running a count(*) for the same query is quick (< 1 sec).
    * Removing the ORDER BY and changing the select list to just include
      a single field from each table and it still takes 22 minutes.
    * Changing the select list to retrieve only a single field from 2 of
      the table and it still takes 22 minutes (I have a log of the query
      and the runtime stats for this attached "derby-slow.log").
    * Changing the select list to a single field from 1 of the tables
      makes the query run fast - less than a second. (I have a log of
      the query and the runtime stats for this attached "derby-fast.log").
    * Running the original query on the same dataset in PostgreSQL or
      SQL Server is very fast (less than a second).  This is why we
      often recommend customers upsize to Postgres or SQL Server.
    * Also the SQL is generated via Hibernate ORM, so we have some
      limitations in how we can modify the SQL.

 From the query plan it seems that seems that it stops using the date 
index on the "tbl_printer_usage_log" log table, and changes from Hash 
joins to Nested Loop joins.  On a large table like this when providing a 
where clause that on a field that is indexed .... we have to ensure that 
derby uses the index.

If I increase the pageCacheSize to 100,000 pages, it reduces the time of 
the query to about 2-3 minutes, but it's still very slow compared to 
when the correct index is used.

Can anyone please help me understand the following:

    * Why does the query plan change dramatically, just by changing the
      fields that are retrieved?
    * Why is derby avoiding the most obvious index?  The date field in
      the 300,000 row table (the date field is the only field in our
      where clause).
    * Is there anyway to avoid this behavior?

If we can understand what's causing this, we'll be able to make a much 
more effective use of Derby.  At the moment, on customers with large 
datasets, we currently just recommend they "upsize" to Postgres or SQL 
Server and the problem goes away.  However, we'd much prefer to fix this 
and have our Derby database behave better.

I'd be happy to provide the derby database that exhibits these problems 
if someone would like to see what's going on.  The database is from a 
customer, so I don't want to post it publicly, but if you send me an 
email off-list I'd be happy to provide it.


Matt Doran
PaperCut Software International Pty. Ltd.
Phone:   +61 (3) 9807 5767
E-mail:  matt.doran@papercut.com
Profile: http://www.papercut.com/about/#matt
Blog:    http://www.papercut.com/blog/

View raw message