db-derby-user mailing list archives

From Andrew Bruno <andrew.br...@gmail.com>
Subject Re: SELECT query takes 5 secs, what can I do?
Date Wed, 16 Sep 2009 00:17:52 GMT
Brett and others, great feedback all round. Thanks.

The id != ... (NOT equal) point was an awakening; I didn't think about that.

The looping sounds weird, I agree, but let me explain further.

The application goes through many stages, too many to explain here in
detail.  Since it archives email from an email server, the first thing
it does is index all messages for a user.

When it decides to transfer a message, it checks whether that message
has already been transferred for another user.  If it has, it does not
send the message again; instead it sends a re-index job (our own
technology) to just index the message against this other user.  This
saves storage and bandwidth.  In an organization with many users, when
an email is sent out, that same email sits in many people's mailboxes.
We only want one copy of it archived.
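That duplicate check could be sketched as a simple existence query; this is only a hypothetical shape, using the Message table and columns quoted later in the thread, with parameter markers standing in for the runtime values:

```sql
-- Has this Message-ID already been archived under some other row?
SELECT COUNT(*)
FROM Message m
WHERE m.messageId = ?   -- RFC Message-ID of the mail being transferred
  AND m.id <> ?         -- exclude the current user's own row
  AND m.archived = 1    -- only count copies already stored in the cloud
```

A nonzero count would mean the message body is already archived and only a re-index job is needed.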

So, what I am thinking of doing is to query the table on messageId
only, and then filter the results myself.


select * from Message m where m.messageId = ?

All I care about is getting the first message that does not have the
same id and has already been archived.  The rest are ignored.

i.e. and m.id != 933927 and m.archived=1

Once I find it in the loop, this is the one I reference for the re-indexing.
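If only the first qualifying row matters, the filtering could also stay in SQL rather than the loop. A hedged sketch, assuming Derby 10.5 or later for the FETCH FIRST clause (parameter markers stand in for the runtime values):

```sql
-- First already-archived copy of the same Message-ID on a different row.
SELECT *
FROM Message m
WHERE m.messageId = ?
  AND m.id <> ?
  AND m.archived = 1
FETCH FIRST 1 ROWS ONLY
```

This lets the database stop scanning as soon as one match is found, instead of materializing every duplicate and discarding all but the first in application code.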

What do you think?

On Wed, Sep 16, 2009 at 1:32 AM, Brett Wooldridge
<brett.wooldridge@gmail.com> wrote:
> By the way, it was Hibernate that led the original developer to be looping
> over a large collection and executing a delete of each object.  Hibernate
> can be very handy, but even Gavin and the other Hibernate guys will tell you
> it is not designed for (and is not a substitute for) bulk operations.  We
> had to "go around" Hibernate in that case and several others to execute true
> bulk operations against the database.  Use Hibernate where it is useful, but
> go to the bare metal when you need performance.
> Brett
> On Wed, Sep 16, 2009 at 12:16 AM, Brett Wooldridge
> <brett.wooldridge@gmail.com> wrote:
>> Because the query specified an id != ... (NOT equal), your specificity
>> will be extremely low (almost every row meets that criteria), therefore
>> adding id to the index will do little but increase the overhead of the
>> index.  Similarly, it seems the archived flag would offer little in the way
>> of narrowing the result set -- unless the number of archived items is
>> extremely small compared to the total row count.
>> But you said something interesting: you said it is "looping through all
>> the messages".  That sounds like more of an algorithmic issue than a DB
>> one.  For example, rather than looping and firing off 10000 select
>> statements at 5 seconds each -- processing each message -- is there a way to
>> devise 1 select (or some similarly small number) that returns 10000 rows that
>> you can cursor through and process?
>> Anytime I see a pattern like that -- looping over some X programmatically
>> and running a query for each -- it is a red flag.  In the past I've
>> converted similar onesy-twosy patterns to bulk operations with generally one
>> or two orders of magnitude improvement.  I worked on a project where a
>> "clean-up" job was deleting data one row at a time based on some algorithm,
>> but doing it for thousands of rows.  It was taking hours.  After some
>> thinking and creative querying (and sub-querying) we were able to generate a
>> bulk delete that took less than two minutes.  It involved creating a
>> temporary table (this was mysql) with ids to be deleted (populated by a few
>> choice queries) and then doing a bulk delete based on a join with that
>> table.
>> Anyway, the point is, rather than trying to optimize a query that is run
>> thousands of times, try to optimize the algorithm so that it doesn't need to
>> do that.
>> Just my thoughts.
>> -Brett
>> On Tue, Sep 15, 2009 at 10:02 PM, Andrew Bruno <andrew.bruno@gmail.com>
>> wrote:
>>> Thank you all for the responses.
>>> The db runs as part of an application that sits on a Windows box, and
>>> we have many installed versions around the world, so we don't control
>>> hardware, etc.  It only has a few db connections, and at the moment this
>>> particular instance is looping through all the messages, and seeing whether
>>> they have been archived in the cloud yet... And as you can imagine, at
>>> 5 secs a message it's taking a long time.
>>> The message table has a bunch of other columns, and since I am using
>>> Hibernate I need all columns... But I wonder if it would be quicker to
>>> select just the id column, and then load/select all columns using where
>>> id=...  Sounds crazy, kinda like 1+N selects... But any thoughts?
>>> As for the compound index, I thought it was not possible to include the
>>> pk column, i.e. id, as part of a compound index, but either way I will try.
>>> Also, if I do a count(messageid) instead, are there any other
>>> optimization tricks?
>>> Keep all thoughts coming, crazy or sound :)
>>> Appreciated
>>> Andrew
>>> -----Original Message-----
>>> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
>>> Sent: Tuesday, 15 September 2009 10:32 PM
>>> To: Derby Discussion <derby-user@db.apache.org>
>>> Subject: Re: SELECT query takes 5 secs, what can I do?
>>> Hi Andrew,
>>> You might try adding more columns to your index so that it covers the
>>> whole WHERE clause:
>>> CREATE INDEX IDX_Message_MessageId ON ExchangeSync.Message (messageId,
>>> archived, id)
>>> Hope this helps,
>>> -Rick
>>> Andrew Bruno wrote:
>>> > Hello,
>>> >
>>> > I have a query that used to take 10secs to run, i.e.
>>> >
>>> > select * from Message m where
>>> > m.messageId='<7997716ED1AF3D47A35D74FA2CB610920255303F@somedomain.com>'
>>> > and m.id != 933927 and m.archived=1
>>> >
>>> > The Message table has around one million rows.
>>> >
>>> > I added the following index
>>> >
>>> > CREATE INDEX IDX_Message_MessageId ON ExchangeSync.Message (messageId)
>>> >
>>> > and now it takes 5secs.
>>> >
>>> > Is there anything else I can do?
>>> >
>>> > Should I add an index on the boolean "archived" column too?
>>> >
>>> > Any performance hints appreciated.
>>> >
>>> > Thanks
>>> > Andrew
>>> >
