manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Performance Problems Continue
Date Wed, 17 Nov 2010 02:43:28 GMT
I've done a number of runs, added documents willy-nilly, and brought
the document count up to about 50,000.  I am still seeing reasonable
plans for this query:

dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(
SELECT 'x' FROM carrydown t1 WHERE parentidhash IN ('1C33AEE63094C42C9537A9D94E8
CB2B903764190') AND t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND t1.
jobid=1289847478234 AND t1.isnew='B');
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..16.63 rows=1 width=127)
   Join Filter: ((t0.dochash)::text = (t1.childidhash)::text)
   ->  Index Scan using i1289954442565 on jobqueue t0
(cost=0.00..8.35 rows=1 width=127)
         Index Cond: (jobid = 1289847478234::bigint)
   ->  Index Scan using i1289954442546 on carrydown t1
(cost=0.00..8.27 rows=1 width=98)
         Index Cond: ((t1.jobid = 1289847478234::bigint) AND
((t1.parentidhash)::text =
'1C33AEE63094C42C9537A9D94E8CB2B903764190'::text))
         Filter: (t1.isnew = 'B'::bpchar)
(7 rows)


So I suggest you upgrade to 8.4.4, and let me know what happens then.

Thanks,
Karl


On Tue, Nov 16, 2010 at 8:15 PM, Farzad Valad <hokie@farzad.net> wrote:
> Sounds perfect!
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Tuesday, November 16, 2010 5:12 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: Performance Problems Continue
>
> Turns out my memory is incorrect - the postgresql version that I've
> been using is 8.4.4, not 8.3.9.
> So there are a lot of possibilities here.
>
> What I'm going to try is to install the latest 8.4.5 on a different
> system, and do a large crawl.  Then I'll see if the plan is OK.  If it
> is OK, then I'll ask you to upgrade to that version as well, and see
> how you do.
>
> Does this sound reasonable?
> Karl
>
>
> On Tue, Nov 16, 2010 at 4:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>> I'm concerned that if I install postgresql 8.3.12, and do not
>> reproduce your problem, we'll have learned nothing.
>> I'll look around to see if I have the 8.3.9 installer still around...
>>
>> Karl
>>
>>
>> On Tue, Nov 16, 2010 at 2:37 PM,  <hokie@farzad.net> wrote:
>>> Can't find the windows binaries on the web.  I found this link, but it
>>> keeps saying file not found regardless of the mirror server I use.
>>>
>>>
> http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/pos
> tgresql-8.3.9-1.zip
>>>
>>> Seems like the fastest path would be for you to go up to .12, especially
>>> that is the version anyone can get their hands on.  Do you have the
>>> windows binaries or installer?
>>>
>>> Thanks,Farzad.
>>>
>>>
>>> Hi Farzad,
>>> The index it should be using but isn't is:
>>>
>>> public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE
>>> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>>>
>>> Both the dochash value and the jobid value are specified for the
>>> jobqueue table in that query, so there's no reason it should not use
>>> this index.  It's either failed to consider it at all, or it has
>>> concluded that it's more efficient to scan the entire jobqueue table
>>> rather than using the index for the lookup.
>>>
>>> FWIW, I did run some explains on similar queries on my test system
>>> this morning before I left for work, and the plans for those involved
>>> something called an "index semi-join" which apparently means it scans
>>> both the jobqueue index and the carrydown index at the same time.  If
>>> I recall correctly, my version of PostgreSQL was 8.3.9.  Since yours
>>> is 8.3.12 it could well be that something was broken in PostgreSQL
>>> between releases.  If you have time, you might try to see if an
>>> earlier version of postgresql exhibits the same behavior for you.  If
>>> that works properly, the next step would be to create a ticket against
>>> PostgreSQL itself - or maybe there already is one.
>>>
>>> I expect that any of the many PostgreSQL books out there would have
>>> enough information in it to bring you up to speed.  The skill set for
>>> optimizing queries is also portable to other databases such as Oracle,
>>> if you care.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Nov 16, 2010 at 12:03 PM,  <hokie@farzad.net> wrote:
>>>> Found a query statement to pull the index data.  Is this what you are
>>>> looking for?  On a separate note, I'm a bit over my head in terms of
>>>> understanding, any recommendation of a site/sites or book to read to
>>>> catch
>>>> up with needed db concepts?
>>>>
>>>> dbname=> select * from pg_indexes where tablename='jobqueue';
>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>              indexdef
>>>>
> ------------+-----------+----------------+------------+---------------------
> ----
>>>> -------------------------------------------------------------
>>>>  public     | jobqueue  | jobqueue_pkey  |            | CREATE
UNIQUE
>>>> INDEX jobq
>>>> ueue_pkey ON jobqueue USING btree (id)
>>>>  public     | jobqueue  | i1289847375565 |            | CREATE
INDEX
>>>> i1289847375
>>>> 565 ON jobqueue USING btree (jobid, status)
>>>>  public     | jobqueue  | i1289847375564 |            | CREATE
INDEX
>>>> i1289847375
>>>> 564 ON jobqueue USING btree (jobid, isseed)
>>>>  public     | jobqueue  | i1289847375563 |            | CREATE
INDEX
>>>> i1289847375
>>>> 563 ON jobqueue USING btree (jobid, dochash, status)
>>>>  public     | jobqueue  | i1289847375562 |            | CREATE
INDEX
>>>> i1289847375
>>>> 562 ON jobqueue USING btree (status)
>>>>  public     | jobqueue  | i1289847375561 |            | CREATE
INDEX
>>>> i1289847375
>>>> 561 ON jobqueue USING btree (checkaction, checktime, status)
>>>>  public     | jobqueue  | i1289847375560 |            | CREATE
INDEX
>>>> i1289847375
>>>> 560 ON jobqueue USING btree (priorityset, status)
>>>>  public     | jobqueue  | i1289847375559 |            | CREATE
INDEX
>>>> i1289847375
>>>> 559 ON jobqueue USING btree (docpriority)
>>>>  public     | jobqueue  | i1289847375558 |            | CREATE
UNIQUE
>>>> INDEX i128
>>>> 9847375558 ON jobqueue USING btree (dochash, jobid)
>>>> (9 rows)
>>>>
>>>> dbname=> select * from pg_indexes where tablename='carrydown';
>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>                               indexdef
>>>>
>>>>
> ------------+-----------+----------------+------------+---------------------
> ----
>>>>
> ----------------------------------------------------------------------------
> ----
>>>> ----------------
>>>>  public     | carrydown | i1289847375547 |            | CREATE INDEX
>>>> i1289847375
>>>> 547 ON carrydown USING btree (jobid, parentidhash)
>>>>  public     | carrydown | i1289847375546 |            | CREATE INDEX
>>>> i1289847375
>>>> 546 ON carrydown USING btree (jobid, childidhash, dataname)
>>>>  public     | carrydown | i1289847375545 |            | CREATE INDEX
>>>> i1289847375
>>>> 545 ON carrydown USING btree (jobid, childidhash, isnew)
>>>>  public     | carrydown | i1289847375544 |            | CREATE UNIQUE
>>>> INDEX i128
>>>> 9847375544 ON carrydown USING btree (jobid, parentidhash, childidhash,
>>>> dataname,
>>>>  datavaluehash)
>>>> (4 rows)
>>>>
>>>> dbname=>
>>>>
>>>> Farzad.
>>>>
>>>>
>>>> psql usually displays the indexes for a table when you dt the table.
>>>> These are not appearing in your postgresql version.  But it's clear
>>>> they are there from the plan.
>>>>
>>>> The jobqueue table is being sequentially scanned *despite* the
>>>> probable fact that there's a reasonable, usable index on it that it
>>>> should be using.  In short, this looks like a PostgreSQL bug to me.
>>>>
>>>> I'm away from my test machine right now, but I'd love to see the
>>>> indexes on the jobqueue table that you have.  There may be a new
>>>> command in psql that my version of PostgreSQL doesn't have that prints
>>>> the indexes for a table, if they took it out of dt.
>>>>
>>>> If the proper index is there, then all we can do is try various forms
>>>> of the query to see if we can dodge the PostgreSQL bug.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <hokie@farzad.net> wrote:
>>>>> Here is the output:
>>>>>
>>>>>
>>>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf dbname
>>>>> Password for user manifoldcf:
>>>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal.
>>>>>
>>>>> Type:  \copyright for distribution terms
>>>>>       \h for help with SQL commands
>>>>>       \? for help with psql commands
>>>>>       \g or terminate with semicolon to execute query
>>>>>       \q to quit
>>>>>
>>>>> Warning: Console code page (437) differs from Windows code page (1252)
>>>>>         8-bit characters might not work correctly. See psql reference
>>>>>         page "Notes for Windows users" for details.
>>>>>
>>>>> dbname=> \dt jobqueue;
>>>>>           List of relations
>>>>>  Schema |   Name   | Type  |   Owner
>>>>> --------+----------+-------+------------
>>>>>  public | jobqueue | table | manifoldcf
>>>>> (1 row)
>>>>>
>>>>> dbname=> \dt carrydown;
>>>>>            List of relations
>>>>>  Schema |   Name    | Type  |   Owner
>>>>> --------+-----------+-------+------------
>>>>>  public | carrydown | table | manifoldcf
>>>>> (1 row)
>>>>>
>>>>> dbname=> select count(*) from carrydown;
>>>>>  count
>>>>> -------
>>>>>     0
>>>>> (1 row)
>>>>>
>>>>> dbname=> select distinct jobid from carrydown;
>>>>>  jobid
>>>>> -------
>>>>> (0 rows)
>>>>>
>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>> WHERE
>>>>> EXISTS(
>>>>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.doch
>>>>> ash AND t0.jobid=1289847478234 AND
>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>     QUERY PLAN
>>>>>
>>>>>
> ----------------------------------------------------------------------------
>>>>> ----
>>>>>
> ----------------------------------------------------------------------------
>>>>> ----
>>>>> --------------------
>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2000133.07 rows=118994 width=158)
>>>>>   Filter: (subplan)
>>>>>   SubPlan
>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>>>> (cost=0.00..8.28
>>>>>  rows=1 width=0)
>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>> ((parentidhash
>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::
>>>>> text = ($0)::text))
>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>> (8 rows)
>>>>>
>>>>> dbname=> analyze carrydown;
>>>>> ANALYZE
>>>>> dbname=> analyze jobqueue;
>>>>> ANALYZE
>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>> WHERE
>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>     QUERY PLAN
>>>>>
>>>>>
> ----------------------------------------------------------------------------
>>>>> ----
>>>>>
> ----------------------------------------------------------------------------
>>>>> ----
>>>>> --------------------
>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2091946.86 rows=124532 width=158)
>>>>>   Filter: (subplan)
>>>>>   SubPlan
>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>>>> (cost=0.00..8.28
>>>>>  rows=1 width=0)
>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>> ((parentidhash
>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::
>>>>> text = ($0)::text))
>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>> (8 rows)
>>>>>
>>>>> dbname=>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>>>> Sent: Tuesday, November 16, 2010 4:21 AM
>>>>> To: connectors-user@incubator.apache.org
>>>>> Subject: Re: Performance Problems Continue
>>>>>
>>>>> As I suspected, the plan it is generating is crap:
>>>>>
>>>>> Found a query that took more than a minute (62494 ms): [SELECT
>>>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x'
>>>>> FROM carrydown t1 WHERE parentidhash IN (?) AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND
>>>>> t1.isnew=?)]
>>>>>  Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F'
>>>>>  Parameter 1: '1289847478234'
>>>>>  Parameter 2: '1289847478234'
>>>>>  Parameter 3: 'B'
>>>>>  Plan: Seq Scan on jobqueue t0  (cost=0.00..1332448.02 rows=79465
>>>>> width=152)
>>>>>  Plan:   Filter: (subplan)
>>>>>  Plan:   SubPlan
>>>>>  Plan:     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>  Plan:           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>  Plan:           ->  Index Scan using i1289847375544 on carrydown
t1
>>>>> (cost=0.00..8.28 rows=1 width=0)
>>>>>  Plan:                 Index Cond: ((jobid = 1289847478234::bigint)
>>>>> AND ((parentidhash)::text =
>>>>> 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::text = ($0)::text))
>>>>>  Plan:                 Filter: (isnew = 'B'::bpchar)
>>>>>
>>>>>
>>>>>
>>>>> ... which means that it is doing a sequential scan on the jobqueue
>>>>> table (large) and an index scan for each row on the carrydown table
>>>>> (which should be empty), rather than the other way round.  There is
a
>>>>> carrydown index that is should be using instead, which it is
>>>>> completely ignoring, which is all the more weird.
>>>>>
>>>>> Can you do the following:
>>>>>
>>>>> (1) Under the postgresql bin directory, run psql -U manifoldcf dbname
>>>>> (2) If the defaults are in place, password should be "local_pg_passwd"
>>>>> (3) Run the following commands, and cut/paste the answers for me:
>>>>>
>>>>> \dt jobqueue;
>>>>> \dt carrydown;
>>>>> select count(*) from carrydown;
>>>>> select distinct jobid from carrydown;
>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>> Once you have done that, please do the following:
>>>>>
>>>>> analyze carrydown;
>>>>> analyze jobqueue;
>>>>>
>>>>> Then, try the explain again. and send me that output too:
>>>>>
>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>
>>>>> I've tried the same commands on my system, but they show reasonable
>>>>> plans.  PostgreSQL, like all databases, develops its plans based on
>>>>> statistics it keeps for the tables in question.  The analyze command
>>>>> tells it to update those statistics, which in theory should make
>>>>> PostgreSQL immediately thereafter use a correct plan.  However, there
>>>>> may be great sensitivity to incorrect statistics for some sorts of
>>>>> query, depending on the conditions, so a small number of accumulated
>>>>> changes can cause it to start to make huge errors.  I am trying to
>>>>> determine if that's what is happening, and what those conditions are
>>>>> in your case.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <hokie@farzad.net>
wrote:
>>>>>> 1. Using 8.3.12, gave up on 9.x
>>>>>> 2. Yes, I do.  Attached the log file for this crawl
>>>>>> 3. Using Null Output and File System input, not using RSS
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>

Mime
View raw message