manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Performance Problems Continue
Date Tue, 16 Nov 2010 21:23:14 GMT
I'm concerned that if I install postgresql 8.3.12, and do not
reproduce your problem, we'll have learned nothing.
I'll look around to see if I have the 8.3.9 installer still around...

Karl


On Tue, Nov 16, 2010 at 2:37 PM,  <hokie@farzad.net> wrote:
> Can't find the windows binaries on the web.  I found this link, but it
> keeps saying file not found regardless of the mirror server I use.
>
> http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/postgresql-8.3.9-1.zip
>
> Seems like the fastest path would be for you to go up to .12, especially
> that is the version anyone can get their hands on.  Do you have the
> windows binaries or installer?
>
> Thanks,Farzad.
>
>
> Hi Farzad,
> The index it should be using but isn't is:
>
> public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE
> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>
> Both the dochash value and the jobid value are specified for the
> jobqueue table in that query, so there's no reason it should not use
> this index.  It's either failed to consider it at all, or it has
> concluded that it's more efficient to scan the entire jobqueue table
> rather than using the index for the lookup.
>
> FWIW, I did run some explains on similar queries on my test system
> this morning before I left for work, and the plans for those involved
> something called an "index semi-join" which apparently means it scans
> both the jobqueue index and the carrydown index at the same time.  If
> I recall correctly, my version of PostgreSQL was 8.3.9.  Since yours
> is 8.3.12 it could well be that something was broken in PostgreSQL
> between releases.  If you have time, you might try to see if an
> earlier version of postgresql exhibits the same behavior for you.  If
> that works properly, the next step would be to create a ticket against
> PostgreSQL itself - or maybe there already is one.
>
> I expect that any of the many PostgreSQL books out there would have
> enough information in it to bring you up to speed.  The skill set for
> optimizing queries is also portable to other databases such as Oracle,
> if you care.
>
> Karl
>
>
> On Tue, Nov 16, 2010 at 12:03 PM,  <hokie@farzad.net> wrote:
>> Found a query statement to pull the index data.  Is this what you are
>> looking for?  On a separate note, I'm a bit over my head in terms of
>> understanding, any recommendation of a site/sites or book to read to
>> catch
>> up with needed db concepts?
>>
>> dbname=> select * from pg_indexes where tablename='jobqueue';
>>  schemaname | tablename |   indexname    | tablespace |
>>              indexdef
>> ------------+-----------+----------------+------------+-------------------------
>> -------------------------------------------------------------
>>  public     | jobqueue  | jobqueue_pkey  |            | CREATE UNIQUE
>> INDEX jobq
>> ueue_pkey ON jobqueue USING btree (id)
>>  public     | jobqueue  | i1289847375565 |            | CREATE INDEX
>> i1289847375
>> 565 ON jobqueue USING btree (jobid, status)
>>  public     | jobqueue  | i1289847375564 |            | CREATE INDEX
>> i1289847375
>> 564 ON jobqueue USING btree (jobid, isseed)
>>  public     | jobqueue  | i1289847375563 |            | CREATE INDEX
>> i1289847375
>> 563 ON jobqueue USING btree (jobid, dochash, status)
>>  public     | jobqueue  | i1289847375562 |            | CREATE INDEX
>> i1289847375
>> 562 ON jobqueue USING btree (status)
>>  public     | jobqueue  | i1289847375561 |            | CREATE INDEX
>> i1289847375
>> 561 ON jobqueue USING btree (checkaction, checktime, status)
>>  public     | jobqueue  | i1289847375560 |            | CREATE INDEX
>> i1289847375
>> 560 ON jobqueue USING btree (priorityset, status)
>>  public     | jobqueue  | i1289847375559 |            | CREATE INDEX
>> i1289847375
>> 559 ON jobqueue USING btree (docpriority)
>>  public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE
>> INDEX i128
>> 9847375558 ON jobqueue USING btree (dochash, jobid)
>> (9 rows)
>>
>> dbname=> select * from pg_indexes where tablename='carrydown';
>>  schemaname | tablename |   indexname    | tablespace |
>>                               indexdef
>>
>> ------------+-----------+----------------+------------+-------------------------
>> --------------------------------------------------------------------------------
>> ----------------
>>  public     | carrydown | i1289847375547 |            | CREATE INDEX
>> i1289847375
>> 547 ON carrydown USING btree (jobid, parentidhash)
>>  public     | carrydown | i1289847375546 |            | CREATE INDEX
>> i1289847375
>> 546 ON carrydown USING btree (jobid, childidhash, dataname)
>>  public     | carrydown | i1289847375545 |            | CREATE INDEX
>> i1289847375
>> 545 ON carrydown USING btree (jobid, childidhash, isnew)
>>  public     | carrydown | i1289847375544 |            | CREATE UNIQUE
>> INDEX i128
>> 9847375544 ON carrydown USING btree (jobid, parentidhash, childidhash,
>> dataname,
>>  datavaluehash)
>> (4 rows)
>>
>> dbname=>
>>
>> Farzad.
>>
>>
>> psql usually displays the indexes for a table when you dt the table.
>> These are not appearing in your postgresql version.  But it's clear
>> they are there from the plan.
>>
>> The jobqueue table is being sequentially scanned *despite* the
>> probable fact that there's a reasonable, usable index on it that it
>> should be using.  In short, this looks like a PostgreSQL bug to me.
>>
>> I'm away from my test machine right now, but I'd love to see the
>> indexes on the jobqueue table that you have.  There may be a new
>> command in psql that my version of PostgreSQL doesn't have that prints
>> the indexes for a table, if they took it out of dt.
>>
>> If the proper index is there, then all we can do is try various forms
>> of the query to see if we can dodge the PostgreSQL bug.
>>
>> Karl
>>
>>
>>
>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <hokie@farzad.net> wrote:
>>> Here is the output:
>>>
>>>
>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf dbname
>>> Password for user manifoldcf:
>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal.
>>>
>>> Type:  \copyright for distribution terms
>>>       \h for help with SQL commands
>>>       \? for help with psql commands
>>>       \g or terminate with semicolon to execute query
>>>       \q to quit
>>>
>>> Warning: Console code page (437) differs from Windows code page (1252)
>>>         8-bit characters might not work correctly. See psql reference
>>>         page "Notes for Windows users" for details.
>>>
>>> dbname=> \dt jobqueue;
>>>           List of relations
>>>  Schema |   Name   | Type  |   Owner
>>> --------+----------+-------+------------
>>>  public | jobqueue | table | manifoldcf
>>> (1 row)
>>>
>>> dbname=> \dt carrydown;
>>>            List of relations
>>>  Schema |   Name    | Type  |   Owner
>>> --------+-----------+-------+------------
>>>  public | carrydown | table | manifoldcf
>>> (1 row)
>>>
>>> dbname=> select count(*) from carrydown;
>>>  count
>>> -------
>>>     0
>>> (1 row)
>>>
>>> dbname=> select distinct jobid from carrydown;
>>>  jobid
>>> -------
>>> (0 rows)
>>>
>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>> WHERE
>>> EXISTS(
>>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>> t1.childidhash=t0.doch
>>> ash AND t0.jobid=1289847478234 AND
>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>
>>>     QUERY PLAN
>>>
>>> ----------------------------------------------------------------------------
>>> ----
>>> ----------------------------------------------------------------------------
>>> ----
>>> --------------------
>>>  Seq Scan on jobqueue t0  (cost=0.00..2000133.07 rows=118994 width=158)
>>>   Filter: (subplan)
>>>   SubPlan
>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>> (cost=0.00..8.28
>>>  rows=1 width=0)
>>>                 Index Cond: ((jobid = 1289847478234::bigint) AND
>>> ((parentidhash
>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>> ((childidhash)::
>>> text = ($0)::text))
>>>                 Filter: (isnew = 'B'::bpchar)
>>> (8 rows)
>>>
>>> dbname=> analyze carrydown;
>>> ANALYZE
>>> dbname=> analyze jobqueue;
>>> ANALYZE
>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>> WHERE
>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>
>>>     QUERY PLAN
>>>
>>> ----------------------------------------------------------------------------
>>> ----
>>> ----------------------------------------------------------------------------
>>> ----
>>> --------------------
>>>  Seq Scan on jobqueue t0  (cost=0.00..2091946.86 rows=124532 width=158)
>>>   Filter: (subplan)
>>>   SubPlan
>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>> (cost=0.00..8.28
>>>  rows=1 width=0)
>>>                 Index Cond: ((jobid = 1289847478234::bigint) AND
>>> ((parentidhash
>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>> ((childidhash)::
>>> text = ($0)::text))
>>>                 Filter: (isnew = 'B'::bpchar)
>>> (8 rows)
>>>
>>> dbname=>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>> Sent: Tuesday, November 16, 2010 4:21 AM
>>> To: connectors-user@incubator.apache.org
>>> Subject: Re: Performance Problems Continue
>>>
>>> As I suspected, the plan it is generating is crap:
>>>
>>> Found a query that took more than a minute (62494 ms): [SELECT
>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x'
>>> FROM carrydown t1 WHERE parentidhash IN (?) AND
>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND
>>> t1.isnew=?)]
>>>  Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F'
>>>  Parameter 1: '1289847478234'
>>>  Parameter 2: '1289847478234'
>>>  Parameter 3: 'B'
>>>  Plan: Seq Scan on jobqueue t0  (cost=0.00..1332448.02 rows=79465
>>> width=152)
>>>  Plan:   Filter: (subplan)
>>>  Plan:   SubPlan
>>>  Plan:     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>  Plan:           One-Time Filter: ($1 = 1289847478234::bigint)
>>>  Plan:           ->  Index Scan using i1289847375544 on carrydown t1
>>> (cost=0.00..8.28 rows=1 width=0)
>>>  Plan:                 Index Cond: ((jobid = 1289847478234::bigint)
>>> AND ((parentidhash)::text =
>>> 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>> ((childidhash)::text = ($0)::text))
>>>  Plan:                 Filter: (isnew = 'B'::bpchar)
>>>
>>>
>>>
>>> ... which means that it is doing a sequential scan on the jobqueue
>>> table (large) and an index scan for each row on the carrydown table
>>> (which should be empty), rather than the other way round.  There is a
>>> carrydown index that is should be using instead, which it is
>>> completely ignoring, which is all the more weird.
>>>
>>> Can you do the following:
>>>
>>> (1) Under the postgresql bin directory, run psql -U manifoldcf dbname
>>> (2) If the defaults are in place, password should be "local_pg_passwd"
>>> (3) Run the following commands, and cut/paste the answers for me:
>>>
>>> \dt jobqueue;
>>> \dt carrydown;
>>> select count(*) from carrydown;
>>> select distinct jobid from carrydown;
>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>
>>> Once you have done that, please do the following:
>>>
>>> analyze carrydown;
>>> analyze jobqueue;
>>>
>>> Then, try the explain again. and send me that output too:
>>>
>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>
>>>
>>> I've tried the same commands on my system, but they show reasonable
>>> plans.  PostgreSQL, like all databases, develops its plans based on
>>> statistics it keeps for the tables in question.  The analyze command
>>> tells it to update those statistics, which in theory should make
>>> PostgreSQL immediately thereafter use a correct plan.  However, there
>>> may be great sensitivity to incorrect statistics for some sorts of
>>> query, depending on the conditions, so a small number of accumulated
>>> changes can cause it to start to make huge errors.  I am trying to
>>> determine if that's what is happening, and what those conditions are
>>> in your case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <hokie@farzad.net> wrote:
>>>> 1. Using 8.3.12, gave up on 9.x
>>>> 2. Yes, I do.  Attached the log file for this crawl
>>>> 3. Using Null Output and File System input, not using RSS
>>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>

Mime
View raw message