manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: JDBC Connection Exception
Date Wed, 09 May 2012 15:20:24 GMT
FWIW, the ticket is CONNECTORS-96.  I've created a branch to work on
it.  I'll let you know when I think it's ready to try out.

Karl


On Mon, May 7, 2012 at 5:53 AM, Karl Wright <daddywri@gmail.com> wrote:
> Also, there has long been an open ticket to replace the JDBC pool
> driver with something more modern.  Many of the
> off-the-shelf pool drivers are inadequate for various reasons, so I
> have one that I wrote myself, but it is not yet committed.  So I am
> curious - which connections are timing out?  The Oracle connections or
> the PostgreSQL ones?
>
> Karl
>
> On Mon, May 7, 2012 at 5:34 AM, Karl Wright <daddywri@gmail.com> wrote:
>> What database are you using?  (Not the JDBC database, the underlying
>> one...)  If PostgreSQL, what version?  What version of ManifoldCF?  If
>> you could also post some of the long-running queries, that would be
>> good as well.
>>
>> Depending on the database, ManifoldCF periodically
>> re-analyzes/reindexes the underlying database during the crawl, which,
>> when the table is large, can cause some warnings about long-running
>> queries, because during the reindex process the database performance
>> is slowed.  That's not usually a problem, other than briefly slowing
>> the crawl.  However, it's also possible that there's a point where
>> PostgreSQL's plan is poor, and we should see that because the warning
>> also dumps the plan.
>>
>> Truncating the jobqueue table is not recommended, since then
>> ManifoldCF has no idea of what it has crawled and what it hasn't, and
>> its incremental properties tend to suffer.
>>
>> Karl
>>
>>
>> On Mon, May 7, 2012 at 1:25 AM, Michael Le <michael.aaron.le@gmail.com> wrote:
>>> Hello,
>>>
>>> Using a JDBC Repository connection to an Oracle 11g database, I've had
>>> issues where in the initial seeding stage the connection to the database is
>>> closing in the middle of processing the result set.  The original data table
>>> I'm trying to index is about 10 million records, and with the original code,
>>> I could never get past about 750K records.
>>>
>>> I spent some time with the pooling parameters to the bitmechanic database
>>> pooling, but the API and source don't seem to be available any more.  Even
>>> the original author doesn't have the code or specs any more.  The parameter
>>> modifications to the pool allowed me to get through the first stage of
>>> processing a 2M row subset, but during the second stage where it's trying to
>>> obtain the documents, the connections again started being closed.  I ended
>>> up just replacing the connection pool code with an Oracle implementation,
>>> and it's churning through the documents happily.  As a footnote, on my
>>> sample subset of about 400K documents, the throughput went from about 10
>>> docs/s to 19 docs/s, but this may just be a side effect of Oracle
>>> database load or network traffic.
>>>
>>> Has anyone else had issues processing a large Oracle repository?  I've noted
>>> the benchmarks were done with 300K documents, and even in our initial
>>> testing with about 500K documents, no issues arose.
>>>
>>> The second and more pressing issue is the jobqueues table.  In the process
>>> of debugging the database connection issues, jobs were started, stopped,
>>> deleted, aborted, and various WHERE clauses were applied to the seeding
>>> queries/jobs.   MCF is now reporting that there are long-running queries
>>> against this table.  In the past, I've just truncated the jobqueues table,
>>> but this had the side effect of stuffing a document into Solr (output
>>> connector) multiple times.  What API calls or SQL can I run to clean up the
>>> jobqueues table?  Should I just wait for all jobs to finish and then at that
>>> point truncate the table?  I've broken my data into several smaller subsets
>>> of around 1-2 million rows, but that has the side effect of a jobqueues
>>> table that is 6-8 million rows.
>>>
>>> Any support would be greatly appreciated.
>>>
>>> Thanks,
>>> -Michael Le
>>>
>>>
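
For a sense of scale, the throughput figures quoted in the original message (about 10 docs/s before the pool replacement, about 19 docs/s after) can be turned into rough crawl-time estimates for the 10 million record table. This is only a back-of-the-envelope sketch: it assumes the rate measured on the 400K-document sample stays constant over the whole table, which the poster himself cautions may not hold, and the function name `crawl_days` is hypothetical.

```python
def crawl_days(total_docs: int, docs_per_second: float) -> float:
    """Days needed to process total_docs at a constant rate (an assumption)."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_docs / docs_per_second / 86_400  # 86,400 seconds per day


TOTAL_DOCS = 10_000_000  # approximate size of the source table in the thread

# Rates are the approximate figures reported for the 400K-document sample.
for label, rate in [("original pool", 10.0), ("replacement pool", 19.0)]:
    print(f"{label}: {crawl_days(TOTAL_DOCS, rate):.1f} days at {rate:g} docs/s")
```

Under those assumptions the full table would take roughly 11.6 days at 10 docs/s and roughly 6.1 days at 19 docs/s.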
