manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: JDBC Connection Exception
Date Mon, 07 May 2012 09:53:14 GMT
Also, there has been a long-running ticket to replace the JDBC pool
driver with something more modern.  Many of the off-the-shelf pool
drivers are inadequate for various reasons, so I have one that I wrote
myself, but it is not yet committed.  So I am curious: which
connections are timing out?  The Oracle connections or the PostgreSQL
ones?

Karl

On Mon, May 7, 2012 at 5:34 AM, Karl Wright <daddywri@gmail.com> wrote:
> What database are you using?  (Not the JDBC database, the underlying
> one...)  If PostgreSQL, what version?  What version of ManifoldCF?  If
> you could also post some of the long-running queries, that would be
> good as well.
>
> Depending on the database, ManifoldCF periodically
> re-analyzes/reindexes the underlying database during the crawl.  When
> the table is large, this can cause some warnings about long-running
> queries, because database performance is slowed during the reindex.
> That's not usually a problem, other than briefly slowing the crawl.
> However, it's also possible that there's a point where PostgreSQL's
> plan is poor, and we should see that because the warning also dumps
> the plan.
>
> Truncating the jobqueue table is not recommended, since then
> ManifoldCF has no idea of what it has crawled and what it hasn't, and
> its incremental properties tend to suffer.
>
> Karl
>
>
> On Mon, May 7, 2012 at 1:25 AM, Michael Le <michael.aaron.le@gmail.com> wrote:
>> Hello,
>>
>> Using a JDBC Repository connection to an Oracle 11g database, I've had
>> issues where in the initial seeding stage the connection to the database is
>> closing in the middle of processing the result set.  The original data table
>> I'm trying to index is about 10 million records, and with the original code,
>> I could never get past about 750K records.
>>
>> I spent some time with the pooling parameters for the bitmechanic database
>> pooling, but the API and source don't seem to be available any more.  Even
>> the original author doesn't have the code or specs any more.  The parameter
>> modifications to the pool allowed me to get through the first stage of
>> processing a 2M row subset, but during the second stage, where it's trying to
>> obtain the documents, the connections again started being closed.  I ended
>> up just replacing the connection pool code with an Oracle implementation,
>> and it's churning through the documents happily.  As a footnote, on my
>> sample subset of about 400K documents, the throughput went from about 10
>> documents/s to 19 docs/s, but this may just be a side effect of Oracle
>> database load or network traffic.
>>
>> Has anyone else had issues processing a large Oracle repository?  I've noted
>> the benchmarks were done with 300K documents, and even in our initial
>> testing with about 500K documents, no issues arose.
>>
>> The second and more pressing issue is the jobqueues table.  In the process
>> of debugging the database connection issues, jobs were started, stopped,
>> deleted, aborted, and various WHERE clauses were applied to the seeding
>> queries/jobs.  MCF is now reporting that there are long-running queries
>> against this table.  In the past, I've just truncated the jobqueues table,
>> but this had the side effect of stuffing a document into Solr (output
>> connector) multiple times.  What API calls or SQL can I run to clean up the
>> jobqueues table?  Should I just wait for all jobs to finish and then at that
>> point truncate the table?  I've broken my data into several smaller subsets
>> of around 1-2 million rows, but that has the side effect of a jobqueues
>> table that is 6-8 million rows.
>>
>> Any support would be greatly appreciated.
>>
>> Thanks,
>> -Michael Le
>>
>>
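
The thread above describes pooled connections being closed partway through
a crawl.  One common pool-side mitigation is to check an idle connection
before lending it out and to discard any that have gone stale.  The sketch
below illustrates that idea only.  It is not ManifoldCF code, not the
bitmechanic pool, and not the replacement pool mentioned in the reply; the
class name ValidatingPool and its methods are hypothetical, and the pool is
generic so that it can be exercised without a database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical illustration: a tiny pool that validates idle resources
 * before reuse. Names and design are assumptions, not any library's API.
 */
final class ValidatingPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;    // creates a fresh resource
    private final Predicate<T> isUsable;  // returns false for stale resources

    ValidatingPool(Supplier<T> factory, Predicate<T> isUsable) {
        this.factory = factory;
        this.isUsable = isUsable;
    }

    /** Returns a usable idle resource, or a new one if none passes the check. */
    synchronized T borrow() {
        T candidate;
        while ((candidate = idle.pollFirst()) != null) {
            if (isUsable.test(candidate)) {
                return candidate;
            }
            // Stale resource: drop it and try the next idle one.
        }
        return factory.get();
    }

    /** Hands a resource back to the pool for later reuse. */
    synchronized void giveBack(T resource) {
        idle.addFirst(resource);
    }

    /** Number of resources currently waiting in the pool. */
    synchronized int idleCount() {
        return idle.size();
    }
}
```

For JDBC, T would be java.sql.Connection and the check could wrap
Connection.isValid(int timeoutSeconds), catching SQLException and treating
it as "not usable".  A real pool would also close discarded connections and
cap the total number of open connections, both of which are omitted here.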
