manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: JDBC Connection Exception
Date Mon, 07 May 2012 09:34:50 GMT
What database are you using?  (Not the JDBC database, the underlying
one...)  If PostgreSQL, what version?  What version of ManifoldCF?  If
you could also post some of the long-running queries, that would be
good as well.

Depending on the database, ManifoldCF periodically
re-analyzes/reindexes the underlying database during the crawl.  When
the table is large, that can produce some warnings about long-running
queries, because database performance is degraded while the reindex is
in progress.  That's not usually a problem, other than briefly slowing
the crawl.  However, it's also possible that PostgreSQL has settled on
a poor query plan at some point; we should be able to see that, because
the warning also dumps the plan.
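
For reference, on PostgreSQL this maintenance amounts to ordinary
ANALYZE and REINDEX statements, and the dumped plan is essentially
EXPLAIN output.  A minimal sketch of doing the same by hand over JDBC
(the "jobqueue" table name is stock ManifoldCF; the connection details
and the example query are placeholders, not actual ManifoldCF queries):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PgMaintenanceSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        String url = "jdbc:postgresql://localhost:5432/dbname";
        try (Connection conn =
                 DriverManager.getConnection(url, "manifoldcf", "password");
             Statement stmt = conn.createStatement()) {
          // Refresh planner statistics, roughly what the periodic
          // re-analyze does.
          stmt.execute("ANALYZE jobqueue");
          // Rebuild indexes; this takes locks, so concurrent queries
          // slow down while it runs.
          stmt.execute("REINDEX TABLE jobqueue");
          // Look at a query's plan, much as the long-running-query
          // warning does.  The query text here is illustrative only.
          try (ResultSet rs = stmt.executeQuery(
              "EXPLAIN SELECT id FROM jobqueue WHERE status = 'P'")) {
            while (rs.next())
              System.out.println(rs.getString(1));
          }
        }
      }
    }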

Truncating the jobqueue table is not recommended, since ManifoldCF
then has no idea what it has and hasn't crawled, and its incremental
properties tend to suffer.

Karl


On Mon, May 7, 2012 at 1:25 AM, Michael Le <michael.aaron.le@gmail.com> wrote:
> Hello,
>
> Using a JDBC repository connection to an Oracle 11g database, I've had
> issues where, in the initial seeding stage, the connection to the
> database closes in the middle of processing the result set.  The data
> table I'm trying to index contains about 10 million records, and with
> the original code I could never get past about 750K records.
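> 
> To illustrate the shape of the read, here's a standalone sketch of
> streaming a large result set over JDBC with an explicit fetch size
> (this is not the ManifoldCF code; the table, column, and connection
> details are hypothetical):
> 
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.ResultSet;
>     import java.sql.Statement;
> 
>     public class LargeResultSetSketch {
>       public static void main(String[] args) throws Exception {
>         // Placeholder connection details.
>         String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL";
>         try (Connection conn =
>                  DriverManager.getConnection(url, "user", "pass");
>              Statement stmt = conn.createStatement()) {
>           // Fetch rows in batches so millions of rows aren't buffered
>           // client-side; the read still fails, though, if the pool
>           // closes the connection mid-stream.
>           stmt.setFetchSize(1000);
>           try (ResultSet rs =
>                    stmt.executeQuery("SELECT doc_id FROM documents")) {
>             long count = 0;
>             while (rs.next()) {
>               rs.getString(1);  // the seed id
>               count++;
>             }
>             System.out.println("Read " + count + " seed ids");
>           }
>         }
>       }
>     }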
>
> I spent some time with the pooling parameters of the bitmechanic
> database pool, but the API and source don't seem to be available any
> more.  Even the original author doesn't have the code or specs any
> more.  The parameter modifications to the pool allowed me to get
> through the first stage of processing a 2M-row subset, but during the
> second stage, where it's trying to obtain the documents, the
> connections again started being closed.  I ended up just replacing the
> connection pool code with an Oracle implementation, and it's churning
> through the documents happily.  As a footnote, on my sample subset of
> about 400K documents, the throughput went from about 10 docs/s to 19
> docs/s, but this may just be a side effect of Oracle database load or
> network traffic.
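> 
> For concreteness, here's a minimal sketch of such a replacement using
> Oracle's Universal Connection Pool (UCP is one plausible choice, not
> necessarily the one used; the connection details are placeholders):
> 
>     import java.sql.Connection;
> 
>     import oracle.ucp.jdbc.PoolDataSource;
>     import oracle.ucp.jdbc.PoolDataSourceFactory;
> 
>     public class OraclePoolSketch {
>       public static void main(String[] args) throws Exception {
>         PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
>         pds.setConnectionFactoryClassName(
>             "oracle.jdbc.pool.OracleDataSource");
>         // Placeholder connection details.
>         pds.setURL("jdbc:oracle:thin:@//dbhost:1521/ORCL");
>         pds.setUser("user");
>         pds.setPassword("pass");
>         pds.setInitialPoolSize(5);
>         pds.setMaxPoolSize(20);
>         // Validate on borrow so dead connections are discarded rather
>         // than handed back to the crawler.
>         pds.setValidateConnectionOnBorrow(true);
>         try (Connection conn = pds.getConnection()) {
>           System.out.println("Connected: "
>               + conn.getMetaData().getDatabaseProductVersion());
>         }
>       }
>     }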
>
> Has anyone else had issues processing a large Oracle repository?  I've noted
> the benchmarks were done with 300K documents, and even in our initial
> testing with about 500K documents, no issues arose.
>
> The second and more pressing issue is the jobqueues table.  In the
> process of debugging the database connection issues, jobs were started,
> stopped, deleted, and aborted, and various WHERE clauses were applied
> to the seeding queries/jobs.  MCF is now reporting that there are
> long-running queries against this table.  In the past, I've just
> truncated the jobqueues table, but this had the side effect of stuffing
> a document into Solr (the output connector) multiple times.  What API
> calls or SQL can I run to clean up the jobqueues table?  Should I just
> wait for all jobs to finish and then truncate the table at that point?
> I've broken my data into several smaller subsets of around 1-2 million
> rows, but that has the side effect of a jobqueues table that is 6-8
> million rows.
>
> Any support would be greatly appreciated.
>
> Thanks,
> -Michael Le
>
>
