manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: JDBC Connection Exception
Date Mon, 14 May 2012 18:26:48 GMT
This was committed to trunk last week, and seems to work well.
Karl

On Wed, May 9, 2012 at 11:20 AM, Karl Wright <daddywri@gmail.com> wrote:
> FWIW, the ticket is CONNECTORS-96.  I've created a branch to work on
> it.  I'll let you know when I think it's ready to try out.
>
> Karl
>
>
> On Mon, May 7, 2012 at 5:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Also, there has been a long-standing ticket to replace the JDBC pool
>> driver with something more modern.  Many of the off-the-shelf pool
>> drivers are inadequate for various reasons, so I have one that I wrote
>> myself, but it is not yet committed.  So I am curious - which
>> connections are timing out?  The Oracle connections or the PostgreSQL
>> ones?
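[Editor's note: the stale-connection handling Karl alludes to can be pictured as a pool that validates each connection before handing it out.  This is a hedged, minimal sketch, not ManifoldCF's actual pool code; a real JDBC pool would hold java.sql.Connection objects and call conn.isValid(timeout), where here a generic Predicate stands in.]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Minimal validate-on-borrow pool sketch.  Illustrative only: a real JDBC
// pool would validate with java.sql.Connection.isValid(timeout) and close
// stale connections instead of simply dropping them.
class ValidatingPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final Predicate<T> validator;

    ValidatingPool(Supplier<T> factory, Predicate<T> validator) {
        this.factory = factory;
        this.validator = validator;
    }

    // Borrow: discard idle items that fail validation (e.g. connections the
    // server has timed out), falling back to creating a fresh one.
    synchronized T borrow() {
        while (!idle.isEmpty()) {
            T item = idle.pop();
            if (validator.test(item)) return item;
            // stale item dropped here; a real pool would close it
        }
        return factory.get();
    }

    synchronized void release(T item) {
        idle.push(item);
    }
}
```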
>>
>> Karl
>>
>> On Mon, May 7, 2012 at 5:34 AM, Karl Wright <daddywri@gmail.com> wrote:
>>> What database are you using?  (Not the JDBC database, the underlying
>>> one...)  If PostgreSQL, what version?  What version of ManifoldCF?  If
>>> you could also post some of the long-running queries, that would be
>>> good as well.
>>>
>>> Depending on the database, ManifoldCF periodically
>>> re-analyzes/reindexes the underlying database during the crawl.  When
>>> the table is large, this can cause some warnings about long-running
>>> queries, because database performance is slowed during the reindex
>>> process.  That's not usually a problem, other than briefly slowing
>>> the crawl.  However, it's also possible that there's a point where
>>> PostgreSQL's plan is poor, and we should see that because the warning
>>> also dumps the plan.
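[Editor's note: the periodic re-analysis Karl describes can be sketched as a modification counter that fires a maintenance action once enough rows have changed.  This is an assumption-laden illustration, not ManifoldCF's actual bookkeeping; the threshold and the analyze hook (which in practice might issue e.g. "ANALYZE jobqueue" over JDBC) are hypothetical.]

```java
// Sketch: trigger a maintenance action after every N tracked modifications,
// as a stand-in for the periodic re-analysis described above.  While the
// action runs, concurrent queries may slow down, producing the long-running
// query warnings discussed in this thread.
class AnalyzeTracker {
    private final int threshold;
    private final Runnable analyzeAction;   // e.g. runs ANALYZE via JDBC
    private long modificationsSinceAnalyze = 0;

    AnalyzeTracker(int threshold, Runnable analyzeAction) {
        this.threshold = threshold;
        this.analyzeAction = analyzeAction;
    }

    // Called once per insert/update/delete against the tracked table.
    void noteModification() {
        if (++modificationsSinceAnalyze >= threshold) {
            analyzeAction.run();
            modificationsSinceAnalyze = 0;
        }
    }
}
```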
>>>
>>> Truncating the jobqueue table is not recommended, since then
>>> ManifoldCF has no idea what it has crawled and what it hasn't, and
>>> its incremental properties tend to suffer.
>>>
>>> Karl
>>>
>>>
>>> On Mon, May 7, 2012 at 1:25 AM, Michael Le <michael.aaron.le@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> Using a JDBC Repository connection to an Oracle 11g database, I've had
>>>> issues where, in the initial seeding stage, the connection to the database is
>>>> closing in the middle of processing the result set.  The original data table
>>>> I'm trying to index is about 10 million records, and with the original code,
>>>> I could never get past about 750K records.
>>>>
>>>> I spent some time with the pooling parameters to the bitmechanic database
>>>> pooling, but the API and source don't seem to be available any more.  Even
>>>> the original author doesn't have the code or specs any more.  The parameter
>>>> modifications to the pool allowed me to get through the first stage of
>>>> processing a 2M row subset, but during the second stage, where it's trying to
>>>> obtain the documents, the connections again started being closed.  I ended
>>>> up just replacing the connection pool code with an Oracle implementation,
>>>> and it's churning through the documents happily.  As a footnote, on my
>>>> sample subset of about 400K documents, the throughput went from about 10
>>>> documents/s to 19 docs/s, but this may just be a side effect of Oracle
>>>> database load or network traffic.
>>>>
>>>> Has anyone else had issues processing a large Oracle repository?  I've noted
>>>> the benchmarks were done with 300K documents, and even in our initial
>>>> testing with about 500K documents, no issues arose.
>>>>
>>>> The second and more pressing issue is the jobqueues table.  In the process
>>>> of debugging the database connection issues, jobs were started, stopped,
>>>> deleted, and aborted, and various WHERE clauses were applied to the seeding
>>>> queries/jobs.  MCF is now reporting that there are long-running queries
>>>> against this table.  In the past, I've just truncated the jobqueues table,
>>>> but this had the side effect of stuffing a document into Solr (output
>>>> connector) multiple times.  What API calls or SQL can I run to clean up the
>>>> jobqueues table?  Should I just wait for all jobs to finish and at that
>>>> point truncate the table?  I've broken my data into several smaller subsets
>>>> of around 1-2 million rows, but that has the side effect of a jobqueues
>>>> table that is 6-8 million rows.
>>>>
>>>> Any support would be greatly appreciated.
>>>>
>>>> Thanks,
>>>> -Michael Le
>>>>
>>>>
