manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Tue, 22 Jan 2013 18:51:48 GMT
I've checked in code in both trunk and the release branch for this issue.

It would be good if you could try this again in your environment.  The
fix simply prevents some kinds of exceptions from causing a handle
leak.  Please try this with only 1 JDBC Connection connection handle
per JVM and let me know if you see any hangs.

Thanks,
Karl


On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <daddywri@gmail.com> wrote:
> I've looked into the code in some detail.  There is indeed a place
> where it is possible for a JDBC connection handle to be leaked, I
> believe.  However, it's not clear whether this is the circumstance you
> are encountering or not, since it does in involve an exception getting
> thrown doing something not terribly likely to cause exceptions.
>
> I've opened a ticket - CONNECTORS-620.
>
> Karl
>
> On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <daddywri@gmail.com> wrote:
>> Hmm.
>>
>> The following threads are of interest here:
>>
>> Thread 29975: (state = BLOCKED)
>>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> be imprecise)
>>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
>>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
>> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
>> (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
>> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
>> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
>> long, int) @bci=106, line=246 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
>> line=179 (Interpreted frame)
>>
>> ... which is probably waiting in this one:
>>
>> Thread 24457: (state = BLOCKED)
>>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> be imprecise)
>>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
>>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
>> @bci=80, line=80 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
>> java.lang.String, java.lang.String, java.lang.String,
>> java.lang.String) @bci=433, line=128 (Interpreted frame)
>>  - org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
>> @bci=36, line=1212 (Interpreted frame)
>>
>> ... which is waiting to obtain a JDBC connection, and the reason it
>> can't is because it thinks that the only available JDBC connection is
>> currently in use.
>>
>> Since you have only a single connection around, and nothing else is
>> active, it stands to reason that a JDBC connection handle has somehow
>> been leaked, which is a challenge since connections are typically
>> freed in a try/finally block through ManifoldCF.
>>
>> I notice that your stack frames are pretty unusual - what JDK is this
>> that you are using?
>>
>> Karl
>>
>>
>> Since
>> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
>> <anthony.leonard@york.ac.uk> wrote:
>>> Dear Karl,
>>>
>>> Our DBA noticed that each time our job was run 10 Oracle connections were
>>> created. So, we dropped the "Max connections" parameter on the repository
>>> connection config to 1 and re-ran the job with the DBA watching. The job
>>> worked fine but the DBA reported that 1 connection was created and then 10
>>> more briefly ...
>>>
>>> Out of curiosity we re-ran the job again with no further changes and this
>>> time got the following results:
>>>
>>> * the job hung in the "Starting Up" phase again, with the same logging and
>>> symptoms as detailed before on this thread.
>>> * the DBA reported seeing no connections at all this time.
>>> * I have attached a thread dump created by "jstack -F <pid>". This is
>>> reporting all threads as blocked.
>>>
>>> Any ideas? Any help with this would certainly be very gratefully received.
>>>
>>> Best wishes,
>>> Anthony.
>>>
>>> --
>>> Dr Anthony Leonard
>>> System Integrator, Information Directorate
>>> University of York, Heslington, York, UK, YO10 5DD
>>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>> Times Higher Education University of the Year 2010
>>>
>>>
>>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>> kill -QUIT should not abort the agents process, just cause a thread
>>>> dump.  kill -9 is a different story.
>>>>
>>>> You can also do the same thing by using jstack, in the jvm bin directory.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>>> <anthony.leonard@york.ac.uk> wrote:
>>>> > Dear Karl,
>>>> >
>>>> > Many thanks for your insights. I'll do a kill -QUIT next time we have
>>>> > this
>>>> > issue which should hopefully give me the thread dump. However we've
>>>> > noticed
>>>> > that killing processes means we have to run the "locks-clean" script
so
>>>> > it's
>>>> > not our favourite way of doing it.
>>>> >
>>>> > Also I definitely think a timeout for queries would be a good thing.
>>>> >
>>>> > I guess we go back to checking that the connection to the database
>>>> > should
>>>> > have been ok last night...
>>>> >
>>>> > Best wishes,
>>>> > Anthony.
>>>> >
>>>> > --
>>>> > Dr Anthony Leonard
>>>> > System Integrator, Information Directorate
>>>> > University of York, Heslington, York, UK, YO10 5DD
>>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>>> > Times Higher Education University of the Year 2010
>>>> >
>>>> >
>>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com>
wrote:
>>>> >>
>>>> >> Hi Anthony,
>>>> >>
>>>> >> What happens between the framework recognizing that the job should
be
>>>> >> started (which it does fine in both cases), and actually achieving
a
>>>> >> correct job start, is the "seeding" phase, which is going to try
to
>>>> >> execute the seeding query against your Oracle database.  If something
>>>> >> happens at that time to hang the JDBC connection's seeding query,
then
>>>> >> it precisely explains the behavior you are seeing.
>>>> >>
>>>> >> It is also the case that the timeout on the queries that the JDBC
>>>> >> connector does is effectively infinite.  This makes me suspicious
that
>>>> >> what is happening is an Oracle query is going out but there is no
>>>> >> response ever coming back.
>>>> >>
>>>> >> The other possibility is that the JDBC connector is in fact correctly
>>>> >> throwing a ServiceInterruption, but that the ManifoldCF code is
either
>>>> >> not handling it properly, or the connector is not forming it properly.
>>>> >>  In that case, when you notice a hung job, the startup thread will
be
>>>> >> a particular place in the code, and I can diagnose it that way.
>>>> >>
>>>> >> The first order of business is therefore to get a thread dump when
the
>>>> >> system is "hung".  That will help confirm the picture.  There are
a
>>>> >> number of additional questions here.
>>>> >>
>>>> >> (1) Why is this happening?  Is there any possibility that the Oracle
>>>> >> database you are crawling is (very occasionally) not able to properly
>>>> >> respond to a JDBC query?  I can imagine that, under some network
>>>> >> conditions, it might be possible for the Oracle JDBC driver to wind
up
>>>> >> waiting indefinitely for a response that never comes.
>>>> >>
>>>> >> (2) Given that we can't always control the infrastructure we're
trying
>>>> >> to crawl through, should we attempt to provide a reasonable
>>>> >> workaround?  For example, a timeout on JDBC connector queries, where
>>>> >> we throw a ServiceInterruption if the timeout is exceeded?
>>>> >>
>>>> >> Karl
>>>> >>
>>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>>>> >> <anthony.leonard@york.ac.uk> wrote:
>>>> >> > Hi there,
>>>> >> >
>>>> >> > We have recently started running a nightly job 2AM in ManifoldCF
to
>>>> >> > extract
>>>> >> > data from an Oracle repository and populate a Solr index. Most
nights
>>>> >> > this
>>>> >> > works fine, but occasionally the job has been hanging at the
>>>> >> > "Starting
>>>> >> > up"
>>>> >> > phase. We have observed this on our test setup also occasionally.
A
>>>> >> > restart
>>>> >> > of ManifoldCF usually solves this.
>>>> >> >
>>>> >> > Using the simple history reports today I looked up all records
and
>>>> >> > sorted
>>>> >> > them by the "Time" column, largest first, and found the following:
>>>> >> >
>>>> >> > "Start Time","Activity","Identifier","Result
>>>> >> > Code","Bytes","Time","Result
>>>> >> > Description"
>>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","31644956","Interrupted: null"
>>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>>>> >> > ...","ERROR","0","31637594","Interrupted: null"
>>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>>>> >> > ...","OK","0","17511",""
>>>> >> > ... etc ...
>>>> >> >
>>>> >> > If the Time column is in millis that means the first query
was
>>>> >> > hanging
>>>> >> > for
>>>> >> > 22 days! (This was in the period before we went live when our
live
>>>> >> > server
>>>> >> > was sitting idle for a while.) The other two occasions it was
hanging
>>>> >> > for
>>>> >> > about 8 hours until we arrived to restart the job in the morning.
I
>>>> >> > have
>>>> >> > confirmed that the Oracle database we are connecting to was
available
>>>> >> > throughout these periods. These times are also too long for
any
>>>> >> > network
>>>> >> > or
>>>> >> > database timeouts, which makes me suspect that it's a problem
with
>>>> >> > the
>>>> >> > application.
>>>> >> >
>>>> >> > We have the following logging config in properties.xml
>>>> >> >
>>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>>>> >> >
>>>> >> > The job failed again last night and when I checked at 10:40
AM this
>>>> >> > morning
>>>> >> > the last few lines of manifoldcf.log were:
>>>> >> >
>>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking
if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733575454,
>>>> >> > and
>>>> >> > now it is 1358733585635
>>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time
match
>>>> >> > found
>>>> >> > within interval 1358733575454 to 1358733585635
>>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking
if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733585636,
>>>> >> > and
>>>> >> > now it is 1358733595662
>>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time
match
>>>> >> > found
>>>> >> > within interval 1358733585636 to 1358733595662
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking
if job
>>>> >> > 1352455005553 needs to be started; it was last checked at
>>>> >> > 1358733595663,
>>>> >> > and
>>>> >> > now it is 1358733605813
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match
FOUND
>>>> >> > within
>>>> >> > interval 1358733595663 to 1358733605813
>>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>>>> >> > '1352455005553'
>>>> >> > is
>>>> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
>>>> >> > ms.)
>>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled
for job
>>>> >> > start
>>>> >> > for job 1352455005553
>>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>>>> >> > 1352455005553
>>>> >> > for startup
>>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
>>>> >> > string
>>>> >> > is
>>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>>> >> >
>>>> >> > After that - nothing. Once I restarted manifoldcf this morning
the
>>>> >> > job
>>>> >> > magically restarted itself and the following log messages were
added
>>>> >> > where
>>>> >> > it had left off:
>>>> >> >
>>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>>>> >> > 1352455005553
>>>> >> > back to 'ReadyForStartup' state
>>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to
restart
>>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>>>> >> > 1352455005553
>>>> >> > for startup
>>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect
string
>>>> >> > is
>>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553
is
>>>> >> > now
>>>> >> > started
>>>> >> > ... etc ...
>>>> >> >
>>>> >> > So it appears that the application is running fine with a Job
start
>>>> >> > thread
>>>> >> > logging nicely every ten seconds until it tries to start this
job and
>>>> >> > hangs
>>>> >> > entirely until the system is shutdown - even though it performed
that
>>>> >> > same
>>>> >> > tasks perfectly the night before and later the same day.
>>>> >> >
>>>> >> > Can anyone advise on what might be happening here? We are running
>>>> >> > ManifoldCF
>>>> >> > version 1.0.1 on Ubuntu 10.04.
>>>> >> >
>>>> >> > Best wishes,
>>>> >> > Anthony.
>>>> >> >
>>>> >> > --
>>>> >> > Dr Anthony Leonard
>>>> >> > System Integrator, Information Directorate
>>>> >> > University of York, Heslington, York, UK, YO10 5DD
>>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>>>> >> > Times Higher Education University of the Year 2010
>>>> >
>>>> >
>>>
>>>

Mime
View raw message