manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anthony Leonard <anthony.leon...@york.ac.uk>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Fri, 25 Jan 2013 17:02:50 GMT
Hi Karl,

Thank you so much for this. Sorry for the lack of response as we've been
working on other things.

One question - would we have to build ManifoldCF ourselves to get the new
code you've checked in or would it already be part of a binary distribution
somewhere?

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright <daddywri@gmail.com> wrote:

> I've checked in code in both trunk and the release branch for this issue.
>
> It would be good if you could try this again in your environment.  The
> fix simply prevents some kinds of exceptions from causing a handle
> leak.  Please try this with only 1 JDBC Connection connection handle
> per JVM and let me know if you see any hangs.
>
> Thanks,
> Karl
>
>
> On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <daddywri@gmail.com> wrote:
> > I've looked into the code in some detail.  There is indeed a place
> > where it is possible for a JDBC connection handle to be leaked, I
> > believe.  However, it's not clear whether this is the circumstance you
> > are encountering or not, since it does in involve an exception getting
> > thrown doing something not terribly likely to cause exceptions.
> >
> > I've opened a ticket - CONNECTORS-620.
> >
> > Karl
> >
> > On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <daddywri@gmail.com>
> wrote:
> >> Hmm.
> >>
> >> The following threads are of interest here:
> >>
> >> Thread 29975: (state = BLOCKED)
> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> >> be imprecise)
> >>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
> >>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
> >> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
> >> (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
> >> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
> >> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
> >> long, int) @bci=106, line=246 (Interpreted frame)
> >>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
> >> line=179 (Interpreted frame)
> >>
> >> ... which is probably waiting in this one:
> >>
> >> Thread 24457: (state = BLOCKED)
> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
> >> be imprecise)
> >>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
> >>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
> >> @bci=80, line=80 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
> >> java.lang.String, java.lang.String, java.lang.String,
> >> java.lang.String) @bci=433, line=128 (Interpreted frame)
> >>  -
> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
> >> @bci=36, line=1212 (Interpreted frame)
> >>
> >> ... which is waiting to obtain a JDBC connection, and the reason it
> >> can't is because it thinks that the only available JDBC connection is
> >> currently in use.
> >>
> >> Since you have only a single connection around, and nothing else is
> >> active, it stands to reason that a JDBC connection handle has somehow
> >> been leaked, which is a challenge since connections are typically
> >> freed in a try/finally block through ManifoldCF.
> >>
> >> I notice that your stack frames are pretty unusual - what JDK is this
> >> that you are using?
> >>
> >> Karl
> >>
> >>
> >> Since
> >> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
> >> <anthony.leonard@york.ac.uk> wrote:
> >>> Dear Karl,
> >>>
> >>> Our DBA noticed that each time our job was run 10 Oracle connections
> were
> >>> created. So, we dropped the "Max connections" parameter on the
> repository
> >>> connection config to 1 and re-ran the job with the DBA watching. The
> job
> >>> worked fine but the DBA reported that 1 connection was created and
> then 10
> >>> more briefly ...
> >>>
> >>> Out of curiosity we re-ran the job again with no further changes and
> this
> >>> time got the following results:
> >>>
> >>> * the job hung in the "Starting Up" phase again, with the same logging
> and
> >>> symptoms as detailed before on this thread.
> >>> * the DBA reported seeing no connections at all this time.
> >>> * I have attached a thread dump created by "jstack -F <pid>". This
is
> >>> reporting all threads as blocked.
> >>>
> >>> Any ideas? Any help with this would certainly be very gratefully
> received.
> >>>
> >>> Best wishes,
> >>> Anthony.
> >>>
> >>> --
> >>> Dr Anthony Leonard
> >>> System Integrator, Information Directorate
> >>> University of York, Heslington, York, UK, YO10 5DD
> >>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>> Times Higher Education University of the Year 2010
> >>>
> >>>
> >>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com>
> wrote:
> >>>>
> >>>> kill -QUIT should not abort the agents process, just cause a thread
> >>>> dump.  kill -9 is a different story.
> >>>>
> >>>> You can also do the same thing by using jstack, in the jvm bin
> directory.
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
> >>>> <anthony.leonard@york.ac.uk> wrote:
> >>>> > Dear Karl,
> >>>> >
> >>>> > Many thanks for your insights. I'll do a kill -QUIT next time we
> have
> >>>> > this
> >>>> > issue which should hopefully give me the thread dump. However we've
> >>>> > noticed
> >>>> > that killing processes means we have to run the "locks-clean"
> script so
> >>>> > it's
> >>>> > not our favourite way of doing it.
> >>>> >
> >>>> > Also I definitely think a timeout for queries would be a good thing.
> >>>> >
> >>>> > I guess we go back to checking that the connection to the database
> >>>> > should
> >>>> > have been ok last night...
> >>>> >
> >>>> > Best wishes,
> >>>> > Anthony.
> >>>> >
> >>>> > --
> >>>> > Dr Anthony Leonard
> >>>> > System Integrator, Information Directorate
> >>>> > University of York, Heslington, York, UK, YO10 5DD
> >>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>>> > Times Higher Education University of the Year 2010
> >>>> >
> >>>> >
> >>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com>
> wrote:
> >>>> >>
> >>>> >> Hi Anthony,
> >>>> >>
> >>>> >> What happens between the framework recognizing that the job
should
> be
> >>>> >> started (which it does fine in both cases), and actually achieving
> a
> >>>> >> correct job start, is the "seeding" phase, which is going to
try to
> >>>> >> execute the seeding query against your Oracle database.  If
> something
> >>>> >> happens at that time to hang the JDBC connection's seeding
query,
> then
> >>>> >> it precisely explains the behavior you are seeing.
> >>>> >>
> >>>> >> It is also the case that the timeout on the queries that the
JDBC
> >>>> >> connector does is effectively infinite.  This makes me suspicious
> that
> >>>> >> what is happening is an Oracle query is going out but there
is no
> >>>> >> response ever coming back.
> >>>> >>
> >>>> >> The other possibility is that the JDBC connector is in fact
> correctly
> >>>> >> throwing a ServiceInterruption, but that the ManifoldCF code
is
> either
> >>>> >> not handling it properly, or the connector is not forming it
> properly.
> >>>> >>  In that case, when you notice a hung job, the startup thread
will
> be
> >>>> >> a particular place in the code, and I can diagnose it that
way.
> >>>> >>
> >>>> >> The first order of business is therefore to get a thread dump
when
> the
> >>>> >> system is "hung".  That will help confirm the picture.  There
are a
> >>>> >> number of additional questions here.
> >>>> >>
> >>>> >> (1) Why is this happening?  Is there any possibility that the
> Oracle
> >>>> >> database you are crawling is (very occasionally) not able to
> properly
> >>>> >> respond to a JDBC query?  I can imagine that, under some network
> >>>> >> conditions, it might be possible for the Oracle JDBC driver
to
> wind up
> >>>> >> waiting indefinitely for a response that never comes.
> >>>> >>
> >>>> >> (2) Given that we can't always control the infrastructure we're
> trying
> >>>> >> to crawl through, should we attempt to provide a reasonable
> >>>> >> workaround?  For example, a timeout on JDBC connector queries,
> where
> >>>> >> we throw a ServiceInterruption if the timeout is exceeded?
> >>>> >>
> >>>> >> Karl
> >>>> >>
> >>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> >>>> >> <anthony.leonard@york.ac.uk> wrote:
> >>>> >> > Hi there,
> >>>> >> >
> >>>> >> > We have recently started running a nightly job 2AM in
ManifoldCF
> to
> >>>> >> > extract
> >>>> >> > data from an Oracle repository and populate a Solr index.
Most
> nights
> >>>> >> > this
> >>>> >> > works fine, but occasionally the job has been hanging
at the
> >>>> >> > "Starting
> >>>> >> > up"
> >>>> >> > phase. We have observed this on our test setup also
> occasionally. A
> >>>> >> > restart
> >>>> >> > of ManifoldCF usually solves this.
> >>>> >> >
> >>>> >> > Using the simple history reports today I looked up all
records
> and
> >>>> >> > sorted
> >>>> >> > them by the "Time" column, largest first, and found the
> following:
> >>>> >> >
> >>>> >> > "Start Time","Activity","Identifier","Result
> >>>> >> > Code","Bytes","Time","Result
> >>>> >> > Description"
> >>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
> >>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","31644956","Interrupted: null"
> >>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> >>>> >> > ...","ERROR","0","31637594","Interrupted: null"
> >>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> >>>> >> > ...","OK","0","17511",""
> >>>> >> > ... etc ...
> >>>> >> >
> >>>> >> > If the Time column is in millis that means the first query
was
> >>>> >> > hanging
> >>>> >> > for
> >>>> >> > 22 days! (This was in the period before we went live when
our
> live
> >>>> >> > server
> >>>> >> > was sitting idle for a while.) The other two occasions
it was
> hanging
> >>>> >> > for
> >>>> >> > about 8 hours until we arrived to restart the job in the
> morning. I
> >>>> >> > have
> >>>> >> > confirmed that the Oracle database we are connecting to
was
> available
> >>>> >> > throughout these periods. These times are also too long
for any
> >>>> >> > network
> >>>> >> > or
> >>>> >> > database timeouts, which makes me suspect that it's a
problem
> with
> >>>> >> > the
> >>>> >> > application.
> >>>> >> >
> >>>> >> > We have the following logging config in properties.xml
> >>>> >> >
> >>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.connectors"
value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >>>> >> >
> >>>> >> > The job failed again last night and when I checked at
10:40 AM
> this
> >>>> >> > morning
> >>>> >> > the last few lines of manifoldcf.log were:
> >>>> >> >
> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking
if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked
at
> >>>> >> > 1358733575454,
> >>>> >> > and
> >>>> >> > now it is 1358733585635
> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No
time match
> >>>> >> > found
> >>>> >> > within interval 1358733575454 to 1358733585635
> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking
if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked
at
> >>>> >> > 1358733585636,
> >>>> >> > and
> >>>> >> > now it is 1358733595662
> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No
time match
> >>>> >> > found
> >>>> >> > within interval 1358733585636 to 1358733595662
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking
if
> job
> >>>> >> > 1352455005553 needs to be started; it was last checked
at
> >>>> >> > 1358733595663,
> >>>> >> > and
> >>>> >> > now it is 1358733605813
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time
match
> FOUND
> >>>> >> > within
> >>>> >> > interval 1358733595663 to 1358733605813
> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
> >>>> >> > '1352455005553'
> >>>> >> > is
> >>>> >> > within run window at 1358733605813 ms. (which starts at
> 1358733600000
> >>>> >> > ms.)
> >>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled
for
> job
> >>>> >> > start
> >>>> >> > for job 1352455005553
> >>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked
job
> >>>> >> > 1352455005553
> >>>> >> > for startup
> >>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC:
The connect
> >>>> >> > string
> >>>> >> > is
> >>>> >> > 'jdbc:oracle:thin:@//
> oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >>>> >> >
> >>>> >> > After that - nothing. Once I restarted manifoldcf this
morning
> the
> >>>> >> > job
> >>>> >> > magically restarted itself and the following log messages
were
> added
> >>>> >> > where
> >>>> >> > it had left off:
> >>>> >> >
> >>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting
job
> >>>> >> > 1352455005553
> >>>> >> > back to 'ReadyForStartup' state
> >>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting
due to
> restart
> >>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> >>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked
job
> >>>> >> > 1352455005553
> >>>> >> > for startup
> >>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The
connect
> string
> >>>> >> > is
> >>>> >> > 'jdbc:oracle:thin:@//
> oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
> 1352455005553 is
> >>>> >> > now
> >>>> >> > started
> >>>> >> > ... etc ...
> >>>> >> >
> >>>> >> > So it appears that the application is running fine with
a Job
> start
> >>>> >> > thread
> >>>> >> > logging nicely every ten seconds until it tries to start
this
> job and
> >>>> >> > hangs
> >>>> >> > entirely until the system is shutdown - even though it
performed
> that
> >>>> >> > same
> >>>> >> > tasks perfectly the night before and later the same day.
> >>>> >> >
> >>>> >> > Can anyone advise on what might be happening here? We
are running
> >>>> >> > ManifoldCF
> >>>> >> > version 1.0.1 on Ubuntu 10.04.
> >>>> >> >
> >>>> >> > Best wishes,
> >>>> >> > Anthony.
> >>>> >> >
> >>>> >> > --
> >>>> >> > Dr Anthony Leonard
> >>>> >> > System Integrator, Information Directorate
> >>>> >> > University of York, Heslington, York, UK, YO10 5DD
> >>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >>>> >> > Times Higher Education University of the Year 2010
> >>>> >
> >>>> >
> >>>
> >>>
>

Mime
View raw message