manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Mon, 21 Jan 2013 13:25:20 GMT
Hi Anthony,

What happens between the framework recognizing that the job should be
started (which it does fine in both cases), and actually achieving a
correct job start, is the "seeding" phase, which is going to try to
execute the seeding query against your Oracle database.  If something
happens at that time to hang the JDBC connection's seeding query, then
it precisely explains the behavior you are seeing.

It is also the case that the timeout on the queries that the JDBC
connector does is effectively infinite.  This makes me suspicious that
what is happening is an Oracle query is going out but there is no
response ever coming back.

The other possibility is that the JDBC connector is in fact correctly
throwing a ServiceInterruption, but that the ManifoldCF code is either
not handling it properly, or the connector is not forming it properly.
 In that case, when you notice a hung job, the startup thread will be
a particular place in the code, and I can diagnose it that way.

The first order of business is therefore to get a thread dump when the
system is "hung".  That will help confirm the picture.  There are a
number of additional questions here.

(1) Why is this happening?  Is there any possibility that the Oracle
database you are crawling is (very occasionally) not able to properly
respond to a JDBC query?  I can imagine that, under some network
conditions, it might be possible for the Oracle JDBC driver to wind up
waiting indefinitely for a response that never comes.

(2) Given that we can't always control the infrastructure we're trying
to crawl through, should we attempt to provide a reasonable
workaround?  For example, a timeout on JDBC connector queries, where
we throw a ServiceInterruption if the timeout is exceeded?

Karl

On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
<anthony.leonard@york.ac.uk> wrote:
> Hi there,
>
> We have recently started running a nightly job 2AM in ManifoldCF to extract
> data from an Oracle repository and populate a Solr index. Most nights this
> works fine, but occasionally the job has been hanging at the "Starting up"
> phase. We have observed this on our test setup also occasionally. A restart
> of ManifoldCF usually solves this.
>
> Using the simple history reports today I looked up all records and sorted
> them by the "Time" column, largest first, and found the following:
>
> "Start Time","Activity","Identifier","Result Code","Bytes","Time","Result
> Description"
> "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> ...","ERROR","0","1926607529","Interrupted: null"
> "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> ...","ERROR","0","31644956","Interrupted: null"
> "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> ...","ERROR","0","31637594","Interrupted: null"
> "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> ...","OK","0","17511",""
> ... etc ...
>
> If the Time column is in millis that means the first query was hanging for
> 22 days! (This was in the period before we went live when our live server
> was sitting idle for a while.) The other two occasions it was hanging for
> about 8 hours until we arrived to restart the job in the morning. I have
> confirmed that the Oracle database we are connecting to was available
> throughout these periods. These times are also too long for any network or
> database timeouts, which makes me suspect that it's a problem with the
> application.
>
> We have the following logging config in properties.xml
>
>   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>
> The job failed again last night and when I checked at 10:40 AM this morning
> the last few lines of manifoldcf.log were:
>
> DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733575454, and
> now it is 1358733585635
> DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
> within interval 1358733575454 to 1358733585635
> DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733585636, and
> now it is 1358733595662
> DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
> within interval 1358733585636 to 1358733595662
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> 1352455005553 needs to be started; it was last checked at 1358733595663, and
> now it is 1358733605813
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND within
> interval 1358733595663 to 1358733605813
> DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553' is
> within run window at 1358733605813 ms. (which starts at 1358733600000 ms.)
> DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job start
> for job 1352455005553
> DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job 1352455005553
> for startup
> DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string is
> 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>
> After that - nothing. Once I restarted manifoldcf this morning the job
> magically restarted itself and the following log messages were added where
> it had left off:
>
> DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job 1352455005553
> back to 'ReadyForStartup' state
> DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job 1352455005553
> for startup
> DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
> 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is now
> started
> ... etc ...
>
> So it appears that the application is running fine with a Job start thread
> logging nicely every ten seconds until it tries to start this job and hangs
> entirely until the system is shutdown - even though it performed that same
> tasks perfectly the night before and later the same day.
>
> Can anyone advise on what might be happening here? We are running ManifoldCF
> version 1.0.1 on Ubuntu 10.04.
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010

Mime
View raw message