manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Fri, 25 Jan 2013 17:09:15 GMT
You can download the current release candidate for 1.1 (RC6) from
http://people.apache.org/~kwright/apache-manifoldcf-1.1 .

Karl


On Fri, Jan 25, 2013 at 12:02 PM, Anthony Leonard
<anthony.leonard@york.ac.uk> wrote:
> Hi Karl,
>
> Thank you so much for this. Sorry for the lack of response as we've been
> working on other things.
>
> One question - would we have to build ManifoldCF ourselves to get the new
> code you've checked in or would it already be part of a binary distribution
> somewhere?
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Tue, Jan 22, 2013 at 6:51 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I've checked in code in both trunk and the release branch for this issue.
>>
>> It would be good if you could try this again in your environment.  The
>> fix simply prevents some kinds of exceptions from causing a handle
>> leak.  Please try this with only 1 JDBC Connection connection handle
>> per JVM and let me know if you see any hangs.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Jan 22, 2013 at 1:11 PM, Karl Wright <daddywri@gmail.com> wrote:
>> > I've looked into the code in some detail.  There is indeed a place
>> > where it is possible for a JDBC connection handle to be leaked, I
>> > believe.  However, it's not clear whether this is the circumstance you
>> > are encountering or not, since it does in involve an exception getting
>> > thrown doing something not terribly likely to cause exceptions.
>> >
>> > I've opened a ticket - CONNECTORS-620.
>> >
>> > Karl
>> >
>> > On Tue, Jan 22, 2013 at 12:53 PM, Karl Wright <daddywri@gmail.com>
>> > wrote:
>> >> Hmm.
>> >>
>> >> The following threads are of interest here:
>> >>
>> >> Thread 29975: (state = BLOCKED)
>> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> >> be imprecise)
>> >>  - java.lang.Thread.join(long) @bci=38, line=1203 (Compiled frame)
>> >>  - java.lang.Thread.join() @bci=2, line=1256 (Compiled frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$JDBCPSResultSet.<init>(org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection,
>> >> java.lang.String, java.util.ArrayList, int) @bci=39, line=1058
>> >> (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection.executeUncachedQuery(java.lang.String,
>> >> java.util.ArrayList, int) @bci=23, line=256 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector.addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity,
>> >> org.apache.manifoldcf.crawler.interfaces.DocumentSpecification, long,
>> >> long, int) @bci=106, line=246 (Interpreted frame)
>> >>  - org.apache.manifoldcf.crawler.system.StartupThread.run() @bci=636,
>> >> line=179 (Interpreted frame)
>> >>
>> >> ... which is probably waiting in this one:
>> >>
>> >> Thread 24457: (state = BLOCKED)
>> >>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> >> be imprecise)
>> >>  - java.lang.Object.wait() @bci=2, line=502 (Interpreted frame)
>> >>  - org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection()
>> >> @bci=80, line=80 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnectionFactory.getConnection(java.lang.String,
>> >> java.lang.String, java.lang.String, java.lang.String,
>> >> java.lang.String) @bci=433, line=128 (Interpreted frame)
>> >>  -
>> >> org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnection$PreparedStatementQueryThread.run()
>> >> @bci=36, line=1212 (Interpreted frame)
>> >>
>> >> ... which is waiting to obtain a JDBC connection, and the reason it
>> >> can't is because it thinks that the only available JDBC connection is
>> >> currently in use.
>> >>
>> >> Since you have only a single connection around, and nothing else is
>> >> active, it stands to reason that a JDBC connection handle has somehow
>> >> been leaked, which is a challenge since connections are typically
>> >> freed in a try/finally block through ManifoldCF.
>> >>
>> >> I notice that your stack frames are pretty unusual - what JDK is this
>> >> that you are using?
>> >>
>> >> Karl
>> >>
>> >>
>> >> Since
>> >> On Tue, Jan 22, 2013 at 12:00 PM, Anthony Leonard
>> >> <anthony.leonard@york.ac.uk> wrote:
>> >>> Dear Karl,
>> >>>
>> >>> Our DBA noticed that each time our job was run 10 Oracle connections
>> >>> were
>> >>> created. So, we dropped the "Max connections" parameter on the
>> >>> repository
>> >>> connection config to 1 and re-ran the job with the DBA watching. The
>> >>> job
>> >>> worked fine but the DBA reported that 1 connection was created and
>> >>> then 10
>> >>> more briefly ...
>> >>>
>> >>> Out of curiosity we re-ran the job again with no further changes and
>> >>> this
>> >>> time got the following results:
>> >>>
>> >>> * the job hung in the "Starting Up" phase again, with the same logging
>> >>> and
>> >>> symptoms as detailed before on this thread.
>> >>> * the DBA reported seeing no connections at all this time.
>> >>> * I have attached a thread dump created by "jstack -F <pid>".
This is
>> >>> reporting all threads as blocked.
>> >>>
>> >>> Any ideas? Any help with this would certainly be very gratefully
>> >>> received.
>> >>>
>> >>> Best wishes,
>> >>> Anthony.
>> >>>
>> >>> --
>> >>> Dr Anthony Leonard
>> >>> System Integrator, Information Directorate
>> >>> University of York, Heslington, York, UK, YO10 5DD
>> >>> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>> Times Higher Education University of the Year 2010
>> >>>
>> >>>
>> >>> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> kill -QUIT should not abort the agents process, just cause a thread
>> >>>> dump.  kill -9 is a different story.
>> >>>>
>> >>>> You can also do the same thing by using jstack, in the jvm bin
>> >>>> directory.
>> >>>>
>> >>>> Karl
>> >>>>
>> >>>>
>> >>>> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>> >>>> <anthony.leonard@york.ac.uk> wrote:
>> >>>> > Dear Karl,
>> >>>> >
>> >>>> > Many thanks for your insights. I'll do a kill -QUIT next time
we
>> >>>> > have
>> >>>> > this
>> >>>> > issue which should hopefully give me the thread dump. However
we've
>> >>>> > noticed
>> >>>> > that killing processes means we have to run the "locks-clean"
>> >>>> > script so
>> >>>> > it's
>> >>>> > not our favourite way of doing it.
>> >>>> >
>> >>>> > Also I definitely think a timeout for queries would be a good
>> >>>> > thing.
>> >>>> >
>> >>>> > I guess we go back to checking that the connection to the database
>> >>>> > should
>> >>>> > have been ok last night...
>> >>>> >
>> >>>> > Best wishes,
>> >>>> > Anthony.
>> >>>> >
>> >>>> > --
>> >>>> > Dr Anthony Leonard
>> >>>> > System Integrator, Information Directorate
>> >>>> > University of York, Heslington, York, UK, YO10 5DD
>> >>>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>>> > Times Higher Education University of the Year 2010
>> >>>> >
>> >>>> >
>> >>>> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi Anthony,
>> >>>> >>
>> >>>> >> What happens between the framework recognizing that the
job should
>> >>>> >> be
>> >>>> >> started (which it does fine in both cases), and actually
achieving
>> >>>> >> a
>> >>>> >> correct job start, is the "seeding" phase, which is going
to try
>> >>>> >> to
>> >>>> >> execute the seeding query against your Oracle database.
 If
>> >>>> >> something
>> >>>> >> happens at that time to hang the JDBC connection's seeding
query,
>> >>>> >> then
>> >>>> >> it precisely explains the behavior you are seeing.
>> >>>> >>
>> >>>> >> It is also the case that the timeout on the queries that
the JDBC
>> >>>> >> connector does is effectively infinite.  This makes me
suspicious
>> >>>> >> that
>> >>>> >> what is happening is an Oracle query is going out but there
is no
>> >>>> >> response ever coming back.
>> >>>> >>
>> >>>> >> The other possibility is that the JDBC connector is in
fact
>> >>>> >> correctly
>> >>>> >> throwing a ServiceInterruption, but that the ManifoldCF
code is
>> >>>> >> either
>> >>>> >> not handling it properly, or the connector is not forming
it
>> >>>> >> properly.
>> >>>> >>  In that case, when you notice a hung job, the startup
thread will
>> >>>> >> be
>> >>>> >> a particular place in the code, and I can diagnose it that
way.
>> >>>> >>
>> >>>> >> The first order of business is therefore to get a thread
dump when
>> >>>> >> the
>> >>>> >> system is "hung".  That will help confirm the picture.
 There are
>> >>>> >> a
>> >>>> >> number of additional questions here.
>> >>>> >>
>> >>>> >> (1) Why is this happening?  Is there any possibility that
the
>> >>>> >> Oracle
>> >>>> >> database you are crawling is (very occasionally) not able
to
>> >>>> >> properly
>> >>>> >> respond to a JDBC query?  I can imagine that, under some
network
>> >>>> >> conditions, it might be possible for the Oracle JDBC driver
to
>> >>>> >> wind up
>> >>>> >> waiting indefinitely for a response that never comes.
>> >>>> >>
>> >>>> >> (2) Given that we can't always control the infrastructure
we're
>> >>>> >> trying
>> >>>> >> to crawl through, should we attempt to provide a reasonable
>> >>>> >> workaround?  For example, a timeout on JDBC connector queries,
>> >>>> >> where
>> >>>> >> we throw a ServiceInterruption if the timeout is exceeded?
>> >>>> >>
>> >>>> >> Karl
>> >>>> >>
>> >>>> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> >>>> >> <anthony.leonard@york.ac.uk> wrote:
>> >>>> >> > Hi there,
>> >>>> >> >
>> >>>> >> > We have recently started running a nightly job 2AM
in ManifoldCF
>> >>>> >> > to
>> >>>> >> > extract
>> >>>> >> > data from an Oracle repository and populate a Solr
index. Most
>> >>>> >> > nights
>> >>>> >> > this
>> >>>> >> > works fine, but occasionally the job has been hanging
at the
>> >>>> >> > "Starting
>> >>>> >> > up"
>> >>>> >> > phase. We have observed this on our test setup also
>> >>>> >> > occasionally. A
>> >>>> >> > restart
>> >>>> >> > of ManifoldCF usually solves this.
>> >>>> >> >
>> >>>> >> > Using the simple history reports today I looked up
all records
>> >>>> >> > and
>> >>>> >> > sorted
>> >>>> >> > them by the "Time" column, largest first, and found
the
>> >>>> >> > following:
>> >>>> >> >
>> >>>> >> > "Start Time","Activity","Identifier","Result
>> >>>> >> > Code","Bytes","Time","Result
>> >>>> >> > Description"
>> >>>> >> > "11-12-2012 05:00:05.941","external query" "... SQL
QUERY
>> >>>> >> > ...","ERROR","0","1926607529","Interrupted: null"
>> >>>> >> > "01-21-2013 02:00:11.843","external query" "... SQL
QUERY
>> >>>> >> > ...","ERROR","0","31644956","Interrupted: null"
>> >>>> >> > "01-17-2013 02:00:03.600","external query" "... SQL
QUERY
>> >>>> >> > ...","ERROR","0","31637594","Interrupted: null"
>> >>>> >> > "12-04-2012 12:12:19.860","external query" "... SQL
QUERY
>> >>>> >> > ...","OK","0","17511",""
>> >>>> >> > ... etc ...
>> >>>> >> >
>> >>>> >> > If the Time column is in millis that means the first
query was
>> >>>> >> > hanging
>> >>>> >> > for
>> >>>> >> > 22 days! (This was in the period before we went live
when our
>> >>>> >> > live
>> >>>> >> > server
>> >>>> >> > was sitting idle for a while.) The other two occasions
it was
>> >>>> >> > hanging
>> >>>> >> > for
>> >>>> >> > about 8 hours until we arrived to restart the job
in the
>> >>>> >> > morning. I
>> >>>> >> > have
>> >>>> >> > confirmed that the Oracle database we are connecting
to was
>> >>>> >> > available
>> >>>> >> > throughout these periods. These times are also too
long for any
>> >>>> >> > network
>> >>>> >> > or
>> >>>> >> > database timeouts, which makes me suspect that it's
a problem
>> >>>> >> > with
>> >>>> >> > the
>> >>>> >> > application.
>> >>>> >> >
>> >>>> >> > We have the following logging config in properties.xml
>> >>>> >> >
>> >>>> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.connectors"
>> >>>> >> > value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.agents"
value="ALL"/>
>> >>>> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >>>> >> >
>> >>>> >> > The job failed again last night and when I checked
at 10:40 AM
>> >>>> >> > this
>> >>>> >> > morning
>> >>>> >> > the last few lines of manifoldcf.log were:
>> >>>> >> >
>> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -
Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked
at
>> >>>> >> > 1358733575454,
>> >>>> >> > and
>> >>>> >> > now it is 1358733585635
>> >>>> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -
 No time
>> >>>> >> > match
>> >>>> >> > found
>> >>>> >> > within interval 1358733575454 to 1358733585635
>> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -
Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked
at
>> >>>> >> > 1358733585636,
>> >>>> >> > and
>> >>>> >> > now it is 1358733595662
>> >>>> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -
 No time
>> >>>> >> > match
>> >>>> >> > found
>> >>>> >> > within interval 1358733585636 to 1358733595662
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -
Checking if
>> >>>> >> > job
>> >>>> >> > 1352455005553 needs to be started; it was last checked
at
>> >>>> >> > 1358733595663,
>> >>>> >> > and
>> >>>> >> > now it is 1358733605813
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -
 Time match
>> >>>> >> > FOUND
>> >>>> >> > within
>> >>>> >> > interval 1358733595663 to 1358733605813
>> >>>> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -
Job
>> >>>> >> > '1352455005553'
>> >>>> >> > is
>> >>>> >> > within run window at 1358733605813 ms. (which starts
at
>> >>>> >> > 1358733600000
>> >>>> >> > ms.)
>> >>>> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) -
Signalled for
>> >>>> >> > job
>> >>>> >> > start
>> >>>> >> > for job 1352455005553
>> >>>> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked
job
>> >>>> >> > 1352455005553
>> >>>> >> > for startup
>> >>>> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC:
The
>> >>>> >> > connect
>> >>>> >> > string
>> >>>> >> > is
>> >>>> >> >
>> >>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >>>> >> >
>> >>>> >> > After that - nothing. Once I restarted manifoldcf
this morning
>> >>>> >> > the
>> >>>> >> > job
>> >>>> >> > magically restarted itself and the following log messages
were
>> >>>> >> > added
>> >>>> >> > where
>> >>>> >> > it had left off:
>> >>>> >> >
>> >>>> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting
job
>> >>>> >> > 1352455005553
>> >>>> >> > back to 'ReadyForStartup' state
>> >>>> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting
due to
>> >>>> >> > restart
>> >>>> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset
complete
>> >>>> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked
job
>> >>>> >> > 1352455005553
>> >>>> >> > for startup
>> >>>> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC:
The connect
>> >>>> >> > string
>> >>>> >> > is
>> >>>> >> >
>> >>>> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >>>> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>> >>>> >> > 1352455005553 is
>> >>>> >> > now
>> >>>> >> > started
>> >>>> >> > ... etc ...
>> >>>> >> >
>> >>>> >> > So it appears that the application is running fine
with a Job
>> >>>> >> > start
>> >>>> >> > thread
>> >>>> >> > logging nicely every ten seconds until it tries to
start this
>> >>>> >> > job and
>> >>>> >> > hangs
>> >>>> >> > entirely until the system is shutdown - even though
it performed
>> >>>> >> > that
>> >>>> >> > same
>> >>>> >> > tasks perfectly the night before and later the same
day.
>> >>>> >> >
>> >>>> >> > Can anyone advise on what might be happening here?
We are
>> >>>> >> > running
>> >>>> >> > ManifoldCF
>> >>>> >> > version 1.0.1 on Ubuntu 10.04.
>> >>>> >> >
>> >>>> >> > Best wishes,
>> >>>> >> > Anthony.
>> >>>> >> >
>> >>>> >> > --
>> >>>> >> > Dr Anthony Leonard
>> >>>> >> > System Integrator, Information Directorate
>> >>>> >> > University of York, Heslington, York, UK, YO10 5DD
>> >>>> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> >>>> >> > Times Higher Education University of the Year 2010
>> >>>> >
>> >>>> >
>> >>>
>> >>>
>
>

Mime
View raw message