manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Error: Unexpected jobqueue status
Date Wed, 04 Feb 2015 16:08:31 GMT
It's also possible that there are no log messages from the time of the
shutdown attempt -- that too would be good to know.  That would mean that
the Postgresql driver is interpreting interruption with a specific
SQLException code.  In that case we may need to instrument the code in
order to determine what the PostgreSQL driver is up to, and it may be
difficult to distinguish between thread interruption and a legitimate SQL
error.

Karl


On Wed, Feb 4, 2015 at 10:53 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Adrian,
>
> There should be log messages corresponding to this line:
>
> >>>>>>
>           // Log it, but keep the thread alive
>           Logging.threads.error("Exception tossed: "+e.getMessage(),e);
> <<<<<<
>
> Can you find any of those, and if so, please send me the logged exception?
>
> Thanks!
> Karl
>
>
> On Wed, Feb 4, 2015 at 10:50 AM, Adrian Conlon <Adrian.Conlon@arup.com>
> wrote:
>
>>  Thanks Karl, I’m glad I found you an interesting one!
>>
>>
>>
>> Reasonably fast agent shutdown would definitely be useful for scheduled
>> DB maintenance tasks…
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 04 February 2015 15:38
>>
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> CONNECTORS-1156.
>>
>> Karl
>>
>>
>>
>> On Wed, Feb 4, 2015 at 10:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>   Hi Adrian,
>>
>> Clearly, the shutdown is happening at a time when ManifoldCF is in the
>> midst of an ANALYZE TABLE operation.  The shutdown is attempting to
>> interrupt this operation, and is probably succeeding, but since it is
>> happening when a transaction is being closed, it apparently winds up
>> retrying the whole thing at that point -- which goes right back into
>> ANALYZE, only to be interrupted again one second later.
>>
>> I still can't see why it retries, but I should be able to analyze that
>> much from here.  I'll create a ticket for the problem.
>>
>> Karl
>>
>>
>>
>> On Wed, Feb 4, 2015 at 10:26 AM, Adrian Conlon <Adrian.Conlon@arup.com>
>> wrote:
>>
>>  At that point, only a few (three or so) minutes.
>>
>>
>>
>> I left it for another 5 or 6 six minutes after I grabbed the stack trace
>> before I finally ‘-9’ed it.
>>
>>
>>
>> HTH,
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 04 February 2015 15:21
>>
>>
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Thanks for the trace.
>>
>> Here's what I get out of it:
>>
>> The shutdown thread is waiting for all the threads to terminate:
>>
>> >>>>>>
>> "Shutdown thread" prio=10 tid=0x00007fe740114000 nid=0xfa9 in
>> Object.wait() [0x00007fe710116000]
>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>     at java.lang.Object.wait(Native Method)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF.sleep(ManifoldCF.java:1048)
>>     - locked <0x00000000e70cf2e8> (a java.lang.Integer)
>>     at
>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopSystem(CrawlerAgent.java:617)
>>     at
>> org.apache.manifoldcf.crawler.system.CrawlerAgent.stopAgent(CrawlerAgent.java:249)
>>     at
>> org.apache.manifoldcf.agents.system.AgentsDaemon.stopAgents(AgentsDaemon.java:168)
>>     - locked <0x00000000eafa3460> (a java.util.HashMap)
>>     at
>> org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsShutdownHook.doCleanup(AgentsDaemon.java:395)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF.cleanUpEnvironment(ManifoldCF.java:1340)
>>     - locked <0x00000000eafcb208> (a java.util.ArrayList)
>>     - locked <0x00000000eafcb2a0> (a java.lang.Integer)
>>     at
>> org.apache.manifoldcf.core.system.ManifoldCF$ShutdownThread.run(ManifoldCF.java:1565)
>> <<<<<<
>>
>> There are a lot of Zookeeper threads still alive, but those don't matter
>> here.  There is precisely one thread that is blocking shutdown:
>>
>> >>>>>>
>> "Startup thread" daemon prio=10 tid=0x00007fe73012f000 nid=0x340b in
>> Object.wait() [0x00007fe71f7f6000]
>>    java.lang.Thread.State: WAITING (on object monitor)
>>     at java.lang.Object.wait(Native Method)
>>     at java.lang.Thread.join(Thread.java:1281)
>>     - locked <0x00000000e851a0b8> (a
>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread)
>>     at java.lang.Thread.join(Thread.java:1355)
>>     at
>> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.finishUp(Database.java:694)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:728)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:790)
>>     at
>> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1444)
>>     at
>> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
>>     at
>> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:191)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:656)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.analyzeTableInternal(DBInterfacePostgreSQL.java:1431)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.noteModificationsNoTransactions(DBInterfacePostgreSQL.java:1576)
>>     at
>> org.apache.manifoldcf.core.database.Database.playbackModifications(Database.java:429)
>>     at
>> org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:414)
>>     at
>> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1231)
>>     at
>> org.apache.manifoldcf.crawler.jobs.JobManager.resetStartupJob(JobManager.java:7575)
>>     at
>> org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:238)
>> <<<<<<
>>
>> This is performing a database ANALYZE TABLE operation, which *should* be
>> interruptible, given the trace, but is apparently *not*.  I'll have to look
>> at this to find a reason why it won't interrupt.  Offhand, I can see no
>> reason for it.
>>
>> When you got this thread dump, how long had the system been waiting to
>> shut down?
>>
>> Karl
>>
>>
>>
>> On Wed, Feb 4, 2015 at 9:04 AM, Adrian Conlon <Adrian.Conlon@arup.com>
>> wrote:
>>
>>  Hi Karl,
>>
>>
>>
>> It’s taken me a little while before I needed to do some work on the
>> server, but here’s a ‘jstack’ dump of an agents run that’s not responding
>> to a stop request.
>>
>>
>>
>> Hope it’s helpful,
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Adrian Conlon [mailto:Adrian.Conlon@arup.com]
>> *Sent:* 12 January 2015 15:35
>> *To:* user@manifoldcf.apache.org
>> *Subject:* RE: Error: Unexpected jobqueue status
>>
>>
>>
>> Thanks Karl.  Restarting the job manually did fix the problem.
>>
>>
>>
>> I might add a check for this in my software and kick the job into life
>> again automatically, now I know it works…
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com <daddywri@gmail.com>]
>> *Sent:* 12 January 2015 11:34
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Hi Adrian,
>>
>> Just restarting the job should be sufficient to get it sorted out after
>> this kind of failure.
>>
>> Karl
>>
>>
>>
>> On Mon, Jan 12, 2015 at 6:19 AM, Adrian Conlon <Adrian.Conlon@arup.com>
>> wrote:
>>
>>  Thanks Karl.
>>
>>
>>
>> With regards the runtime environment, apologies for the omission of
>> Postgresql version.  It’s v9.3.
>>
>>
>>
>> For the stack trace, I’ve just installed a jdk on the problematic server
>> and tried out “jstack” (that’s a neat tool!), so I’m all systems go for the
>> next time agents process doesn’t respond to a stop request.
>>
>>
>>
>> With regards jobs that have this unexpected “jobqueue” status; do they
>> sort themselves out the next time the job runs?  Is there anything I should
>> do to “help” the job along?
>>
>>
>>
>> Adrian
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* 12 January 2015 00:27
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Error: Unexpected jobqueue status
>>
>>
>>
>> Also, if you are having trouble shutting down the agents process, it
>> would be great if you could get a thread dump and post it, before you kill
>> -9 it.
>>
>> Karl
>>
>>
>>
>> On Sun, Jan 11, 2015 at 7:25 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>  Hi Adrian,
>>
>> If you noted the comment stream in CONNECTORS-590, I was able to
>> demonstrate conclusively that the problem was in Postgres.  I have not seen
>> the problem in 9.3, but that does not mean it's gone.  What version of
>> Postgresql are you using?
>>
>> In any case, while this problem definitely terminates your job, it will
>> not happen very often.  I suspect the frequency of occurrence may depend on
>> how loaded the database is.
>>
>> Karl
>>
>>
>>
>> On Sun, Jan 11, 2015 at 7:14 PM, Adrian Conlon <Adrian.Conlon@arup.com>
>> wrote:
>>
>>  Hi All,
>>
>>
>>
>> I’m getting an occurrence of what looks very similar to CONNECTORS-590.
>>
>>
>>
>> The circumstances are:
>>
>>
>>
>> 1)      MCF Jobs proceeding very slowly (looks like a Postgresql vacuum
>> is needed)
>>
>> 2)      Stop tomcat
>>
>> 3)      Attempt to stop the agents normally
>>
>> 4)      Wait a minute or two
>>
>> 5)      Decide to “kill -9” the agents process
>>
>> 6)      Vacuum the database
>>
>> 7)      Restart tomcat
>>
>> 8)      Restart the agents
>>
>>
>>
>> When I checked the job status page, I found that two of the jobs (out
>> around 4000 or so) had the following status (or very similar):
>>
>>
>>
>> Error: Unexpected jobqueue status - record id 1417115392831, expecting
>> active status, saw 4
>>
>>
>>
>> Setup-wise, I’m running a release candidate of v1.8 RC (I think RC2),
>> using postresql as the crawl database and running on Ubuntu Linux.  I’m
>> using zookeeper style synchronisation.
>>
>>
>>
>> Let me know if more information etc. is needed or if you think it’s a
>> new/real issue.
>>
>> Adrian
>>
>> ____________________________________________________________
>> Electronic mail messages entering and leaving Arup  business
>> systems are scanned for acceptability of content and viruses
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Mime
View raw message