manifoldcf-user mailing list archives

From Nikita Ahuja <nik...@smartshore.nl>
Subject Re: Exception in the running Custom Job
Date Wed, 29 Aug 2018 09:18:31 GMT
Hi Karl,


Yes, the documents are being ingested into the output connector without any
error. But after processing about 2,000-3,000 documents the service crashes
and displays an "Out Of Memory" message.

The checkLengthIndexable() method is called first, before ingesting the
document.

Please have a look at the attachment for the methods which might be the
problem area.

On Wed, Aug 29, 2018 at 1:44 PM, Karl Wright <daddywri@gmail.com> wrote:

> So the Allowed Document transformer is now working, and your connector is
> now skipping documents that are too large, correct?  But you are still
> seeing out of memory errors?
>
> Does your connector load the entire document into memory before it calls
> checkLengthIndexable()?  Because if it does, that will not work.  There is
> a reason that connectors are constructed to stream data in MCF.
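
A minimal sketch of the streaming pattern described here, assuming the
standard MCF IProcessActivity and RepositoryDocument APIs; documentLength,
documentIdentifier, versionString, documentURI, and the fetchContentStream()
helper are hypothetical connector-local names:

    // Reject the document by length BEFORE fetching or buffering its content.
    if (!activities.checkLengthIndexable(documentLength))
    {
      activities.noDocument(documentIdentifier, versionString);
      return;
    }
    // Hand the framework a stream; never read the whole document into a byte[].
    InputStream is = fetchContentStream(documentIdentifier); // hypothetical fetch helper
    try
    {
      RepositoryDocument rd = new RepositoryDocument();
      rd.setBinary(is, documentLength);
      activities.ingestDocumentWithException(documentIdentifier, versionString,
        documentURI, rd);
    }
    finally
    {
      is.close();
    }
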
>
> It might be faster to diagnose your problem if you made the source code
> available so that I could audit it.
>
> Karl
>
>
> On Wed, Aug 29, 2018 at 2:42 AM Nikita Ahuja <nikita@smartshore.nl> wrote:
>
>> Hi Karl,
>>
>> The result for both the length and the checkLengthIndexable() method is
>> the same, and the Allowed Documents transformer is also working. But the
>> main problem is that the service keeps crashing; it displays a memory
>> leak error every time after crawling a few documents.
>>
>>
>>
>> On Tue, Aug 28, 2018 at 6:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Can you add logging messages to your connector to log (1) the length
>>> that it sees, and (2) the result of checkLengthIndexable()?  And then,
>>> please once again add the Allowed Documents transformer and set a
>>> reasonable document length.  Run the job and see why it is rejecting your
>>> documents.
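
A sketch of the suggested logging inside the connector's processDocuments()
method, assuming the framework's standard Logging.connectors logger (from
org.apache.manifoldcf.crawler.system.Logging); length, documentIdentifier,
and versionString are hypothetical connector-local names:

    // Log the length the connector computed and whether the pipeline accepts it.
    boolean indexable = activities.checkLengthIndexable(length);
    Logging.connectors.info("Document " + documentIdentifier + ": length=" +
      length + ", checkLengthIndexable=" + indexable);
    if (!indexable)
    {
      // Honor the response: skip the document instead of ingesting it.
      activities.noDocument(documentIdentifier, versionString);
      return;
    }
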
>>>
>>> All of our shipping connectors use this logic and it does work, so I am
>>> rather certain that the problem is in your connector.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Tue, Aug 28, 2018 at 8:54 AM Nikita Ahuja <nikita@smartshore.nl>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Thank you for the valuable suggestion.
>>>>
>>>> The checkLengthIndexable() method is also used in the code, and it
>>>> returns the correct value for the document length.
>>>>
>>>> The garbage collector and disposal of the threads are also used.
>>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> I don't see checkLengthIndexable() in this list.  You need to add
>>>>> that if you want your connector to avoid trying to index documents
>>>>> that are too big.
>>>>>
>>>>> You said before that when you added the Allowed Documents transformer
>>>>> to the chain it removed ALL documents, so I suspect it's there but
>>>>> you are not sending in the actual document length.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 8:10 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> These methods are already in use in the connector code where the
>>>>>> file needs to be read and ingested into the output.
>>>>>>
>>>>>> (!activities.checkURLIndexable(fileUrl))
>>>>>> (!activities.checkMimeTypeIndexable(contentType))
>>>>>> (!activities.checkDateIndexable(modifiedDate))
>>>>>>
>>>>>>
>>>>>> But the service crashes after crawling approximately 2,000 documents.
>>>>>>
>>>>>> I think something else is hitting it and creating the problem.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 24, 2018 at 8:33 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Nikita,
>>>>>>>
>>>>>>> Until you fix your connector, nothing can be done to address your
>>>>>>> Out Of Memory problem.
>>>>>>>
>>>>>>> The problem is that you are not calling the following
>>>>>>> IProcessActivity method:
>>>>>>>
>>>>>>>   /** Check whether a document of a specific length is indexable by the currently specified output connector.
>>>>>>>   *@param length is the document length.
>>>>>>>   *@return true if the document is indexable.
>>>>>>>   */
>>>>>>>   public boolean checkLengthIndexable(long length)
>>>>>>>     throws ManifoldCFException, ServiceInterruption;
>>>>>>>
>>>>>>> Your connector should call this and honor the response.
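
A sketch of that call, placed alongside the other pre-ingestion checks the
connector already makes (the checkURLIndexable, checkMimeTypeIndexable, and
checkDateIndexable calls quoted earlier in this thread); the variable names
are hypothetical:

    // All four checks belong together, before any content is fetched; the
    // response is honored by recording "no document" and moving on.
    if (!activities.checkURLIndexable(fileUrl) ||
      !activities.checkMimeTypeIndexable(contentType) ||
      !activities.checkDateIndexable(modifiedDate) ||
      !activities.checkLengthIndexable(documentLength))
    {
      activities.noDocument(documentIdentifier, versionString);
      continue; // next identifier in the processDocuments() loop
    }
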
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 24, 2018 at 9:55 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I have checked for a coding error; there is nothing like that, as
>>>>>>>> "Allowed Document" is working fine with the same code on another
>>>>>>>> system.
>>>>>>>>
>>>>>>>> But the main issue now being faced is the shutting down of
>>>>>>>> ManifoldCF; it shows "java.lang.OutOfMemoryError: GC overhead
>>>>>>>> limit exceeded" on the system.
>>>>>>>>
>>>>>>>> PostgreSQL is being used for ManifoldCF and the memory allotted to
>>>>>>>> the system is very generous, but this issue is still faced very
>>>>>>>> frequently. A throttling value of 2 and a worker thread count of
>>>>>>>> 45 have also been checked, and as per the documentation it has
>>>>>>>> been tried with different values.
>>>>>>>>
>>>>>>>>
>>>>>>>> Please suggest the possible problem area and the steps to be taken.
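
The log further down shows the single-process example (start.jar under
D:\Manifold\apache-manifoldcf-2.8.1\example), so the agents process heap is
capped by whatever -Xmx the JVM was started with. A sketch of launching the
example with a larger heap (the 1024m figure is only an illustration):

    cd example
    java -Xms1024m -Xmx1024m -jar start.jar

"GC overhead limit exceeded" means the collector is running continuously
against a nearly full heap, so a larger -Xmx only postpones the crash if the
connector buffers whole documents in memory.
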
>>>>>>>>
>>>>>>>> On Mon, Aug 20, 2018 at 7:30 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Obviously your Allowed Documents filter is somehow causing all
>>>>>>>>> documents to be excluded.  Since you have a custom repository
>>>>>>>>> connector, I would bet there is a coding error in it that is
>>>>>>>>> responsible.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> Thanks for reply.
>>>>>>>>>>
>>>>>>>>>> I am using them in the same sequence: the Allowed Documents
>>>>>>>>>> transformer is added first and then the Tika transformation.
>>>>>>>>>>
>>>>>>>>>> But nothing runs in that scenario. The job simply ends without
>>>>>>>>>> returning anything in the output.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> You are running out of memory.
>>>>>>>>>>> Tika's memory consumption is not well defined, so you will need
>>>>>>>>>>> to limit the size of documents that reach it.  This is not the
>>>>>>>>>>> same as limiting the size of documents *after* Tika extracts
>>>>>>>>>>> them.
>>>>>>>>>>>
>>>>>>>>>>> The Allowed Documents transformer therefore should be placed in
>>>>>>>>>>> the pipeline before the Tika Extractor.
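
The resulting pipeline order, sketched from the advice above (the repository
connector here is the custom Aconex one from this thread):

    Aconex repository connector
      -> Allowed Documents transformer   (length limit applied to raw documents)
      -> Tika Extractor                  (content extraction)
      -> output connector

With this order the size limit is enforced before Tika ever parses the
document, and it is presumably also what checkLengthIndexable() reports back
to the repository connector.
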
>>>>>>>>>>>
>>>>>>>>>>> "Also it is not compatible with the Allowed Documents
and
>>>>>>>>>>> Metadata Adjuster Connectors."
>>>>>>>>>>>
>>>>>>>>>>> This is a huge red flag.  Why not?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a custom job executing for Aconex in the ManifoldCF
>>>>>>>>>>>> environment. But while executing, it is not able to crawl the
>>>>>>>>>>>> complete set of documents; it crashes in the middle of the
>>>>>>>>>>>> execution.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, it is not compatible with the Allowed Documents and
>>>>>>>>>>>> Metadata Adjuster connectors.
>>>>>>>>>>>>
>>>>>>>>>>>> The custom job created is similar to the existing Jira
>>>>>>>>>>>> connector in ManifoldCF.
>>>>>>>>>>>>
>>>>>>>>>>>> It is showing the error below. Please suggest the appropriate
>>>>>>>>>>>> steps which need to be followed to make it run smoothly.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/---.---.---.---] failed: Read timed out
>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>> agents process ran out of memory - shutting down
>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)
>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)
>>>>>>>>>>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)
>>>>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>>         at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>>>>>>>>>>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>>>>>>>>>>>>         at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>>>>>>>>>>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>>>>>>>>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
>>>>>>>>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>>>>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>>>>>>>>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>>>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)
>>>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345}
>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}
>>>>>>>>>>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}
>>>>>>>>>>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/23.10.35.84] failed: Read timed out
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>


-- 
Thanks and Regards,
Nikita
Email: nikita@smartshore.nl
United Sources Service Pvt. Ltd.
a "Smartshore" Company
Mobile: +91 99 888 57720
http://www.smartshore.nl
