manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Exception in the running Custom Job
Date Tue, 28 Aug 2018 13:18:32 GMT
Can you add logging messages to your connector to log (1) the length that
it sees, and (2) the result of checkLengthIndexable()?  And then, please
once again add the Allowed Documents transformer and set a reasonable
document length.  Run the job and see why it is rejecting your documents.
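A minimal sketch of that logging, assuming a stand-in interface so it compiles without ManifoldCF on the classpath. `LengthGate`, `lengthAllowed`, and the logger name are hypothetical; only `checkLengthIndexable(long)` comes from the IProcessActivity method quoted further down in this thread, and the real method also declares ManifoldCFException and ServiceInterruption.

```java
import java.util.logging.Logger;

// Sketch only: LengthGate is a hypothetical stand-in for the single
// IProcessActivity method discussed in this thread. The real method also
// declares ManifoldCFException and ServiceInterruption, omitted here.
final class LengthLoggingSketch {
  interface LengthGate {
    boolean checkLengthIndexable(long length);
  }

  // Hypothetical logger name; a real connector would use its own logging setup.
  private static final Logger LOG = Logger.getLogger("connector.length");

  /** Logs (1) the length the connector sees and (2) the verdict, then returns the verdict. */
  static boolean lengthAllowed(LengthGate activities, String documentId, long length) {
    boolean indexable = activities.checkLengthIndexable(length);
    LOG.info("Document " + documentId + ": length seen = " + length
        + ", checkLengthIndexable = " + indexable);
    return indexable;
  }

  public static void main(String[] args) {
    // Assumed limit of 1 MiB, standing in for the Allowed Documents setting.
    LengthGate oneMiB = length -> length <= 1_048_576L;
    System.out.println(lengthAllowed(oneMiB, "doc-1", 2_048L));
    System.out.println(lengthAllowed(oneMiB, "doc-2", 5_000_000L));
  }
}
```

If the logged length is not the real size of the file, or the verdict is false for every document, that points at the value the connector passes in rather than at the transformer.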

All of our shipping connectors use this logic and it does work, so I am
rather certain that the problem is in your connector.
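As a rough illustration of the kind of check chain meant here, the sketch below runs the three checks already listed in this thread plus the length check before any content would be fetched. `Checks`, `rejectionReason`, and `limitOnly` are hypothetical names introduced for the example, and the stand-in interface drops the checked exceptions the real IProcessActivity methods declare; it is not code from any shipping connector.

```java
import java.util.Date;

// Sketch only: Checks is a hypothetical stand-in for the four activity
// checks named in this thread, so the example runs without ManifoldCF.
final class IndexabilityChainSketch {
  interface Checks {
    boolean checkURLIndexable(String url);
    boolean checkMimeTypeIndexable(String mimeType);
    boolean checkDateIndexable(Date date);
    boolean checkLengthIndexable(long length);
  }

  /** Returns null when every check passes, otherwise the name of the first failing check. */
  static String rejectionReason(Checks activities, String url, String mimeType,
                                Date modified, long length) {
    if (!activities.checkURLIndexable(url)) return "url";
    if (!activities.checkMimeTypeIndexable(mimeType)) return "mime type";
    if (!activities.checkDateIndexable(modified)) return "date";
    // The length must be the real document size, known before the body is read.
    if (!activities.checkLengthIndexable(length)) return "length";
    return null;
  }

  /** Hypothetical helper: accepts everything except documents longer than maxLength. */
  static Checks limitOnly(long maxLength) {
    return new Checks() {
      public boolean checkURLIndexable(String url) { return true; }
      public boolean checkMimeTypeIndexable(String mimeType) { return true; }
      public boolean checkDateIndexable(Date date) { return true; }
      public boolean checkLengthIndexable(long length) { return length <= maxLength; }
    };
  }

  public static void main(String[] args) {
    Checks checks = limitOnly(1_048_576L);
    String reason = rejectionReason(checks, "https://example.invalid/a.pdf",
        "application/pdf", new Date(), 5_000_000L);
    // A connector would skip the document here instead of fetching and ingesting it.
    System.out.println(reason == null ? "ingest" : "skip: " + reason);
  }
}
```

Honoring the response means that a non-null reason stops processing for that document, so an oversized file never reaches the Tika Extractor.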

Thanks,
Karl


On Tue, Aug 28, 2018 at 8:54 AM Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> Thank you for the valuable suggestion.
>
> checkLengthIndexable() is also used in the code, and it is
> returning the exact value for the document length.
>
> Garbage collection and disposal of the threads are also in place.
>
>
>
> On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I don't see checkLengthIndexable() in this list.  You need to add that if
>> you want your connector to avoid trying to index documents that are
>> too big.
>>
>> You said before that when you added the Allowed Documents transformer to
>> the chain it removed ALL documents, so I suspect it's there but you are not
>> sending in the actual document length.
>>
>> Karl
>>
>>
>>
>>
>> On Tue, Aug 28, 2018 at 8:10 AM Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> These methods are already used by the connector, in the code where the
>>> file is read and ingested into the output:
>>>
>>> (!activities.checkURLIndexable(fileUrl))
>>> (!activities.checkMimeTypeIndexable(contentType))
>>> (!activities.checkDateIndexable(modifiedDate))
>>>
>>>
>>> But the service still crashes after crawling approximately 2000 documents.
>>>
>>> I think something else is hitting it and causing the problem.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Aug 24, 2018 at 8:33 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> Until you fix your connector, nothing can be done to address your Out
>>>> Of Memory problem.
>>>>
>>>> The problem is that you are not calling the following IProcessActivity
>>>> method:
>>>>
>>>>   /** Check whether a document of a specific length is indexable by the
>>>>   * currently specified output connector.
>>>>   *@param length is the document length.
>>>>   *@return true if the document is indexable.
>>>>   */
>>>>   public boolean checkLengthIndexable(long length)
>>>>     throws ManifoldCFException, ServiceInterruption;
>>>>
>>>> Your connector should call this and honor the response.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Fri, Aug 24, 2018 at 9:55 AM Nikita Ahuja <nikita@smartshore.nl>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I have checked for a coding error and found nothing like that, as
>>>>> "Allowed Documents" works fine with the same code on another system.
>>>>>
>>>>> The main issue now is that ManifoldCF shuts down and the system
>>>>> shows *"java.lang.OutOfMemoryError: GC overhead limit
>>>>> exceeded".*
>>>>>
>>>>> PostgreSQL is being used for ManifoldCF and the memory allotted to the
>>>>> system is very good, but this issue still occurs very frequently.
>>>>> Throttling (2) and the worker thread count (45) have also been checked,
>>>>> and different values have been tried as per the documentation.
>>>>>
>>>>>
>>>>> Please suggest the possible problem area and steps to be taken.
>>>>>
>>>>> On Mon, Aug 20, 2018 at 7:30 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Obviously your Allowed Documents filter is somehow causing all
>>>>>> documents to be excluded.  Since you have a custom repository connector, I
>>>>>> would bet there is a coding error in it that is responsible.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> Thanks for the reply.
>>>>>>>
>>>>>>> I am using them in that same sequence. The Allowed Documents
>>>>>>> transformer is added first and then the Tika transformation.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> But nothing runs in that scenario. The job simply ends without
>>>>>>> returning anything in the output.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> You are running out of memory.
>>>>>>>> Tika's memory consumption is not well defined so you will need to
>>>>>>>> limit the size of documents that reach it.  This is not the same as
>>>>>>>> limiting the size of documents *after* Tika extracts them.
>>>>>>>>
>>>>>>>> The Allowed Documents transformer therefore should be placed in the
>>>>>>>> pipeline before the Tika Extractor.
>>>>>>>>
>>>>>>>> "Also it is not compatible with the Allowed Documents and Metadata
>>>>>>>> Adjuster Connectors."
>>>>>>>>
>>>>>>>> This is a huge red flag.  Why not?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> There is a custom job executing for Aconex in the ManifoldCF
>>>>>>>>> environment, but it is not able to crawl the complete set of
>>>>>>>>> documents. It crashes in the middle of the execution.
>>>>>>>>>
>>>>>>>>> Also it is not compatible with the Allowed Documents and Metadata
>>>>>>>>> Adjuster Connectors.
>>>>>>>>>
>>>>>>>>> The custom job created is similar to the existing Jira connector
>>>>>>>>> in ManifoldCF.
>>>>>>>>>
>>>>>>>>> It is showing the errors below. Please suggest the appropriate
>>>>>>>>> steps to follow to make it run smoothly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/---.---.---.---] failed: Read timed out*
>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>> *        at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)*
>>>>>>>>> *        at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)*
>>>>>>>>> *        at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)*
>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>> *        at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)*
>>>>>>>>> *        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)*
>>>>>>>>> *        at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)*
>>>>>>>>> *        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)*
>>>>>>>>> *        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)*
>>>>>>>>> *        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)*
>>>>>>>>> *        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)*
>>>>>>>>> *        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)*
>>>>>>>>> *        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)*
>>>>>>>>> *        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>>>>>>>> *        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>>>>>>>> *        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)*
>>>>>>>>> *        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)*
>>>>>>>>> *        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)*
>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345}*
>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}*
>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}*
>>>>>>>>> *Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/23.10.35.84] failed: Read timed out*
>>>>>>>>> --
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Nikita
>>>>>>>>> Email: nikita@smartshore.nl
>>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>>> a "Smartshore" Company
>>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>>> http://www.smartshore.nl
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
