manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Exception in the running Custom Job
Date Wed, 29 Aug 2018 08:14:21 GMT
So the Allowed Document transformer is now working, and your connector is
now skipping documents that are too large, correct?  But you are still
seeing out of memory errors?

Does your connector load the entire document into memory before it calls
checkLengthIndexable()?  Because if it does, that will not work.  There is
a reason that connectors are constructed to stream data in MCF.
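For illustration, the streaming pattern described here might look like the following minimal sketch. The `Activities` and `Document` types below are hypothetical stand-ins for ManifoldCF's IProcessActivity and RepositoryDocument, not the real classes; the key point is that checkLengthIndexable() is called on the repository-reported length before any content is read, and the open stream is handed off rather than buffered.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Stand-in for IProcessActivity (hypothetical, for illustration only).
interface Activities {
    boolean checkLengthIndexable(long length);
}

// Stand-in for RepositoryDocument: holds a stream reference, never the bytes.
class Document {
    InputStream binaryStream;
    long binaryLength;

    void setBinary(InputStream stream, long length) {
        this.binaryStream = stream;   // consumed later by the pipeline
        this.binaryLength = length;
    }
}

public class StreamingSketch {
    /** Returns true if the document was queued for ingestion. */
    public static boolean process(Activities activities, Document doc,
                                  long reportedLength, InputStream content) {
        // Ask the output connector first; skip oversized documents
        // without ever loading their content into memory.
        if (!activities.checkLengthIndexable(reportedLength)) {
            return false;
        }
        doc.setBinary(content, reportedLength);
        return true;
    }

    public static void main(String[] args) {
        Activities limit1k = length -> length <= 1024;
        boolean small = process(limit1k, new Document(), 512,
                new ByteArrayInputStream(new byte[512]));
        boolean big = process(limit1k, new Document(), 10_000_000,
                new ByteArrayInputStream(new byte[0]));
        System.out.println(small + " " + big); // prints "true false"
    }
}
```

ManifoldCF's real RepositoryDocument exposes an analogous setBinary(stream, length), so the framework reads the stream during ingestion and the connector never needs the whole document in memory.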

It might be faster to diagnose your problem if you made the source code
available so that I could audit it.

Karl


On Wed, Aug 29, 2018 at 2:42 AM Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> The result for both the length and the checkLengthIndexable() method is
> the same. And the Allowed Documents transformer is also working. But the
> main problem is that the service crashes, displaying a memory leak error
> every time after crawling a few documents.
>
>
>
> On Tue, Aug 28, 2018 at 6:48 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Can you add logging messages to your connector to log (1) the length that
>> it sees, and (2) the result of checkLengthIndexable()?  And then, please
>> once again add the Allowed Documents transformer and set a reasonable
>> document length.  Run the job and see why it is rejecting your documents.
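The suggested logging might look like this minimal sketch. The nested `Activities` interface is a hypothetical stand-in for ManifoldCF's IProcessActivity, and the logger name is arbitrary; only the two logged values matter.

```java
import java.util.logging.Logger;

public class LengthCheckLogging {
    // Stand-in for IProcessActivity (hypothetical, for illustration only).
    interface Activities {
        boolean checkLengthIndexable(long length);
    }

    private static final Logger LOG = Logger.getLogger("connector");

    /** Logs (1) the length seen and (2) the result of checkLengthIndexable(). */
    public static boolean checkAndLog(Activities activities,
                                      String docId, long length) {
        boolean indexable = activities.checkLengthIndexable(length);
        LOG.info("Document " + docId + ": length=" + length
                + " checkLengthIndexable=" + indexable);
        return indexable;
    }

    public static void main(String[] args) {
        // A 1 MB limit, standing in for the Allowed Documents setting.
        Activities limit = len -> len <= 1_000_000;
        checkAndLog(limit, "doc-1", 512);        // logs ... indexable=true
        checkAndLog(limit, "doc-2", 2_000_000);  // logs ... indexable=false
    }
}
```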
>>
>> All of our shipping connectors use this logic and it does work, so I am
>> rather certain that the problem is in your connector.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Aug 28, 2018 at 8:54 AM Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Thank you for the valuable suggestion.
>>>
>>> checkLengthIndexable() is also used in the code, and it is returning
>>> the expected value for the document length.
>>>
>>> Garbage collection and disposal of the threads are also used.
>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 5:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> I don't see checkLengthIndexable() in this list.  You need to add it
>>>> if you want your connector to be able to skip documents that are too
>>>> big.
>>>>
>>>> You said before that when you added the Allowed Documents transformer
>>>> to the chain it removed ALL documents, so I suspect it's there but you are
>>>> not sending in the actual document length.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 28, 2018 at 8:10 AM Nikita Ahuja <nikita@smartshore.nl>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> These methods are already in use in the connector code where the
>>>>> file needs to be read and ingested into the output:
>>>>>
>>>>> (!activities.checkURLIndexable(fileUrl))
>>>>> (!activities.checkMimeTypeIndexable(contentType))
>>>>> (!activities.checkDateIndexable(modifiedDate))
>>>>>
>>>>>
>>>>> But the service crashes after crawling approximately 2000 documents.
>>>>>
>>>>> I think something else is hitting it and causing the problem.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 24, 2018 at 8:33 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Nikita,
>>>>>>
>>>>>> Until you fix your connector, nothing can be done to address your
>>>>>> Out Of Memory problem.
>>>>>>
>>>>>> The problem is that you are not calling the following
>>>>>> IProcessActivity method:
>>>>>>
>>>>>>   /** Check whether a document of a specific length is indexable by
>>>>>>   * the currently specified output connector.
>>>>>>   *@param length is the document length.
>>>>>>   *@return true if the document is indexable.
>>>>>>   */
>>>>>>   public boolean checkLengthIndexable(long length)
>>>>>>     throws ManifoldCFException, ServiceInterruption;
>>>>>>
>>>>>> Your connector should call this and honor the response.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 24, 2018 at 9:55 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> I have checked for the coding error; there is nothing like that, as
>>>>>>> the "Allowed Document" transformer is working fine with the same
>>>>>>> code on another system.
>>>>>>>
>>>>>>> But the main issue now being faced is the shutting down of
>>>>>>> ManifoldCF; it shows "java.lang.OutOfMemoryError: GC overhead
>>>>>>> limit exceeded" on the system.
>>>>>>>
>>>>>>> PostgreSQL is being used for ManifoldCF, and the memory allotted to
>>>>>>> the system is generous, but this issue still occurs very frequently.
>>>>>>> Throttling (2) and a worker thread count of 45 have also been
>>>>>>> checked, and different values were tried as per the documentation.
>>>>>>>
>>>>>>>
>>>>>>> Please suggest the possible problem area and steps to be taken.
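For readers hitting the same wall: if the rejection logic checks out and the heap is simply too small for the workload, the agents process heap can be raised. In the single-process example distribution (as used here, per the `example` directory in the log below), the JVM options are read from a file alongside start.jar; file names per that layout, values purely illustrative:

```
# In example/options.env.win (Windows) or example/options.env.unix,
# raise the heap given to the combined agents/UI process, e.g.:
-Xms1024m
-Xmx4096m
```

This only buys headroom; it does not fix a connector that buffers whole documents in memory.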
>>>>>>>
>>>>>>> On Mon, Aug 20, 2018 at 7:30 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Obviously your Allowed Documents filter is somehow causing all
>>>>>>>> documents to be excluded.  Since you have a custom repository
>>>>>>>> connector, I would bet there is a coding error in it that is
>>>>>>>> responsible.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> Thanks for reply.
>>>>>>>>>
>>>>>>>>> I am using them in the same sequence. The Allowed Documents
>>>>>>>>> transformer is added first and then the Tika transformation.
>>>>>>>>>
>>>>>>>>> But nothing runs in that scenario. The job simply ends without
>>>>>>>>> returning anything in the output.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> You are running out of memory.
>>>>>>>>>> Tika's memory consumption is not well defined, so you will need
>>>>>>>>>> to limit the size of documents that reach it.  This is not the
>>>>>>>>>> same as limiting the size of documents *after* Tika extracts
>>>>>>>>>> them.
>>>>>>>>>>
>>>>>>>>>> The Allowed Documents transformer therefore should be placed in
>>>>>>>>>> the pipeline before the Tika Extractor.
>>>>>>>>>>
>>>>>>>>>> "Also it is not compatible with the Allowed Documents
and
>>>>>>>>>> Metadata Adjuster Connectors."
>>>>>>>>>>
>>>>>>>>>> This is a huge red flag.  Why not?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <
>>>>>>>>>> nikita@smartshore.nl> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>
>>>>>>>>>>> There is a custom job executing for Aconex in the ManifoldCF
>>>>>>>>>>> environment. But while executing, it is not able to crawl the
>>>>>>>>>>> complete set of documents. It crashes in the middle of the
>>>>>>>>>>> execution.
>>>>>>>>>>>
>>>>>>>>>>> Also, it is not compatible with the Allowed Documents and
>>>>>>>>>>> Metadata Adjuster connectors.
>>>>>>>>>>>
>>>>>>>>>>> The custom job created is similar to the existing Jira connector
>>>>>>>>>>> in ManifoldCF.
>>>>>>>>>>>
>>>>>>>>>>> And it is showing this type of error. Please suggest the
>>>>>>>>>>> appropriate steps that need to be followed to make it run
>>>>>>>>>>> smoothly.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Connect to uk1.aconex.co.uk:443
>>>>>>>>>>> [uk1.aconex.co.uk/---.---.---.---] failed: Read timed out*
>>>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>>>> *agents process ran out of memory - shutting down*
>>>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)*
>>>>>>>>>>> *java.lang.OutOfMemoryError: Java heap space*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)*
>>>>>>>>>>> *        at
>>>>>>>>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)*
>>>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.ServerConnector -
>>>>>>>>>>> Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345}*
>>>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler -
>>>>>>>>>>> Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}*
>>>>>>>>>>> *[Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler -
>>>>>>>>>>> Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}*
>>>>>>>>>>> *Connect to uk1.aconex.co.uk:443
>>>>>>>>>>> [uk1.aconex.co.uk/23.10.35.84] failed: Read timed out*
>>>>>>>>>>> --
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Nikita
>>>>>>>>>>> Email: nikita@smartshore.nl
>>>>>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>>>>>> a "Smartshore" Company
>>>>>>>>>>> Mobile: +91 99 888 57720
>>>>>>>>>>> http://www.smartshore.nl
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
