manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Amazon CloudSearch Connector question
Date Mon, 08 Feb 2016 21:27:45 GMT
I have chased this down to a completely broken Apache Commons-IO library.
It no longer works with the JSONReader objects in ManifoldCF at all, and
refuses to read anything from them.  Unfortunately I can't change versions
of that library because other things depend upon it. So I'll need to write
my own code to replace its functionality.  That will take some amount of
time to do.

This probably happened the last time our dependencies were updated.  My
apologies.

Karl


On Mon, Feb 8, 2016 at 4:18 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com>
wrote:

> Thanks,
>
> Don't know if it'll help, but removing the usage of JSONObjectReader in
> addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk
> instead of using the JSONArrayReader in flushDocuments changed the error I
> was getting from Amazon.
>
> Maybe those objects are failing on parsing the content to JSON.
>
> On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Ok, I'm debugging away, and I can confirm that no data is getting
>> through.  I'll have to open a ticket and create a patch when I find the
>> problem.
>>
>> Karl
>>
>>
>> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <
>> jpdiazvaz@mcplusa.com> wrote:
>>
>>> Thank you very much.
>>>
>>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
>>>> about the JSON format we are sending it.  The deprecation message is
>>>> probably a strong clue.  I'll experiment here with logging document
>>>> contents so that I can give you further advice.  Stay tuned.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>
>>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>>> fails when sending the request to amazon cloudsearch:
>>>>>
>>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status":
>>>>> "error", "errors": [{"message": "[*Deprecated*: Use the outer message
>>>>> field] Encountered unexpected end of file"}], "adds": 0, "__type":
>>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected
>>>>> end of file\"] }", "deletes": 0}'
>>>>>
>>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>>> the Amazon end, that would be great.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> More likely this is a bug.
>>>>>>>
>>>>>>> I take it that it is the body string that is not coming out,
>>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the
>>>>>>> body clause exist and is just empty, or is it not there at all?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>>> seeing the expected content being posted to SOLR, so it may not be
>>>>>>>> an issue with TIKA. After adding some more logging to the
>>>>>>>> CloudSearch connector, I think the data is getting lost just before
>>>>>>>> passing it to the DocumentChunkManager, which inserts the empty
>>>>>>>> records to the DB. Could it be that the JSONObjectReader doesn't
>>>>>>>> like my data?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Juan,
>>>>>>>>>
>>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>>> solr output connection.  If you include the tika extractor in the
>>>>>>>>> pipeline, you will want to configure the solr connection to not use
>>>>>>>>> the extracting update handler.  There's a checkbox on the Schema
>>>>>>>>> tab you need to uncheck for that.  But if you do that you can see
>>>>>>>>> what is being sent to Solr pretty exactly; it all gets logged in
>>>>>>>>> the INFO messages dumped to solr log.  This should help you figure
>>>>>>>>> out if the problem is your tika configuration or not.
>>>>>>>>>
>>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being
>>>>>>>>>> sent to my domain. I think this may be an issue with how I've set
>>>>>>>>>> up the TIKA Extractor Transformation or the field mapping. I think
>>>>>>>>>> the database where the records are supposed to be stored before
>>>>>>>>>> flushing to Amazon is storing empty content.
>>>>>>>>>>
>>>>>>>>>> I've tried to find documentation on how to set up the TIKA
>>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>>
>>>>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>>> +56 9 84265890
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>> +56 9 84265890
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>> Full Stack Developer - MC+A Chile
>>>>> +56 9 84265890
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juan Pablo Diaz-Vaz Varas,
>>> Full Stack Developer - MC+A Chile
>>> +56 9 84265890
>>>
>>
>>
>
>
> --
> Juan Pablo Diaz-Vaz Varas,
> Full Stack Developer - MC+A Chile
> +56 9 84265890
>
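
Note on the top message: Karl mentions replacing the Commons-IO
functionality with hand-written code. The thread does not say which
Commons-IO call was involved or what the eventual fix looked like, so the
following is only a minimal sketch of one plausible replacement, assuming
the task is to drain a java.io.Reader into a String using nothing but the
JDK. The class name ReaderDrain and the method readAll are hypothetical and
are not ManifoldCF code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of ManifoldCF or Commons-IO.
public class ReaderDrain {

    // Reads every character from the reader until end of stream.
    // Wraps IOException so callers need not declare a checked exception
    // (a design choice for this sketch only).
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            while (true) {
                int n = reader.read(buf, 0, buf.length);
                if (n == -1) {
                    break; // end of stream
                }
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // StringReader stands in for a JSON reader here (assumption).
        System.out.println(readAll(new StringReader("{\"id\":\"doc1\"}")));
    }
}
```

The loop relies only on the Reader.read(char[], int, int) contract, under
which -1 signals end of stream, so it works with any Reader that implements
that method correctly.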
