manifoldcf-user mailing list archives

From Juan Pablo Diaz-Vaz <jpdiaz...@mcplusa.com>
Subject Re: Amazon CloudSearch Connector question
Date Mon, 08 Feb 2016 21:18:47 GMT
Thanks,

Don't know if it'll help, but removing the use of JSONObjectReader in
addOrReplaceDocumentWithException and posting to Amazon chunk-by-chunk,
instead of using the JSONArrayReader in flushDocuments, changed the error I
was getting from Amazon.

Maybe those objects are failing when converting the content to JSON.
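For reference, CloudSearch's documents/batch endpoint expects one complete JSON array of add/delete operations per request. Here's a minimal sketch of what a well-formed single-document batch looks like; the helper names below are illustrative only, not ManifoldCF or AWS SDK APIs:

```java
public class CloudSearchBatchSketch {
    // Minimal JSON string escaping for the sketch (real code should use a JSON library).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build one "add" operation in CloudSearch's document-batch (SDF) format.
    static String buildAddOperation(String id, String content) {
        return "{\"type\":\"add\",\"id\":\"" + escape(id)
             + "\",\"fields\":{\"content\":\"" + escape(content) + "\"}}";
    }

    public static void main(String[] args) {
        // The batch must be a complete JSON array of operations; a payload
        // truncated mid-array is the kind of input that makes the endpoint
        // report "Encountered unexpected end of file".
        String batch = "[" + buildAddOperation("doc1", "hello world") + "]";
        System.out.println(batch);
    }
}
```

If the connector streams an array that never gets its closing bracket (or posts an empty body), an error like the one below would be consistent with that.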

On Mon, Feb 8, 2016 at 6:04 PM, Karl Wright <daddywri@gmail.com> wrote:

> Ok, I'm debugging away, and I can confirm that no data is getting
> through.  I'll have to open a ticket and create a patch when I find the
> problem.
>
> Karl
>
>
> On Mon, Feb 8, 2016 at 3:15 PM, Juan Pablo Diaz-Vaz <jpdiazvaz@mcplusa.com
> > wrote:
>
>> Thank you very much.
>>
>> On Mon, Feb 8, 2016 at 5:13 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Ok, thanks, this is helpful -- it clearly sounds like Amazon is unhappy
>>> about the JSON format we are sending it.  The deprecation message is
>>> probably a strong clue.  I'll experiment here with logging document
>>> contents so that I can give you further advice.  Stay tuned.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 8, 2016 at 3:07 PM, Juan Pablo Diaz-Vaz <
>>> jpdiazvaz@mcplusa.com> wrote:
>>>
>>>> I'm actually not seeing anything on Amazon. The CloudSearch connector
>>>> fails when sending the request to amazon cloudsearch:
>>>>
>>>> AmazonCloudSearch: Error sending document chunk 0: '{"status": "error",
>>>> "errors": [{"message": "[*Deprecated*: Use the outer message field]
>>>> Encountered unexpected end of file"}], "adds": 0, "__type":
>>>> "#DocumentServiceException", "message": "{ [\"Encountered unexpected end of
>>>> file\"] }", "deletes": 0}'
>>>>
>>>> ERROR 2016-02-08 20:04:16,544 (Job notification thread) -
>>>>
>>>>
>>>>
>>>> On Mon, Feb 8, 2016 at 5:00 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> If you can possibly include a snippet of the JSON you are seeing on
>>>>> the Amazon end, that would be great.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 2:45 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> More likely this is a bug.
>>>>>>
>>>>>> I take it that it is the body string that is not coming out,
>>>>>> correct?  Do all the other JSON fields look reasonable?  Does the body
>>>>>> clause exist and is just empty, or is it not there at all?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 2:36 PM, Juan Pablo Diaz-Vaz <
>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> When running a copy of the job, but with SOLR as a target, I'm
>>>>>>> seeing the expected content being posted to SOLR, so it may not be
>>>>>>> an issue with TIKA. After adding some more logging to the CloudSearch
>>>>>>> connector, I think the data is getting lost just before being passed
>>>>>>> to the DocumentChunkManager, which inserts the empty records into
>>>>>>> the DB. Could it be that the JSONObjectReader doesn't like my data?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 3:48 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Juan,
>>>>>>>>
>>>>>>>> I'd try to reproduce as much of the pipeline as possible using a
>>>>>>>> solr output connection.  If you include the tika extractor in the
>>>>>>>> pipeline, you will want to configure the solr connection to not use
>>>>>>>> the extracting update handler.  There's a checkbox on the Schema tab
>>>>>>>> you need to uncheck for that.  But if you do that you can see pretty
>>>>>>>> exactly what is being sent to Solr; it all gets logged in the INFO
>>>>>>>> messages dumped to the solr log.  This should help you figure out
>>>>>>>> whether or not the problem is your tika configuration.
>>>>>>>>
>>>>>>>> Please give this a try and let me know what happens.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 1:28 PM, Juan Pablo Diaz-Vaz <
>>>>>>>> jpdiazvaz@mcplusa.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've successfully sent data to FileSystems and SOLR, but for
>>>>>>>>> Amazon CloudSearch I'm seeing that only empty messages are being
>>>>>>>>> sent to my domain. I think this may be an issue with how I've set
>>>>>>>>> up the TIKA Extractor Transformation or the field mapping. I think
>>>>>>>>> the Database where the records are supposed to be stored before
>>>>>>>>> flushing to Amazon is storing empty content.
>>>>>>>>>
>>>>>>>>> I've tried to find documentation on how to set up the TIKA
>>>>>>>>> Transformation, but I haven't been able to find any.
>>>>>>>>>
>>>>>>>>> If someone could provide an example of a job setup to send from a
>>>>>>>>> FileSystem to CloudSearch, that'd be great!
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>>>> +56 9 84265890
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Juan Pablo Diaz-Vaz Varas,
>>>>>>> Full Stack Developer - MC+A Chile
>>>>>>> +56 9 84265890
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Juan Pablo Diaz-Vaz Varas,
>>>> Full Stack Developer - MC+A Chile
>>>> +56 9 84265890
>>>>
>>>
>>>
>>
>>
>> --
>> Juan Pablo Diaz-Vaz Varas,
>> Full Stack Developer - MC+A Chile
>> +56 9 84265890
>>
>
>


-- 
Juan Pablo Diaz-Vaz Varas,
Full Stack Developer - MC+A Chile
+56 9 84265890
