manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: extract email attachment
Date Tue, 07 Feb 2017 23:10:55 GMT
Hi Cihad,

You need to set an attachment URL template for the attachments to be
crawled.  Open your email connection and click the "URL" tab, and you will
see the new field there.

Karl
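
For readers following the quoted thread below: the if/else on attachmentIndex and the earlier NumberFormatException both come down to how a document identifier is split into a message part and an optional attachment number. The following is only a minimal sketch of that idea, not the connector's actual code. The class name, the parse method, and the identifier layout ("folder:messageId", with an optional ":N" suffix for the N-th attachment) are all assumptions made for illustration.

```java
// Hypothetical sketch; not taken from EmailConnector.java.
class AttachmentIdSketch {

    /** Result of splitting an identifier; attachmentIndex is null for the email itself. */
    static final class ParsedId {
        final String messagePart;
        final Integer attachmentIndex;

        ParsedId(String messagePart, Integer attachmentIndex) {
            this.messagePart = messagePart;
            this.attachmentIndex = attachmentIndex;
        }
    }

    /**
     * Assumed layout: "folder:messageId" for an email and
     * "folder:messageId:N" for its N-th attachment.
     */
    static ParsedId parse(String documentIdentifier) {
        int lastColon = documentIdentifier.lastIndexOf(':');
        if (lastColon >= 0) {
            String suffix = documentIdentifier.substring(lastColon + 1);
            // Only treat the suffix as an attachment number if it is purely
            // numeric; Integer.parseInt on anything else would throw
            // NumberFormatException.
            if (isShortDecimal(suffix)) {
                return new ParsedId(documentIdentifier.substring(0, lastColon),
                                    Integer.valueOf(suffix));
            }
        }
        return new ParsedId(documentIdentifier, null);
    }

    /** True for 1 to 9 ASCII digits, which always fit in an int. */
    private static boolean isShortDecimal(String s) {
        if (s.isEmpty() || s.length() > 9) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ParsedId email = parse("myFolder/test:<id@mail.example.com>");
        ParsedId attachment = parse("myFolder/test:<id@mail.example.com>:0");
        System.out.println(email.attachmentIndex == null ? "running if block" : "running else block");
        System.out.println(attachment.attachmentIndex == null ? "running if block" : "running else block");
    }
}
```

Under these assumptions, an identifier without a numeric suffix always takes the "email" branch, which would match a log that shows only "running if block" when no attachment documents have been queued.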


On Tue, Feb 7, 2017 at 6:07 PM, Cihad Guzel <cguzelg@gmail.com> wrote:

> Hi Karl,
>
> Shouldn't the 'else' branch be processed when the email has an
> attachment?
> Although the email has an attachment, only the first branch was processed.
> Also, I don't see the attachment's content in the Solr index.
>
> I edited the code for testing as follows:
>
>         if (attachmentIndex == null) {
>           // It's an email
>           System.out.println("running if block");
> ...
>         } else {
>           System.out.println("running else block");
>           // It's an attachment
>           attachmentNumber = attachmentIndex;
> ...
>         }
>
> Then I ran my job. Processing ran 3 times. The log looks like this:
>
> ...
> running if block
> running if block
> running if block
> ...
>
>
> The Solr response:
>
> {
>         "subject":["pdf test page"],
>         "from":["Cihad Guzel <cguzelg@gmail.com>"],
>         "id":"http://sampleserver/%C4%B0%C5%9F%2FmyFolder%2Ftest?id=%3CCADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw%40mail.gmail.com%3E",
>         "date":["Tue Feb 07 20:37:35 MSK 2017"],
>         "mimetype":["",
>           ""],
>         "created_date":"2017-02-07T17:37:35.000Z",
>         "indexed_date":"2017-02-07T21:18:05.382Z",
>         "to":["Cihad Guzel <cguzelg@gmail.com>"],
>         "modified_date":"2017-02-07T17:37:35.000Z",
>         "encoding":["",
>           ""],
>         "mime_type":"text/plain",
>         "stream_size":["null"],
>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>           "org.apache.tika.parser.txt.TXTParser"],
>         "stream_content_type":["text/plain"],
>         "content_encoding":["windows-1252"],
>         "content_type":["text/plain; charset=windows-1252"],
>         "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n  --
> 94eb2c1910841bc55f0547f43443\r\nContent-Type: multipart/alternative;
> boundary=94eb2c1910841bc5530547f43441\r\n\r\n--
> 94eb2c1910841bc5530547f43441\r\nContent-Type: text/plain;
> charset=UTF-8\r\n\r\nthis is test mail for mfc.\r\n\r\n--
> 94eb2c1910841bc5530547f43441\r\nContent-Type: text/html;
> charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\
> nJVBERi0xLjYNJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9... ",
>         "language":"en",
>         "_version_":1558710621053124608}]
>   }
>
>
>
> 2017-02-08 1:17 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>
>> Here's the full code for this class:
>>
>> https://svn.apache.org/repos/asf/manifoldcf/trunk/connectors/email/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/email/EmailConnector.java
>>
>> Karl
>>
>>
>> On Tue, Feb 7, 2017 at 5:14 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Cihad,
>>>
>>> The variable attachmentIndex is *supposed* to be null except when an
>>> attachment is being processed.  The code should look like this:
>>>
>>>         if (attachmentIndex == null) {
>>>           // It's an email
>>> ...
>>>         } else {
>>>           // It's an attachment
>>>           attachmentNumber = attachmentIndex;
>>> ...
>>>         }
>>>
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 7, 2017 at 4:43 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I added a log line for testing. It looks like attachmentIndex is null.
>>>>
>>>> 2017-02-08 0:11 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>
>>>>> I attached a second patch (to apply on top of the first patch).
>>>>> Please let me know if that fixes the issue.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 7, 2017 at 3:59 PM, Cihad Guzel <cguzelg@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I get the following error:
>>>>>>
>>>>>> FATAL 2017-02-07 23:56:09,483 (Worker thread '29') - Error tossed: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>> java.lang.NumberFormatException: For input string: "myFolder/test:<CADNgPDgSXHeWo0GDnUL6S2sogUsXUa9mx2WxOT23Wi37Hog5Gw@mail.gmail.com>"
>>>>>>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>>         at java.lang.Integer.parseInt(Integer.java:580)
>>>>>>         at java.lang.Integer.parseInt(Integer.java:615)
>>>>>>         at org.apache.manifoldcf.crawler.connectors.email.EmailConnector.processDocuments(EmailConnector.java:705)
>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>>>>>
>>>>>>
>>>>>> 2017-02-07 22:50 GMT+03:00 Cihad Guzel <cguzelg@gmail.com>:
>>>>>>
>>>>>>> Thanks Karl,
>>>>>>>
>>>>>>> I will try it.
>>>>>>>
>>>>>>> Regards
>>>>>>> Cihad Guzel
>>>>>>>
>>>>>>> 2017-02-07 22:36 GMT+03:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>
>>>>>>>> I've created a ticket and attached a patch to it.
>>>>>>>> CONNECTORS-1375.  Please let me know if it works for you;
>>>>>>>> if not, I'll fix what doesn't work.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 7, 2017 at 1:19 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Correction: the only metadata attribute we set is the
>>>>>>>>> attachment(s) mimetype (as a multivalued field) -- this
>>>>>>>>> doesn't currently include the attachment data.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 7, 2017 at 1:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Cihad,
>>>>>>>>>>
>>>>>>>>>> The email connector is providing the attachment data unextracted
>>>>>>>>>> to the output connector as metadata attribute data.  There are no
>>>>>>>>>> transformation connectors that look at this metadata.  Solr cell also
>>>>>>>>>> probably does not handle binary in random metadata attributes the proper
>>>>>>>>>> way.
>>>>>>>>>>
>>>>>>>>>> The connector's attachment code therefore seems to be designed
>>>>>>>>>> only to deal with textual attachments.  The right solution is to have
>>>>>>>>>> individual IDs for each attachment.  But that would also require there to
>>>>>>>>>> be a URL we could construct for each attachment.  We could provide an
>>>>>>>>>> additional URI template for attachments, but I'd wonder if your system has
>>>>>>>>>> the ability to serve attachments by their own URLs.  Please let me know if
>>>>>>>>>> this would work and if so I can create a ticket and work on making these
>>>>>>>>>> changes.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 7, 2017 at 12:56 PM, Cihad Guzel <cguzelg@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am trying the email connector with Gmail. I attached the file [1]
>>>>>>>>>>> to a new email and sent it to my test email address.
>>>>>>>>>>>
>>>>>>>>>>> My mail body is: "this is test mail for mfc"
>>>>>>>>>>>
>>>>>>>>>>> Then I ran my email job and the email was indexed to Solr
>>>>>>>>>>> successfully. But Solr's content field does not contain my
>>>>>>>>>>> attachment's content. The Solr content field looks like this:
>>>>>>>>>>>
>>>>>>>>>>> "content":" \n \n  \n  \n  \n  \n  \n  \n  \n \n
>>>>>>>>>>>  --94eb2c1910841bc55f0547f43443\r\nContent-Type:
>>>>>>>>>>> multipart/alternative; boundary=94eb2c1910841bc5530547f43441\r\n\r\n
>>>>>>>>>>> --94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>> text/plain; charset=UTF-8\r\n\r\nthis is test mail for
>>>>>>>>>>> mfc.\r\n\r\n--94eb2c1910841bc5530547f43441\r\nContent-Type:
>>>>>>>>>>> text/html; charset=UTF-8\r\n\r\n<div dir=\"ltr\">this is test mail for
>>>>>>>>>>> mfc.\r\n</div>\r\n\r\n--94eb2c1910841bc5530547f43441--\r\n--
>>>>>>>>>>> 94eb2c1910841bc55f0547f43443\r\nContent-Type: application/pdf;
>>>>>>>>>>> name=\"pdf-test.pdf\"\r\nContent-Disposition: attachment;
>>>>>>>>>>> filename=\"pdf-test.pdf\"\r\nContent-Transfer-Encoding:
>>>>>>>>>>> base64\r\nX-Attachment-Id: f_iyvt78qa0\r\n\r\nJVBERi0xLjY
>>>>>>>>>>> NJeLjz9MNCjM3IDAgb2JqIDw8L0xpbmVhcml6ZWQgMS9MIDIwNTk3L08gNDA
>>>>>>>>>>> vRSAx\r\nNDExNS9OIDEvVCAxOTc5NS9IIFsgMTAwNSAyMTVdPj4NZW5kb2J
>>>>>>>>>>> qDSAgICAgICAgICAgICAgICAg\r\nDQp4cmVmDQozNyAzNA0KMDAwMDAwMDA
>>>>>>>>>>> xNiAwMDAwMCBuDQowMDAwMDAxMzg2IDAwMDAwIG4NCjAw\r\nMDAwMDE1MjIgMDAwM
>>>>>>>>>>> ..."
>>>>>>>>>>>
>>>>>>>>>>> Does the MFC email connector know that the attachment's file
>>>>>>>>>>> type is PDF? Doesn't it extract the contents?
>>>>>>>>>>>
>>>>>>>>>>> [1] http://www.orimi.com/pdf-test.pdf
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>> Cihad Güzel
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Cihad Güzel
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Cihad Güzel
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Cihad Güzel
>>>>
>>>
>>>
>>
>
>
> --
> Thanks
> Cihad Güzel
>
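
A note on the quoted Solr "content" field: the base64 run at its end begins with JVBERi0x, which is what a PDF header looks like once base64-encoded. That suggests the attachment reached Solr as undecoded MIME text rather than as extracted PDF text. A minimal sketch to check this, using only the first 12 characters of the quoted payload; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; not part of ManifoldCF.
class PdfHeaderCheck {

    /**
     * Decodes a short base64 prefix and checks for the "%PDF-" magic bytes.
     * Passing a prefix whose length is a multiple of 4 keeps decoding simple.
     */
    static boolean looksLikePdf(String base64Prefix) {
        byte[] decoded = Base64.getDecoder().decode(base64Prefix);
        String head = new String(decoded, StandardCharsets.ISO_8859_1);
        return head.startsWith("%PDF-");
    }

    public static void main(String[] args) {
        // First 12 characters of the base64 payload quoted in the thread.
        System.out.println(looksLikePdf("JVBERi0xLjYN")); // prints "true"
    }
}
```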
