tika-dev mailing list archives

From "Tucker Barbour (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2875) Support Google Takeout MBOX format for GChat Messages
Date Wed, 15 May 2019 18:51:00 GMT
Tucker Barbour created TIKA-2875:

             Summary: Support Google Takeout MBOX format for GChat Messages
                 Key: TIKA-2875
                 URL: https://issues.apache.org/jira/browse/TIKA-2875
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.20
         Environment: java version "1.8.0_181"
                      Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
                      Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
            Reporter: Tucker Barbour
         Attachments: Sample.mbox

The [Google Takeout|https://takeout.google.com] tool allows a user to export Gmail and GChat
messages as an MBOX archive. Tika's content type detection correctly identifies this format as
MBOX. However, the provided MBOX parser does not seem to support the format of the `From`
header for GChat messages. I've attached an example chat to the ticket. In it, the
From header also includes a from address and the sent timestamp. As I understand it,
this is a valid From header format. I would expect the Tika MBOX parser to parse
the From header correctly and set the sent time to the value parsed from the From header,
as in the provided example.
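
As a rough illustration of the parsing described above, here is a minimal sketch of splitting such a From header into an address and a sent time. It is not Tika's MBOX parser and does not use any Tika API. The class name FromLineSketch, the sample line, and the assumed timestamp layout (day name, month name, two-digit day, time, numeric offset, year) are all hypothetical; the attached sample may differ.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Tika: parses a From header of the
// assumed form "From <address> <timestamp>".
class FromLineSketch {

    // Simple holder for the two values carried by the From header.
    static final class FromLine {
        final String address;
        final OffsetDateTime sent;

        FromLine(String address, OffsetDateTime sent) {
            this.address = address;
            this.sent = sent;
        }
    }

    // "From", one space, a whitespace-free address, one space, the rest of the line.
    private static final Pattern FROM_LINE = Pattern.compile("^From (\\S+) (.+)$");

    // Assumed timestamp layout, e.g. "Wed May 15 18:51:00 +0000 2019".
    private static final DateTimeFormatter SENT_FORMAT =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    static FromLine parse(String line) {
        Matcher m = FROM_LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a From header: " + line);
        }
        // Throws DateTimeParseException if the timestamp does not match the assumed layout.
        OffsetDateTime sent = OffsetDateTime.parse(m.group(2).trim(), SENT_FORMAT);
        return new FromLine(m.group(1), sent);
    }

    public static void main(String[] args) {
        // Made-up sample line; the address and date are placeholders.
        FromLine parsed = parse("From 1234@xxx Wed May 15 18:51:00 +0000 2019");
        System.out.println(parsed.address + " sent at " + parsed.sent.toInstant());
    }
}
```

A parser could then use the parsed time as the sent time of the message, which is the behavior requested here.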

