tika-dev mailing list archives

From "Chris Wilson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-912) Response charset encoding not declared, and depends on host OS (Windows/Linux)
Date Fri, 04 May 2012 20:58:48 GMT

     [ https://issues.apache.org/jira/browse/TIKA-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Wilson updated TIKA-912:
------------------------------

    Attachment: TikaResource-utf8-response.patch
    
> Response charset encoding not declared, and depends on host OS (Windows/Linux)
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-912
>                 URL: https://issues.apache.org/jira/browse/TIKA-912
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.1
>         Environment: java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b05)
> Java HotSpot(TM) Client VM (build 20.6-b01, mixed mode, sharing)
>            Reporter: Chris Wilson
>              Labels: newbie, patch
>         Attachments: TikaResource-utf8-response.patch
>
>
> When the response from the /tika servlet contains non-ASCII characters, Tika doesn't
> declare which encoding it is using, and the encoding differs depending on which OS the
> server is running on.
> This is a server running on Tomcat on Linux:
> {code}
> chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T documents/fixtures/smartquote-bullet.docx http://localhost:8080/tika/tika | hexdump -C
> 00000000  48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
> 00000010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinue....HTTP/1.|
> 00000020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
> 00000030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
> 00000040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
> 00000050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
> 00000060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
> 00000070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
> 00000080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
> 00000090  32 20 31 39 3a 34 30 3a  35 34 20 47 4d 54 0d 0a  |2 19:40:54 GMT..|
> 000000a0  0d 0a e2 80 99 0a e2 80  a2 09 0a                 |...........|
> 000000ab
> {code}
> And this is a server running on Tomcat on Windows:
> {code}
> chris@lap-x201:~/projects/atamis-intranet/django/intranet$ curl -i -T documents/fixtures/smartquote-bullet.docx http://localhost:9080/tika/tika | hexdump -C
> 00000000  48 54 54 50 2f 31 2e 31  20 31 30 30 20 43 6f 6e  |HTTP/1.1 100 Con|
> 00000010  74 69 6e 75 65 0d 0a 0d  0a 48 54 54 50 2f 31 2e  |tinue....HTTP/1.|
> 00000020  31 20 32 30 30 20 4f 4b  0d 0a 53 65 72 76 65 72  |1 200 OK..Server|
> 00000030  3a 20 41 70 61 63 68 65  2d 43 6f 79 6f 74 65 2f  |: Apache-Coyote/|
> 00000040  31 2e 31 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |1.1..Content-Typ|
> 00000050  65 3a 20 74 65 78 74 2f  70 6c 61 69 6e 0d 0a 54  |e: text/plain..T|
> 00000060  72 61 6e 73 66 65 72 2d  45 6e 63 6f 64 69 6e 67  |ransfer-Encoding|
> 00000070  3a 20 63 68 75 6e 6b 65  64 0d 0a 44 61 74 65 3a  |: chunked..Date:|
> 00000080  20 46 72 69 2c 20 30 34  20 4d 61 79 20 32 30 31  | Fri, 04 May 201|
> 00000090  32 20 31 39 3a 33 39 3a  35 32 20 47 4d 54 0d 0a  |2 19:39:52 GMT..|
> 000000a0  0d 0a 92 0a 95 09 0a                              |.......|
> 000000a7
> {code}
> As you can see, the response body (the last few bytes) is encoded differently. The Linux
> server encodes it as UTF-8 (e2 80 99 and e2 80 a2), while the Windows server uses a
> single-byte encoding, probably Windows-1252, where 0x92 is a right curly quote and 0x95
> is a bullet point.
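
The following is a minimal sketch, not part of the original report, that checks the reading of the two hexdumps above. It assumes the Windows body really is Windows-1252 (the report only says "probably"). The class name `EncodingDemo` and its members are hypothetical; the byte arrays are copied from the last line of each dump.

```java
import java.nio.charset.Charset;

final class EncodingDemo {
    // Body bytes from the Linux dump: e2 80 99 0a e2 80 a2 09 0a
    static final byte[] LINUX_BODY = {
        (byte) 0xe2, (byte) 0x80, (byte) 0x99, 0x0a,
        (byte) 0xe2, (byte) 0x80, (byte) 0xa2, 0x09, 0x0a
    };

    // Body bytes from the Windows dump: 92 0a 95 09 0a
    static final byte[] WINDOWS_BODY = {
        (byte) 0x92, 0x0a, (byte) 0x95, 0x09, 0x0a
    };

    // Decodes a response body with a charset the client has to guess,
    // because the Content-Type header does not declare one.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String fromLinux = decode(LINUX_BODY, "UTF-8");
        // Assumption: the Windows server used Windows-1252.
        String fromWindows = decode(WINDOWS_BODY, "windows-1252");

        // Both bodies carry the same text once the right charset is chosen.
        System.out.println(fromLinux.equals(fromWindows));

        // Guessing wrong corrupts the text: 0x92 alone is not valid UTF-8.
        String wrongGuess = decode(WINDOWS_BODY, "UTF-8");
        System.out.println(wrongGuess.equals(fromLinux));
    }
}
```

Under that assumption both bodies decode to the same five characters (U+2019, newline, U+2022, tab, newline), which is consistent with the two servers extracting the same text and differing only in how they serialize it.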
> A client can't know which encoding the server used, because the Content-Type is just
> text/plain with no charset parameter.
> Ideally I would like it to always use UTF-8, so that the client doesn't have to do extra
> work to decode it. The attached patch does that, and declares the charset in the response.
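
The patch itself is not reproduced in this message, so the following is only a sketch of the general technique the report describes: encode the body with an explicitly named charset and declare that charset in the Content-Type header. It assumes the OS dependence comes from a writer that falls back to the JVM's platform default encoding, which the message does not confirm. The class `Utf8ResponseSketch` and its members are hypothetical and use only the JDK, not Tika's or the servlet container's API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8ResponseSketch {
    // Header value that tells the client how to decode the body.
    static final String CONTENT_TYPE = "text/plain; charset=UTF-8";

    // Writes text with an explicit charset, so the bytes produced do not
    // depend on the platform default encoding of the host OS.
    static void writeBody(String text, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush();
    }

    // Convenience wrapper that returns the encoded body as a byte array.
    static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeBody(text, buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Same text as in the report: curly quote, newline, bullet, tab, newline.
        byte[] body = encode("\u2019\n\u2022\t\n");
        StringBuilder hex = new StringBuilder();
        for (byte b : body) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println("Content-Type: " + CONTENT_TYPE);
        System.out.println(hex.toString().trim());
    }
}
```

With the charset fixed in code, the body bytes match the Linux hexdump regardless of which OS runs the JVM, and a client can read the charset from the header instead of guessing.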

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
