tika-dev mailing list archives

From "Mohsen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2724) Tika does not recognize http 3xx error codes when passed fileUrl
Date Wed, 05 Sep 2018 18:15:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohsen updated TIKA-2724:
-------------------------
    Description: 
When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status code, Tika still
returns a 200 response.

*How to reproduce the issue*: Run Tika server with the {{-enableUnsecureFeatures}} and {{-enableFileUrl}}
options. Then send the server a {{fileUrl}} that returns a 3xx status code. Here is a sample
curl session:
{code:java}
$ curl -v google.com
* Rebuilt URL to: google.com/
* Trying 216.58.216.142...
* TCP_NODELAY set
* Connected to google.com (216.58.216.142) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Wed, 05 Sep 2018 15:31:51 GMT
< Expires: Fri, 05 Oct 2018 15:31:51 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact

$ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9998 (#0)
> PUT /rmeta/text HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.54.0
> Accept: */*
> fileUrl:http://google.com
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 05 Sep 2018 15:25:12 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
<
* Connection #0 to host localhost left intact
[{"Content-Encoding":"UTF-8","Content-Type":"text/html; charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History | Settings | Sign in\n\n\n
\n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage tools\n\n\n\n\nGoogle offered in: Fran�ais
\n\n\nAdvertising�ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy
- Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
 

I am using Tika server to pull files from S3 and parse them, but when S3 responds with a
redirect, Tika neither follows the redirect nor returns an error code.

See https://docs.aws.amazon.com/AmazonS3/latest/dev/Redirects.html
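
Until the server handles this case, a client could check the URL itself before passing it as {{fileUrl}}. The following is only a minimal sketch of that idea, not Tika code: the class and method names ({{FileUrlPreflight}}, {{finalUrl}}, {{resolveLocation}}) are hypothetical, and it assumes the remote host answers HEAD requests (a presigned S3 URL may only accept the method it was signed for, in which case GET would be needed).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical client-side helper; not part of Tika.
public class FileUrlPreflight {

    /** True for any 3xx status, such as the 301 in the curl session above. */
    public static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    /** Resolves a Location header value (absolute or relative) against the request URL. */
    public static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    /**
     * Follows at most maxHops redirects by hand and returns the final URL.
     * Throws IOException on a non-2xx final status, a 3xx without a
     * Location header, or too many redirects.
     */
    public static String finalUrl(String url, int maxHops) throws IOException {
        String current = url;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // inspect the 3xx ourselves
            conn.setRequestMethod("HEAD");          // assumption: HEAD is allowed
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();

            if (!isRedirect(status)) {
                if (status / 100 != 2) {
                    throw new IOException("HTTP " + status + " for " + current);
                }
                return current;
            }
            if (location == null) {
                throw new IOException("HTTP " + status + " without Location for " + current);
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting from " + url);
    }

    public static void main(String[] args) throws IOException {
        // Prints the URL that could then be sent in the fileUrl header.
        System.out.println(finalUrl(args[0], 5));
    }
}
```

The URL printed by {{main}} could then be sent in the {{fileUrl}} header in place of the original one, so that a redirect or an error status is seen by the client before Tika is involved.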

 

> Tika does not recognize http 3xx error codes when passed fileUrl
> ----------------------------------------------------------------
>
>                 Key: TIKA-2724
>                 URL: https://issues.apache.org/jira/browse/TIKA-2724
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.18
>            Reporter: Mohsen
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
