lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SOLR-7114) SimplePostTool fails crawling lucene.apache.org due to missing <html> tag
Date Thu, 19 Feb 2015 09:40:11 GMT

     [ https://issues.apache.org/jira/browse/SOLR-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-7114:
------------------------------
    Description: 
A number of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the
history of this; was it intentional? I tried to fix it, but it's a bit confusing. (This is
a spin-off from SOLR-7107.)

Crawling lucene.apache.org with bin/post fails with 500 errors because Tika's auto-detection sees
{{<head>}} as the first tag and concludes that the document is XML :-)
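
To illustrate the symptom, here is a minimal Python sketch that reports the first element tag of a page after skipping whitespace, comments, and declarations. It is not Tika's detection logic; the helper name {{first_tag}} and the sample strings are made up for illustration. It only shows how a page that opens with {{<head>}} differs from one that opens with {{<html>}}.

```python
import re
from typing import Optional

# Things that may legitimately precede the root element: whitespace, a BOM,
# comments, declarations such as <!DOCTYPE ...>, and processing instructions.
_PREAMBLE = re.compile(r"(?:\s+|\ufeff|<!--.*?-->|<![^>]*>|<\?.*?\?>)", re.DOTALL)
_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)")


def first_tag(markup: str) -> Optional[str]:
    """Return the lower-cased name of the first element tag, or None."""
    pos = 0
    while True:
        m = _PREAMBLE.match(markup, pos)
        if not m:
            break
        pos = m.end()  # skip one preamble item and try again
    m = _TAG.match(markup, pos)
    return m.group(1).lower() if m else None


if __name__ == "__main__":
    # Hypothetical page bodies, not copies of the real CMS output.
    print(first_tag("<head><title>News</title></head><body></body>"))
    print(first_tag("<!DOCTYPE html>\n<html><head></head></html>"))
```

A page for which this returns anything other than {{html}} is a candidate for the detection problem described above.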

I *think* we're fine if all templates referred to from {{lib/path.pm}} have {{<html>}}
tags added, and if none of them include each other. Currently, {{core.html}} is both a top-level
page and also included from {{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}}
for some reason.
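
A rough way to find the affected templates could look like the sketch below. It assumes the templates are {{*.html}} files in a single local directory (the directory layout and the function names are assumptions, not taken from the CMS checkout). Note that it would also flag templates that are only ever included from other templates, so the result still needs a manual look.

```python
import re
from pathlib import Path

_OPEN = re.compile(r"<html[\s>]", re.IGNORECASE)
_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def lacks_html_tags(text):
    """True if the text is missing the opening or the closing html tag."""
    return not (_OPEN.search(text) and _CLOSE.search(text))


def templates_lacking_html(template_dir):
    """Return sorted names of *.html files in template_dir without both tags."""
    root = Path(template_dir)
    return sorted(
        p.name
        for p in root.glob("*.html")
        if lacks_html_tags(p.read_text(encoding="utf-8", errors="replace"))
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_templates.py <template directory>
    for name in templates_lacking_html(sys.argv[1]):
        print(name)
```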

To reproduce the crawl errors:
{code}
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
{code}

-We could in addition improve {{SimplePostTool}} to send a content-type hint to Tika.-
*Update: The tool already does this*
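
For reference, a content-type hint on an extract request could look roughly like the following Python sketch. This is not how {{SimplePostTool}} is implemented; it assumes the extracting handler is reachable at {{/update/extract}} under the collection, and the function name, base URL, and parameter choices are illustrative only. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, collection, page_url, body,
                          content_type="text/html"):
    """Build (but do not send) a POST carrying an explicit Content-Type."""
    # literal.id and commit are used here as example parameters only.
    params = urlencode({"literal.id": page_url, "commit": "true"})
    endpoint = f"{solr_base}/{collection}/update/extract?{params}"
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": content_type},  # the hint for type detection
        method="POST",
    )


if __name__ == "__main__":
    req = build_extract_request(
        "http://localhost:8983/solr",  # assumed local Solr base URL
        "gettingstarted",
        "http://lucene.apache.org/core/corenews.html",
        b"<head><title>News</title></head><body></body>",
    )
    print(req.get_method(), req.full_url)
```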

  was:
A number of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the
history of this; was it intentional? I tried to fix it, but it's a bit confusing. (This is
a spin-off from SOLR-7107.)

Crawling lucene.apache.org with bin/post fails with 500 errors because Tika's auto-detection sees
{{<head>}} as the first tag and concludes that the document is XML :-)

I *think* we're fine if all templates referred to from {{lib/path.pm}} have {{<html>}}
tags added, and if none of them include each other. Currently, {{core.html}} is both a top-level
page and also included from {{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}}
for some reason.

To reproduce the crawl errors:
{code}
bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
{code}

We could in addition improve {{SimplePostTool}} to send a content-type hint to Tika.


> SimplePostTool fails crawling lucene.apache.org due to missing <html> tag
> -------------------------------------------------------------------------
>
>                 Key: SOLR-7114
>                 URL: https://issues.apache.org/jira/browse/SOLR-7114
>             Project: Solr
>          Issue Type: Bug
>          Components: SimplePostTool
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: cms
>             Fix For: 5.1
>
>
> A number of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know
> the history of this; was it intentional? I tried to fix it, but it's a bit confusing. (This
> is a spin-off from SOLR-7107.)
> Crawling lucene.apache.org with bin/post fails with 500 errors because Tika's auto-detection
> sees {{<head>}} as the first tag and concludes that the document is XML :-)
> I *think* we're fine if all templates referred to from {{lib/path.pm}} have {{<html>}}
> tags added, and if none of them include each other. Currently, {{core.html}} is both a top-level
> page and also included from {{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}}
> for some reason.
> To reproduce the crawl errors:
> {code}
> bin/post -c gettingstarted http://lucene.apache.org/core/corenews.html
> {code}
> -We could in addition improve {{SimplePostTool}} to send a content-type hint to Tika.-
> *Update: The tool already does this*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

