lucene-dev mailing list archives

From Jan Høydahl (JIRA) <>
Subject [jira] [Created] (SOLR-7114) SimplePostTool fails crawling due to missing <html> tag
Date Sat, 14 Feb 2015 22:41:11 GMT
Jan Høydahl created SOLR-7114:

             Summary: SimplePostTool fails crawling due to missing <html> tag
                 Key: SOLR-7114
             Project: Solr
          Issue Type: Bug
          Components: SimplePostTool
            Reporter: Jan Høydahl
            Assignee: Jan Høydahl
            Priority: Minor
             Fix For: 5.1

A number of CMS pages lack the {{<html>}} and {{</html>}} tags. I don't know the
history of this; was it intentional? I tried to fix it, but it's a bit confusing. (This is
a spinoff from SOLR-7107.)

Crawling with bin/post fails with 500 errors because Tika's auto-detection sees
{{<head>}} as the first tag and concludes that the document is XML :-)

I *think* we're fine if all templates referred to from {{lib/}} have {{<html>}}
tags added, and none of them include each other. Currently, {{core.html}} is both a top-level page
and also included from {{mirrors-core-latest-redir.html}} and {{mirrors-core-redir.html}}
for some reason.
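
To find which templates are affected, a check along these lines might help. This is only a minimal sketch: {{HtmlRootCheck}} and {{hasHtmlRoot}} are hypothetical names, not part of Solr or the site build, and the regex test only looks for the presence of the tags, not for well-formed nesting.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Solr): reports whether a page has an html root element. */
final class HtmlRootCheck {
    // Opening <html> tag, with or without attributes, matched case-insensitively.
    private static final Pattern OPEN =
            Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    // Closing </html> tag, matched case-insensitively.
    private static final Pattern CLOSE =
            Pattern.compile("</html\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns true only if both an opening and a closing html tag occur in the page text. */
    static boolean hasHtmlRoot(String page) {
        return OPEN.matcher(page).find() && CLOSE.matcher(page).find();
    }
}
```

A page that starts directly with {{<head>}} would return false here, which is the case that trips up the crawl.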

To reproduce the crawl errors:
bin/post -c gettingstarted

In addition, we could improve {{SimplePostTool}} to send a content-type hint to Tika.
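
One possible source for such a hint is the {{Content-Type}} header that the web server returns for each crawled page. The sketch below only shows how to reduce that header to a bare media type. {{ContentTypeHint}} and {{fromHeader}} are hypothetical names, and it is an assumption that the crawler has the header available; how the hint would actually be passed on to Tika is not covered here.

```java
import java.util.Locale;

/** Hypothetical helper (not part of SimplePostTool): derives a media-type hint from an HTTP header. */
final class ContentTypeHint {
    /**
     * Strips parameters such as "; charset=UTF-8" from a Content-Type header value
     * and lowercases the result. Returns null when the header is absent or blank.
     */
    static String fromHeader(String header) {
        if (header == null) {
            return null;
        }
        int semi = header.indexOf(';');
        String type = (semi >= 0 ? header.substring(0, semi) : header)
                .trim()
                .toLowerCase(Locale.ROOT);
        return type.isEmpty() ? null : type;
    }
}
```

With a hint like {{text/html}} in hand, the detection would no longer have to guess from the first tag alone.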

This message was sent by Atlassian JIRA
