lucene-general mailing list archives

From Shivprasad Shetty <Shivpras...@orioninc.com>
Subject Solr web crawler with recursive option
Date Thu, 11 Apr 2019 10:50:09 GMT


I am working with Solr for the first time and have the setup done. I created a core
from the command line and now want to crawl a third-party website.
With individual links I am able to crawl the page and index it into the core. This
was done using:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com

What I intend to do now is give a URL and use the recursive option (-Drecursive) to
let it crawl the entire site.
Note that I am pointing to a website that has around 125 pages. I am using the commands
below:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
and
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com

and I am getting the following error:


POSTed web resource http://www.example.com (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more
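
For reference, the same SAXParseException can be reproduced outside Solr whenever the JDK XML parser is handed input that does not begin with XML markup. Below is a minimal sketch of that, assuming (not confirmed) that whatever makeDom receives during the crawl is not XML. The class name PrologErrorDemo, the method parseError, and the sample strings are made up for illustration and are not part of post.jar.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class PrologErrorDemo {

    // Hypothetical helper: returns the parser's complaint, or null if the
    // text parsed cleanly as XML.
    public static String parseError(String text) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Same exception type as in the stack trace above.
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not the case being illustrated.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML: no error is reported.
        System.out.println(parseError("<response><str>ok</str></response>"));
        // Made-up non-XML body (for example JSON or plain text): the parser rejects it.
        System.out.println(parseError("{\"error\": \"not xml\"}"));
    }
}
```

If that assumption holds, the thing to check would be what is actually returned to post.jar for the crawled page, rather than the -Drecursive value itself.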



I have been trying to fix this for a couple of days and would be very grateful if anyone
could help me solve it.


Regards,
ShivprasadS


