manifoldcf-user mailing list archives

From bull1...@gmx.de
Subject URISyntaxException
Date Thu, 17 Feb 2011 12:27:20 GMT
Hi all,

I just checked out the newest version of MCF, and now I am getting this
error while crawling certain pages. What can I do about it?

Error Message:

java.net.URISyntaxException: Illegal character in path at index 73: /link/to/the/page/alan smithee.xls
        at java.net.URI$Parser.fail(URI.java:2809)
        at java.net.URI$Parser.checkChars(URI.java:2982)
        at java.net.URI$Parser.parseHierarchical(URI.java:3066)
        at java.net.URI$Parser.parse(URI.java:3024)
        at java.net.URI.<init>(URI.java:578)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.makeDocumentIdentifier(WebcrawlerConnector.java:4774)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:5586)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:5701)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:48)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:223)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:6492)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:5553)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1132)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
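
The top frames of the trace are plain java.net.URI, so the failure can
probably be reproduced outside MCF. Below is a minimal sketch, assuming the
only problem is the unescaped space in the link. The class UriSpaceDemo and
the helpers parses and quotePath are made up for illustration; this is not
how WebcrawlerConnector builds its document identifiers.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of ManifoldCF.
class UriSpaceDemo {

    // Returns true if the single-argument URI constructor accepts the string.
    // That constructor expects an already-escaped URI and rejects a raw space.
    static boolean parses(String raw) {
        try {
            new URI(raw);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a URI from an unescaped path. The multi-argument constructors
    // percent-encode characters that are illegal in a path, such as a space.
    static String quotePath(String rawPath) throws URISyntaxException {
        return new URI(null, null, rawPath, null).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String raw = "/link/to/the/page/alan smithee.xls";
        System.out.println(parses(raw));     // prints: false
        System.out.println(quotePath(raw));  // prints: /link/to/the/page/alan%20smithee.xls
    }
}
```

If this matches what happens in the crawler, the link in the crawled page
contains a literal space instead of %20, and the single-argument URI
constructor refuses to parse it.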


How I set it up (I hope this helps):

        
    - installed PostgreSQL 8.3.11-1
    - checked out the project into the MCF folder
    - added jcifs1.2.15.jar at /connectors/jcifs/jcifs and renamed it to jcifs.jar
    - built the project with ant at /mcf
    - copied the content of "dist" to c:/documents and settings/myUserAccount/lcf
    - added the properties.xml and the logging.ini there
    - created a synchronization folder
    - set MCF_HOME to the folder above
    - executed in /processes/scripts these commands:
        org.apache.manifoldcf.core.DBCreate postgres p0sTgres
        org.apache.manifoldcf.agents.Install
        org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
        org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "SOLR Connector"
        org.apache.manifoldcf.authorities.RegisterAuthority org.apache.manifoldcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority "Active Directory Authority"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector "Database Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector "Windows Share Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.rss.RSSConnector "RSS Connector"
        org.apache.manifoldcf.crawler.Register org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Connector"
    - and copied the content of /lcf/web/war to my /tomcat/webapps
                

Thanks for your help, and best regards,
Julian    