[ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Frovarp updated DROIDS-109:
-----------------------------------
Affects Version/s: 0.0.2 (was: Graduating from the Incubator)
Fix Version/s: (was: Graduating from the Incubator)
> Several defects in robots exclusion protocol (robots.txt) implementation
> ------------------------------------------------------------------------
>
> Key: DROIDS-109
> URL: https://issues.apache.org/jira/browse/DROIDS-109
> Project: Droids
> Issue Type: Bug
> Components: core, norobots
> Affects Versions: 0.0.2
> Reporter: Fuad Efendi
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> 1. Googlebot and many other crawlers support rules that match the query part; Droids currently matches only against URI.getPath() (without the query part).
> 2. %2F represents an encoded "/" (slash) character inside a path; it should not be decoded before a rule is applied.
> 3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: in the method body, baseURI.getPath() already returns a decoded string, and then URLDecoder.decode(path, US_ASCII) is called on it again.
> 4. URLDecoder.decode(path, US_ASCII) uses the wrong charset; UTF-8 must be used!
> 5. The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL.
> 6. Wildcard characters should be recognized
> 7. Sitemaps
> 8. Crawl rate
> 9. The BOM sequence is not removed before processing robots.txt (http://unicode.org/faq/utf_bom.html; bytes: 0xEF 0xBB 0xBF).
> There are most probably many more defects (Nutch & BIXO haven't done it in full yet). I am working on it right now...
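Items 1 to 4 and 9 above can be illustrated with a minimal Java sketch. It is not the Droids code: the class RobotsPathSketch and its methods matchTarget, decodeOnce and stripBom are hypothetical names chosen for illustration. It assumes rules are matched against the raw (still encoded) path plus query, that percent-escapes are decoded exactly once as UTF-8, and that %2F is left encoded.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Droids.
final class RobotsPathSketch {

    /** Raw path plus "?query" (if present), so rules can match on the query part (item 1). */
    static String matchTarget(URI uri) {
        String path = uri.getRawPath();          // getPath() would already be decoded
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    /** Decodes percent-escapes once as UTF-8 and keeps %2F encoded (items 2 to 4). */
    static String decodeOnce(String raw) {
        byte[] in = raw.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            int hi = i + 2 < in.length ? hex(in[i + 1]) : -1;
            int lo = i + 2 < in.length ? hex(in[i + 2]) : -1;
            if (in[i] == '%' && hi >= 0 && lo >= 0) {
                int value = hi * 16 + lo;
                if (value == '/') {
                    out.write(in, i, 3);         // leave the encoded slash untouched
                } else {
                    out.write(value);
                }
                i += 2;                          // skip the two hex digits
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Drops a leading UTF-8 BOM (0xEF 0xBB 0xBF) from the robots.txt bytes (item 9). */
    static byte[] stripBom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }
}
```

With this sketch, decodeOnce(matchTarget(uri)) would give the string to compare against the rules; the directive paths read from robots.txt would presumably need the same single decoding so that both sides are compared in the same form.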
> Some references:
> http://nikitathespider.com/python/rerp/
> http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
> http://www.searchtools.com/robots/robots-txt.html
> http://en.wikipedia.org/wiki/Robots.txt
> Referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated, at the least...
> Proper reference: http://www.robotstxt.org/norobots-rfc.txt (1996).
> We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.
> *Update from Google:*
> http://code.google.com/web/controlcrawlindex/
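Items 5 and 6 of the description (longest matching directive, wildcard recognition) might look roughly like the following Java sketch. The class RobotsRuleSketch, its Rule type and isAllowed are hypothetical names, not Droids API. It assumes the wildcard syntax used by Googlebot ("*" for any character sequence, a trailing "$" for end of URL), measures rule length on the directive as written (no wildcard expansion), and, as a further assumption, lets Allow win when two matching rules have equal length.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Droids.
final class RobotsRuleSketch {

    /** One Allow/Disallow line; pattern is the directive path as written in robots.txt. */
    static final class Rule {
        final boolean allow;
        final String pattern;
        final Pattern regex;

        Rule(boolean allow, String pattern) {
            this.allow = allow;
            this.pattern = pattern;
            this.regex = compile(pattern);
        }
    }

    /** Translates "*" and a trailing "$" into a regex; everything else is matched literally. */
    static Pattern compile(String pattern) {
        boolean anchored = pattern.endsWith("$");
        String body = anchored ? pattern.substring(0, pattern.length() - 1) : pattern;
        String[] parts = body.split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(".*");                 // each "*" matches any sequence
            }
            sb.append(Pattern.quote(parts[i]));
        }
        if (!anchored) {
            sb.append(".*");                     // unanchored rules are prefix matches
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Applies the longest matching rule; a target with no matching rule is allowed. */
    static boolean isAllowed(List<Rule> rules, String target) {
        Rule best = null;
        for (Rule r : rules) {
            // An empty directive path (e.g. "Disallow:") restricts nothing.
            if (r.pattern.isEmpty() || !r.regex.matcher(target).matches()) {
                continue;
            }
            int len = r.pattern.length();
            if (best == null
                    || len > best.pattern.length()
                    || (len == best.pattern.length() && r.allow)) {
                best = r;
            }
        }
        return best == null || best.allow;
    }
}
```

For example, with "Disallow: /private/" and "Allow: /private/public/", the target "/private/public/page.html" matches both rules and the longer Allow directive is the one applied.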