nutch-dev mailing list archives

From "Stephan Strittmatter (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-20) Extract urls from plain texts
Date Tue, 02 Aug 2005 12:05:35 GMT
     [ http://issues.apache.org/jira/browse/NUTCH-20?page=all ]

Stephan Strittmatter updated NUTCH-20:
--------------------------------------

    Description: 
Some parsers return no Outlinks, e.g. the Word parser.
This class extracts (absolute) hyperlinks from a plain String (content) and generates
outlinks from them.
This would be very useful for parsers that have no explicit extraction of hyperlinks.

Example:

Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");

This will return an array of Outlinks containing the single element "http://www.apache.org".
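
To illustrate the idea of pulling absolute URLs out of plain text, here is a minimal, self-contained sketch. It is not the attached OutlinkExtractor.java: the class name PlainTextUrlSketch, the method extractUrls, and the regular expression are assumptions made for illustration, and it returns plain Strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the OutlinkExtractor attached to this issue.
public class PlainTextUrlSketch {

    // Assumed pattern: an absolute http, https or ftp URL, running up to the
    // next whitespace, quote or angle bracket.
    private static final Pattern URL_PATTERN =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns every absolute URL found in the given text, in order of appearance.
    public static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            // Drop sentence punctuation that happens to follow the URL.
            String url = matcher.group().replaceAll("[.,;:!?)]+$", "");
            // Skip matches that are nothing but a scheme, e.g. "http://".
            if (!url.endsWith("://")) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "Nutch is located at http://www.apache.org and ...";
        System.out.println(extractUrls(text));
    }
}
```

A parser without its own link handling could run its extracted text through a helper of this kind and wrap each result in an Outlink. A real implementation would probably need a stricter pattern and a cap on very long inputs.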

----
transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
submitted by: Stephan Strittmatter

  was:
transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
submitted  by:
Stephan Strittmatter

Some parsers have no Outlinks returned. E.g. the
Word-Parser.


    Environment: 

>  Extract urls from plain texts
> ------------------------------
>
>          Key: NUTCH-20
>          URL: http://issues.apache.org/jira/browse/NUTCH-20
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>  Attachments: OutlinkExtractor.java, OutlinkExtractor.java, OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt
>
> Some parsers return no Outlinks, e.g. the Word parser.
> This class extracts (absolute) hyperlinks from a plain String (content) and generates outlinks from them.
> This would be very useful for parsers that have no explicit extraction of hyperlinks.
> Example:
> Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
> This will return an array of Outlinks containing the single element "http://www.apache.org".
> ----
> transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
> submitted by: Stephan Strittmatter

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

