nutch-dev mailing list archives

From cephtahrioh <>
Subject Get outlink URL's text context
Date Sun, 09 Mar 2014 15:09:05 GMT
Hi guys. Does anyone know an efficient way to extract the text surrounding
an outlink URL? For example, given this sample text containing an outlink:
Nutch can run on a single machine, but gains a lot of its strength from
running in a Hadoop cluster. You can download Nutch  here
<>  For more information about
Apache Nutch, please see the Nutch wiki.
In this example, I would like to get the sentence containing the link, plus
the sentence before it and the sentence after it. Is there an efficient way
to do this? Is there a method I can call to get something like the position
of the link within the fetched content? Or is there a part of the Nutch code
I could modify to do this?
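
To make the goal concrete, here is a rough sketch of the kind of result I am after. It uses only the JDK's java.text.BreakIterator and no Nutch classes. It assumes the page's plain text and the character offset of the link within that text are already available; the class name LinkContext, the method context, and the window parameter are hypothetical names made up for illustration. In the usage example the offset is found with indexOf on the anchor text, which is only a stand-in and would pick the wrong spot if the anchor text occurs more than once.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
public class LinkContext {

    /**
     * Returns the sentence containing linkOffset plus up to `window`
     * sentences on each side, using the JDK sentence BreakIterator.
     */
    static String context(String text, int linkOffset, int window) {
        if (linkOffset < 0 || linkOffset >= text.length()) {
            throw new IllegalArgumentException("offset outside text: " + linkOffset);
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        // Collect sentence boundaries: 0, end of sentence 1, ..., text.length().
        List<Integer> bounds = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            bounds.add(b);
        }

        // Sentence i spans [bounds[i], bounds[i + 1]); find the one holding the link.
        int hit = 0;
        while (hit + 1 < bounds.size() - 1 && bounds.get(hit + 1) <= linkOffset) {
            hit++;
        }

        // Widen by `window` sentences on each side, clamped to the text.
        int from = Math.max(0, hit - window);
        int to = Math.min(bounds.size() - 1, hit + 1 + window);
        return text.substring(bounds.get(from), bounds.get(to)).trim();
    }

    public static void main(String[] args) {
        String text = "Nutch can run on a single machine. "
                + "You can download Nutch here. "
                + "For more information, see the wiki. "
                + "This one is far away.";
        // Assumption: the anchor text "here" is unique in the page text.
        int offset = text.indexOf("here");
        System.out.println(context(text, offset, 1));
    }
}
```

Collecting the boundaries once keeps the lookup linear in the text length. What I am still missing is how to get a reliable link offset out of Nutch's parsed content.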
