nutch-dev mailing list archives

From cephtahrioh <ramonjesusbur...@gmail.com>
Subject Get outlink URL's text context
Date Sun, 09 Mar 2014 15:09:05 GMT
Hi guys. Does anyone know an efficient way to extract the text context that
surrounds an outlink URL? For example, given this sample text containing an
outlink:

Nutch can run on a single machine, but gains a lot of its strength from
running in a Hadoop cluster. You can download Nutch here
<https://nutch.apache.org/downloads.html>. For more information about
Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus
the sentence before it and the sentence after it. Is there an efficient way
to do this? Is there a method I can invoke to get something like the position
of the link within the fetched content? Or is there a part of the Nutch code
I can modify to do this?
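
To make the desired output concrete, here is a minimal sketch in plain Java
of the result I am after. It is not Nutch code and uses no Nutch API: it
assumes the page has already been reduced to plain text and that the
character offset of the link within that text is somehow known (which is
exactly the part I do not know how to get from Nutch). The class name
LinkContext and the method contextAround are made up for illustration, and
sentence splitting relies on the JDK's java.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class LinkContext {

    // Returns the sentence containing linkOffset, plus the sentence before
    // and the sentence after it (when they exist).
    // Assumption: text is plain text and linkOffset is the character
    // position of the link's anchor within that text.
    static String contextAround(String text, int linkOffset) {
        if (linkOffset < 0 || linkOffset >= text.length()) {
            throw new IllegalArgumentException("offset outside text: " + linkOffset);
        }

        // Collect every sentence boundary, including 0 and text.length().
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> bounds = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            bounds.add(b);
        }

        // Find i such that bounds[i] <= linkOffset < bounds[i + 1].
        int last = bounds.size() - 1;
        int i = 0;
        while (i + 1 < last && bounds.get(i + 1) <= linkOffset) {
            i++;
        }

        // Widen by one sentence on each side, clamped to the text.
        int start = bounds.get(Math.max(0, i - 1));
        int end = bounds.get(Math.min(last, i + 2));
        return text.substring(start, end).trim();
    }
}
```

Calling LinkContext.contextAround(text, text.indexOf("here")) on the sample
above would be the kind of call I have in mind, with the offset coming from
the parser rather than from indexOf.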
Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Get-outlink-URL-s-text-context-tp4122389.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.