nutch-dev mailing list archives

From Mateusz Zakarczemny
Subject Re: Get outlink URL's text context
Date Sun, 09 Mar 2014 20:12:40 GMT
Try writing a custom parser plugin. You could base it on one of the
existing parsers, for example org.apache.nutch.parse.tika.TikaParser.
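
As a rough illustration of the logic such a plugin could run on the page
content, here is a minimal sketch in plain Java. It does not use any Nutch
API: the class OutlinkContext, the method contextFor and the example URL are
hypothetical names made up for this sketch. It assumes simple
<a href="...">text</a> markup and English sentence boundaries as detected by
java.text.BreakIterator. A real plugin would more likely walk the parsed DOM
than use a regular expression.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class OutlinkContext {

    // Matches a simple <a ... href="url">anchor text</a> element.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    /**
     * Returns the sentence containing the first link to url, plus one
     * sentence before and one after it, or "" if the link is not found.
     */
    public static String contextFor(String html, String url) {
        StringBuilder text = new StringBuilder();
        int linkPos = -1; // offset of the link inside the plain text
        int last = 0;
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            text.append(stripTags(html.substring(last, m.start())));
            if (linkPos < 0 && m.group(1).equals(url)) {
                linkPos = text.length();
            }
            text.append(stripTags(m.group(2)));
            last = m.end();
        }
        text.append(stripTags(html.substring(last)));
        if (linkPos < 0) {
            return "";
        }

        // Split the plain text into [start, end) sentence spans.
        String plain = text.toString();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(plain);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            spans.add(new int[] {start, end});
        }
        if (spans.isEmpty()) {
            return "";
        }

        // Find the sentence holding the link; default to the last one.
        int hit = spans.size() - 1;
        for (int i = 0; i < spans.size(); i++) {
            if (linkPos < spans.get(i)[1]) {
                hit = i;
                break;
            }
        }
        int from = spans.get(Math.max(0, hit - 1))[0];
        int to = spans.get(Math.min(spans.size() - 1, hit + 1))[1];
        return plain.substring(from, to).trim();
    }

    public static void main(String[] args) {
        String html = "Apache Nutch is a web crawler. "
            + "Nutch can run on a single machine. "
            + "You can download Nutch <a href=\"http://example.org/nutch\">here</a>. "
            + "For more information, please see the Nutch wiki. "
            + "Nutch is written in Java.";
        System.out.println(contextFor(html, "http://example.org/nutch"));
    }
}
```

Recording the link offset while the plain text is being built keeps the
position consistent with the text that the sentence splitter sees, which is
why the sketch does not search for the URL after stripping the tags.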

2014-03-09 16:09 GMT+01:00 cephtahrioh:

> Hi guys. Does anyone know an efficient way to extract the text context that
> wraps an outlink URL? For example, given this sample text containing an
> outlink:
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster. You can download Nutch here. For
> more information about Apache Nutch, please see the Nutch wiki.
> In this example, I would like to get the sentence containing the link, and
> a sentence before and after that sentence. Any way to do this efficiently?
> Any methods I can invoke to get something like the position of the link
> within a fetched content? Or even a part of the nutch code I can modify to
> do this? Thanks!
