nutch-dev mailing list archives

From Mateusz Zakarczemny <mateusz.zakarcze...@up2data.pl>
Subject Re: Get outlink URL's text context
Date Sun, 09 Mar 2014 20:12:40 GMT
Try writing a custom parser plugin. You could base it on
org.apache.nutch.parse.html.HtmlParser
or org.apache.nutch.parse.tika.TikaParser.
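
To give a rough idea of the logic such a plugin would need, here is a
minimal sketch in plain Java. It does not use any Nutch API: the class
OutlinkContext and the method contextFor are hypothetical names made up for
illustration, the regex-based anchor matching is a simplification (a real
parser plugin would walk the DOM it already builds), and sentence splitting
relies on the JDK's java.text.BreakIterator. It records where the link sits
in the tag-stripped text, then returns the sentence containing it plus one
sentence on each side.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch. */
public class OutlinkContext {
    // Simplified anchor matcher: <a ... href="URL" ...>anchor text</a>.
    // Assumption: quoted href values and non-nested anchors.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    /**
     * Returns the sentence containing the first link to {@code url}, plus the
     * sentence before and after it, or null if the URL is not linked.
     */
    public static String contextFor(String html, String url) {
        StringBuilder text = new StringBuilder();
        int linkOffset = -1; // position of the anchor text in the plain text
        int last = 0;
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            text.append(stripTags(html.substring(last, m.start())));
            if (linkOffset < 0 && m.group(1).equals(url)) {
                linkOffset = text.length();
            }
            text.append(stripTags(m.group(2)));
            last = m.end();
        }
        text.append(stripTags(html.substring(last)));
        if (linkOffset < 0) {
            return null;
        }

        // Split the plain text into sentences and find the one holding the link.
        List<String> sentences = new ArrayList<>();
        int hit = -1;
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text.toString());
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (hit < 0 && linkOffset >= start && linkOffset < end) {
                hit = sentences.size();
            }
            sentences.add(text.substring(start, end).trim());
        }
        if (sentences.isEmpty()) {
            return "";
        }
        if (hit < 0) {
            hit = sentences.size() - 1; // link sat at the very end of the text
        }
        int from = Math.max(0, hit - 1);
        int to = Math.min(sentences.size(), hit + 2);
        return String.join(" ", sentences.subList(from, to));
    }
}
```

Calling OutlinkContext.contextFor(html, "https://nutch.apache.org/downloads.html")
on the sample text from the question should give back the sentence with the
link together with its two neighbours. Inside Nutch you would apply the same
idea while the parser collects outlinks, rather than re-parsing the raw HTML
with a regex.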


2014-03-09 16:09 GMT+01:00 cephtahrioh <ramonjesusburgos@gmail.com>:

> Hi guys. Does anyone know an efficient way to extract the text context that
> wraps an outlink URL? For example, given this sample text containing an
> outlink:
>
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster. You can download Nutch here
> <https://nutch.apache.org/downloads.html>. For more information about
> Apache Nutch, please see the Nutch wiki.
>
> In this example, I would like to get the sentence containing the link, and
> a sentence before and after that sentence. Any way to do this efficiently?
> Any methods I can invoke to get something like the position of the link
> within a fetched content? Or even a part of the nutch code I can modify to
> do this? Thanks!
> ------------------------------
> View this message in context: Get outlink URL's text context
> <http://lucene.472066.n3.nabble.com/Get-outlink-URL-s-text-context-tp4122389.html>
> Sent from the Nutch - Dev mailing list archive
> <http://lucene.472066.n3.nabble.com/Nutch-Dev-f619766.html> at Nabble.com.
>
