nutch-dev mailing list archives

From "Vangelis Karvounis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
Date Thu, 06 Mar 2014 14:46:46 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922575#comment-13922575 ]

Vangelis Karvounis commented on NUTCH-1478:
-------------------------------------------

Thanks for the answer, Talat!
Let's say we crawl the URL http://www.uefa.com/worldcup/video/videoid=2064600.html?autoplay=true.

The head of that page's source looks like this:
<!DOCTYPE html>
<html lang="en">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# video: http://ogp.me/ns/video# ">
<title>Veloso's World Cup dream for Portugal - FIFA World Cup - Video - UEFA.com</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="&quot;It will be a unique and unforgettable event,&quot; Portugal's Miguel Veloso told UEFA.com as the FIFA World Cup in Brazil nears, but he knows they have been handed a tough group." />
<meta name="keywords" content="velosos,world,cup,dream,portugal,Miguel Veloso,Portugal,Ukraine,Dynamo Kyiv" />
<meta name="author" content="uefa.com" />
<meta property="og:type" content="video.other" />
<meta property="og:title" content="The official website for European football – UEFA.com" />
<meta property="og:url" content="http://www.uefa.com/worldcup/video/videoid=2064600.html" />
<meta property="og:image" content="http://www.uefa.com/MultimediaFiles/Photo/competitions/General/02/06/23/87/2062387_s2.jpg " />
<meta property="og:description" content="&quot;It will be a unique and unforgettable event,&quot; Portugal's Miguel Veloso told UEFA.com as the FIFA World Cup in Brazil nears, but he knows they have been handed a tough group." />
<meta property="og:site_name" content="UEFA.com" />
<meta property="video:release_date" content="2014-03-04T9:00Z" />
<meta property="video:tag" content="velosos" />
<meta property="video:tag" content="world" />
<meta property="video:tag" content="cup" />
<meta property="video:tag" content="dream" />
<meta property="video:tag" content="portugal" />
<meta property="video:tag" content="Miguel Veloso" />
<meta property="video:tag" content="Portugal" />
<meta property="video:tag" content="Ukraine" />
<meta property="video:tag" content="Dynamo Kyiv" />
<meta name="thumb" content="/multimediafiles/photo/competitions/general/02/06/23/87/2062387_s5.jpg" />
<meta name="date" content="Tuesday 4 March 2014" />
<meta name="isodate" content="2014-03-04" />
<meta name="phototitle" content="Veluso" />
<link rel="canonical" href="http://www.uefa.com/worldcup/video/videoid=2064600.html" />
<link rel="image_src" href="http://www.uefa.com/multimediafiles/photo/competitions/general/02/06/23/87/2062387_s5.jpg"> </link>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<script type="text/javascript">

I am interested in extracting the value of <meta property="og:image" content="http://www.uefa.com/MultimediaFiles/Photo/competitions/General/02/06/23/87/2062387_s2.jpg " /> and/or the values of the repeated tags such as <meta property="video:tag" content="cup" />. Note that these tags are identified by a "property" attribute, not by "name".

Do you think the parser can achieve this, or do we need to implement something else?
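
To make the desired result concrete, here is a minimal sketch in Python (standard library only). It is not Nutch code and says nothing about what the parse-metatags plugin actually does; the names MetaPropertyCollector and extract_meta are hypothetical, and the sample markup is a shortened copy of the head shown above. It only illustrates the two points that matter here: the tags are keyed by "property" rather than "name", and "video:tag" occurs several times, so every value has to be kept.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaPropertyCollector(HTMLParser):
    """Collect the content of selected <meta> tags, keeping repeated keys."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {w.lower() for w in wanted}
        self.values = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph tags use "property"; classic meta tags use "name".
        key = (attr.get("property") or attr.get("name") or "").lower()
        content = attr.get("content")
        if key in self.wanted and content is not None:
            # The og:image value on the sample page carries trailing whitespace.
            self.values[key].append(content.strip())


def extract_meta(html, wanted):
    """Return {key: [values...]} for the requested meta keys (hypothetical helper)."""
    collector = MetaPropertyCollector(wanted)
    collector.feed(html)
    collector.close()
    return dict(collector.values)


# Shortened copy of the page head quoted in this message.
SAMPLE = """<html><head>
<meta name="author" content="uefa.com" />
<meta property="og:image" content="http://www.uefa.com/MultimediaFiles/Photo/competitions/General/02/06/23/87/2062387_s2.jpg " />
<meta property="video:tag" content="world" />
<meta property="video:tag" content="cup" />
</head></html>"""

result = extract_meta(SAMPLE, ["og:image", "video:tag"])
```

Here result maps "og:image" to a one-element list and "video:tag" to a list with one entry per tag, which is the multi-valued shape I would like to end up with in the index.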

Thank you in advance!

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>
>         Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch,
>                      NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. They
> take multiple values of the same tag and index them in Solr, as I patched before
> (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with
> one change: there is no need to put the 'metatag' keyword before metatag names. For example,
> my configuration looks like this
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include a JUnit test. I will upload a
> new version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed in
> 'index.parse.md' in nutch-site.xml are also created in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
