nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1129) Any23 Nutch plugin
Date Wed, 10 Jan 2018 10:51:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320057#comment-16320057 ]

ASF GitHub Bot commented on NUTCH-1129:
---------------------------------------

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356566774
 
 
   @lewismc The requested changes are done. Please note that:
   
   * I had to extend the elastic http plugin to handle lists of Map objects, which it previously just stringified.
   * Any23 couldn't detect as many triples as your tests expected, so I had to lower the number. That is good enough for us for now; people can still expand the Any23 scope if they find out what the problem is.
   * Data is now indexed as follows (example after crawling `https://smartive.ch/jobs`):
   
   ```
             "structured_data": [
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"IE-edge,chrome=1\"@de",
                 "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
                 "short_key": "X-UA-Compatible"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
                 "key": "<http://vocab.sindice.net/any23#description>",
                 "short_key": "description"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
                 "key": "<http://vocab.sindice.net/any23#viewport>",
                 "short_key": "viewport"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"width=device-width,initial-scale=1\"@de",
                 "key": "<http://vocab.sindice.net/any23#viewport>",
                 "short_key": "viewport"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"ie=edge\"@de",
                 "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
                 "short_key": "x-ua-compatible"
               }
             ],
   ```
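
The shape of each `structured_data` entry above can be sketched in Java as follows. This is only an illustration, not the code from the pull request: the class `StructuredDataSketch` and the helpers `shortKey` and `entry` are hypothetical names, and the rule used for `short_key` (take whatever follows the last `#` or `/` of the predicate IRI) is an assumption inferred from the example output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StructuredDataSketch {

  /**
   * Hypothetical helper: derives a short key from an IRI written as
   * "<http://host/path#name>". The rule is an assumption, not the PR's logic.
   */
  public static String shortKey(String key) {
    String iri = key;
    // Strip the surrounding angle brackets if present.
    if (iri.startsWith("<") && iri.endsWith(">")) {
      iri = iri.substring(1, iri.length() - 1);
    }
    // Keep whatever follows the last '#' or '/', whichever comes later.
    int cut = Math.max(iri.lastIndexOf('#'), iri.lastIndexOf('/'));
    return cut >= 0 ? iri.substring(cut + 1) : iri;
  }

  /** Builds one entry with the four fields seen in the example output. */
  public static Map<String, String> entry(String node, String key, String value) {
    Map<String, String> e = new LinkedHashMap<>();
    e.put("node", node);
    e.put("value", value);
    e.put("key", key);
    e.put("short_key", shortKey(key));
    return e;
  }

  public static void main(String[] args) {
    // A list of maps, i.e. the kind of value an indexer has to pass through
    // as structured data instead of calling toString() on it.
    List<Map<String, String>> structuredData = new ArrayList<>();
    structuredData.add(entry(
        "<https://smartive.ch/jobs>",
        "<http://vocab.sindice.net/any23#viewport>",
        "\"width=device-width,initial-scale=1\"@de"));
    for (Map<String, String> e : structuredData) {
      System.out.println(e.get("short_key") + " -> " + e.get("value"));
    }
  }
}
```

Keeping every triple as its own map preserves the `node`, `key`, and `value` fields separately, which is why the index writer has to accept a list of maps rather than a single string.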
     

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.5
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to extract RDF data from HTTP and file resources.
> Although at the time of writing Any23 is not part of the ASF, the project is working towards
> integration into the Apache Incubator. Once the project proves its value, this would be an
> excellent addition to the Nutch 1.X codebase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
