nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "GoogleSummerOfCode/GraphGeneratorTool" by OmkarReddy
Date Thu, 15 Jun 2017 12:26:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/GraphGeneratorTool" page has been changed by OmkarReddy:
https://wiki.apache.org/nutch/GoogleSummerOfCode/GraphGeneratorTool?action=diff&rev1=2&rev2=3

  <<TableOfContents>>
  
- ||'''Title :'''|||| GSoC 2016 Proposal ||
+ ||'''Title :'''|||| GSoC 2017 Proposal ||
  ||'''Issue :'''|||| [[https://issues.apache.org/jira/browse/NUTCH-2369|NUTCH-2369 - Graph Generator Tool for Nutch]]||
  ||'''Student :'''||||Omkar Reddy - omkarr [at] apache dot org||
  ||'''Mentor :'''||||Lewis John McGibbney||
  
  === Abstract ===
  
- Currently Apache Nutch[0] has the concept of a WebGraph[1] which that builds Web graphs,
performs a stable convergent link-analysis, and updates the crawldb with those scores. The
main purpose of building a new Graph Generator tool for Nutch is to create a substantiated
‘deep’ graph enabling true traversal, this could be a game changer for how Nutch Crawl
data is interpreted. This will involve storage of  the crawl data as RDF datasets in the form
of serialized n-quad statements. This graph can be used to execute queries on the webpages.
Graph generation will be achieved using the Apache Tinkerpop[2] ScriptInputFormat  and ScriptOutputFormat’s[3]
respectively. There are basically two scenarios to represent the graph as RDF datasets that
we discuss in this proposal below.
+ Currently Apache Nutch[0] has the concept of a WebGraph[1] that builds Web graphs, performs
a stable convergent link-analysis, and updates the crawldb with those scores. The main purpose
of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph
enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted.
This will involve storage of  the crawl data as RDF datasets in the form of serialized n-quad
statements. This graph can be used to execute queries on the webpages. Graph generation will
be achieved using the Apache Tinkerpop[2] ScriptInputFormat  and ScriptOutputFormat’s[3]
respectively. There are basically two scenarios to represent the graph as RDF datasets that
we discuss in this proposal below.
  
  === Background ===
  
