nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "NutchFileFormats" by LewisJohnMcgibbney
Date Fri, 25 Sep 2015 03:13:09 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchFileFormats" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchFileFormats?action=diff&rev1=7&rev2=8

  
  = CrawlDB =
  
- Content here is under construction.
- Content here is under construction.
+ == Description ==
+ 
+ Nutch maintains a CrawlDB containing [[http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/CrawlDatum.html|CrawlDatum]] objects.
+ 
+ == Directory Structure ==
+ {{{
+ .
+ ├── current
+ │   └── part-00000
+ │       ├── data
+ │       └── index
+ └── old
+     ├── part-00000
+     │   ├── data
+     │   └── index
+     ├── part-00001
+     │   ├── data
+     │   └── index
+     └── ...
+ }}}
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ file,key datatype,value datatype,codec
+ data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,
+ index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
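+ 
+ Both files are standard Hadoop MapFile components, so a partition can be inspected with plain Hadoop APIs. The following is only an illustrative sketch (the class name and the partition path are examples, and it assumes the Nutch and Hadoop jars are on the classpath); in day-to-day use the bundled CrawlDbReader (bin/nutch readdb) offers the same information via -stats and -dump.
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.io.MapFile;
+ import org.apache.hadoop.io.Text;
+ import org.apache.nutch.crawl.CrawlDatum;
+ 
+ public class CrawlDbPartDump {
+   public static void main(String[] args) throws Exception {
+     Configuration conf = new Configuration();
+     // Example path to a single CrawlDB partition (a MapFile directory).
+     Path part = new Path("crawl/crawldb/current/part-00000");
+     try (MapFile.Reader reader = new MapFile.Reader(part, conf)) {
+       Text url = new Text();
+       CrawlDatum datum = new CrawlDatum();
+       // next() scans the 'data' file sequentially; the 'index' file is
+       // only consulted for keyed lookups via get().
+       while (reader.next(url, datum)) {
+         System.out.println(url + "\t" + datum);
+       }
+     }
+   }
+ }
+ }}}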
  
  = LinkDB =
  
- Content here is under construction.
- Content here is under construction.
+ == Description ==
+ 
+ The LinkDB maintains an inverted link map, listing the incoming links for each URL.
+ 
+ == Directory Structure ==
+ 
+ {{{
+ .
+ └── current
+     └── part-00000
+         ├── data
+         └── index
+ }}}
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ file,key datatype,value datatype,codec
+ data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.Inlinks,org.apache.hadoop.io.compress.DefaultCodec
+ index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
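+ 
+ Because each partition is a MapFile, a single URL can also be looked up directly; the index file is what makes that seek cheap. Below is a minimal sketch (the class name, path and URL are examples; note that in a LinkDB with several partitions a given URL only lives in the partition its key hashes to). The bundled LinkDbReader (bin/nutch readlinkdb) provides the same dump and per-URL lookup from the command line.
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.io.MapFile;
+ import org.apache.hadoop.io.Text;
+ import org.apache.nutch.crawl.Inlinks;
+ 
+ public class LinkDbLookup {
+   public static void main(String[] args) throws Exception {
+     Configuration conf = new Configuration();
+     // Example path to a single LinkDB partition (a MapFile directory).
+     Path part = new Path("crawl/linkdb/current/part-00000");
+     try (MapFile.Reader reader = new MapFile.Reader(part, conf)) {
+       Text url = new Text("http://nutch.apache.org/");  // example key
+       Inlinks inlinks = new Inlinks();
+       // get() uses the 'index' file to seek near the key, then scans 'data'.
+       if (reader.get(url, inlinks) != null) {
+         System.out.println(url + " has " + inlinks.size() + " inlinks:\n" + inlinks);
+       } else {
+         System.out.println("no inlinks recorded for " + url);
+       }
+     }
+   }
+ }
+ }}}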
  
  = Segments =
  
+ == Description ==
+ 
- When Nutch crawls the web, each resulting segment has four subdirectories, each containing an ArrayFile (a MapFile having keys that are long integers):
+ When Nutch crawls the web, each resulting segment (segments contain the actual fetched content) has six subdirectories; most of them hold a MapFile (a data file plus an index) keyed by URL, as detailed in the following subsections.
  
+ == Directory Structure ==
- {{{#!CSV ,
- Subdirectory,Value datatype,Variable
- fetchlist,net.nutch.pagedb.FetchListEntry,fetchList
- fetcher,net.nutch.fetcher.FetcherOutput,fetcherDb
- fetcher_content,net.nutch.fetcher.FetcherContent,rawDb
- fetcher_text,net.nutch.fetcher.FetcherText,strippedDb
- }}}
  
- Crawling is performed by net.nutch.fetcher.Fetcher which starts a number of parallel FetcherThread?. Each thread gets an URL from the fetchList, checks robots.txt, retrieves the contents and appends the results to fetcherDb, rawDb, and strippedDb.
+ {{{
+ .
+ ├── content
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ ├── crawl_fetch
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ ├── crawl_generate
+ │   └── part-00000
+ ├── crawl_parse
+ │   ├── part-00000
+ │   └── part-00001
+ ├── parse_data
+ │   ├── part-00000
+ │   │   ├── data
+ │   │   └── index
+ │   └── part-...
+ └── parse_text
+     ├── part-00000
+     │   ├── data
+     │   └── index
+     └── part-...
+ }}}
+ 
+ == File Formats ==
+ 
+ {{{#!CSV ,
+ Subdirectory,file,key datatype,value datatype,codec
+ content,data,org.apache.hadoop.io.Text,org.apache.nutch.protocol.Content,org.apache.hadoop.io.compress.DefaultCodec
+ content,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_fetch,data,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_fetch,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_generate,part-00000,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ crawl_parse,part-00000,org.apache.hadoop.io.Text,org.apache.nutch.crawl.CrawlDatum,org.apache.hadoop.io.compress.DefaultCodec
+ parse_data,data,org.apache.hadoop.io.Text,org.apache.nutch.parse.ParseData,org.apache.hadoop.io.compress.DefaultCodec
+ parse_data,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ parse_text,data,org.apache.hadoop.io.Text,org.apache.nutch.parse.ParseText,org.apache.hadoop.io.compress.DefaultCodec
+ parse_text,index,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.compress.DefaultCodec
+ }}}
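+ 
+ The MapFile-backed subdirectories (content, crawl_fetch, parse_data, parse_text) can all be read in the same way as the CrawlDB sketch above; only the value class changes, as listed in the table. For example, a minimal sketch (class name and paths are examples) that prints the extracted plain text from parse_text; the bundled SegmentReader (bin/nutch readseg with -dump, -list or -get) exposes the same data from the command line.
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.io.MapFile;
+ import org.apache.hadoop.io.Text;
+ import org.apache.nutch.parse.ParseText;
+ 
+ public class SegmentTextDump {
+   public static void main(String[] args) throws Exception {
+     Configuration conf = new Configuration();
+     // Example path: the timestamped segment name is just a placeholder.
+     Path part = new Path("crawl/segments/20150925031309/parse_text/part-00000");
+     try (MapFile.Reader reader = new MapFile.Reader(part, conf)) {
+       Text url = new Text();
+       ParseText text = new ParseText();  // value class per the table above
+       while (reader.next(url, text)) {
+         System.out.println(url + "\t" + text.getText());
+       }
+     }
+   }
+ }
+ }}}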
  
  = Old File Format Documentation =
  
