nutch-dev mailing list archives

From Wenhao Xu <xuwenhao2...@gmail.com>
Subject Re: How to use Nutch index files on localdisk?
Date Sun, 13 Feb 2011 00:01:12 GMT
Hi Markus,
  Thanks. It works.
  But Nutch seems to crawl only a single level of the directory. My directory
structure is:

nutch-crawl
   |-- conf
   |    |-- many xml and text files
   |-- new

   Below is a snapshot of the crawl command's output. It stops fetching at
depth 1. I glanced at the protocol-file implementation: it reads the
directory/file and generates an HTML response whose links reflect the
directory structure. Therefore, after fetching, the HTML parser should be
called and the crawl db updated accordingly, and then the next round of
fetching should happen. However, here it only fetches the nutch-crawl
directory.
    Does anybody have any advice on this? I am a newbie with Nutch. Thanks
for the help.
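To make sure I understand the mechanism correctly, here is a small sketch of what I believe the protocol-file plugin does for a directory (this is my own illustration, not Nutch's actual Java code; the function name and output shape are assumptions):

```python
import html
import os

def directory_listing_html(path):
    """Sketch of a file-protocol directory response: an HTML page whose
    anchors mirror the directory entries, so the HTML parser can extract
    them as outlinks for the next crawl round.
    (Illustrative only -- not Nutch's actual protocol-file code.)"""
    links = "\n".join(
        '<a href="file://%s">%s</a>'
        % (html.escape(os.path.join(path, name)), html.escape(name))
        for name in sorted(os.listdir(path))
    )
    return "<html><body>\n%s\n</body></html>" % links
```

If that is roughly what happens, then the outlinks for conf/ and new/ should appear in the parsed segment, which makes it even stranger that generate selects 0 records afterwards.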

rootUrlDir = urls
threads = 10
depth = 3
indexer=lucene
topN = 50
Injector: starting at 2011-02-12 15:48:12
Injector: crawlDb: crawl_local_results/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-02-12 15:48:14, elapsed: 00:00:02
Generator: starting at 2011-02-12 15:48:14
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
...
finishing thread FetcherThread, activeThreads=1
fetching file:///Users/peter/storage/nutch-crawl/
...
Fetcher: finished at 2011-02-12 15:48:20, elapsed: 00:00:02
ParseSegment: starting at 2011-02-12 15:48:20
ParseSegment: segment: crawl_local_results/segments/20110212154817
ParseSegment: finished at 2011-02-12 15:48:21, elapsed: 00:00:01
CrawlDb update: starting at 2011-02-12 15:48:22
CrawlDb update: db: crawl_local_results/crawldb
....
Generator: starting at 2011-02-12 15:48:24
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
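
One thing I will double-check next (an assumption on my side, based on the FAQ page you linked, not something confirmed above): the default URL filters skip file: URLs, so the outlinks extracted from the listing page may be filtered out before the next generate round. The FAQ's suggested filter change looks roughly like this (file and rule names depend on the Nutch version):

```
# conf/crawl-urlfilter.txt -- remove file: from the default skip rule,
# i.e. change  -^(file|ftp|mailto):  to:
-^(ftp|mailto):
# and explicitly accept local file URLs:
+^file://
```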


Regards,
Wen.

On Thu, Feb 10, 2011 at 2:59 AM, Markus Jelsma
<markus.jelsma@openindex.io>wrote:

> Here's an old post on this one which probably doesn't work anymore:
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
>
> And here the info on the Wiki's FAQ page:
> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
>
>
> On Wednesday 09 February 2011 19:19:48 Wenhao Xu wrote:
> > Hi all,
> >    I am new to Nutch. I want to use  Nutch's MapReduce indexer to index
> > files on a local filesystem. And I want to customize the field adding to
> > the index. I searched the Internet for a while, but haven't found the
> > answer. Could you give me some advice? Thanks very much.
> >
> > Regards,
> > Wen
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
~_~
