nutch-dev mailing list archives

From "cao yuzhong" <caoyuzh...@hotmail.com>
Subject Can Nutch index over 90G html pages ?
Date Thu, 02 Jun 2005 08:12:29 GMT
Has anyone used Nutch to index over 90 GB of HTML pages (about 6 million pages)?
Is it possible? How much RAM does it require?

I tried to use Nutch to index 90 GB of HTML pages.
My PC has 1 GB of RAM, and the JVM heap parameter is set to -Xmx1000m.
This is the error I get:

Exception in thread "main" java.lang.OutOfMemoryError
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:194)
	at net.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:68)
	at net.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:24)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
	at java.io.DataInputStream.readFully(DataInputStream.java:176)
	at net.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:42)
	at net.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:76)
	at net.nutch.io.SequenceFile$Reader.next(SequenceFile.java:241)
	at net.nutch.io.MapFile$Reader.seek(MapFile.java:263)
	at net.nutch.io.MapFile$Reader.get(MapFile.java:306)
	at net.nutch.io.ArrayFile$Reader.get(ArrayFile.java:62)
	at net.nutch.segment.SegmentReader.get(SegmentReader.java:284)
	at net.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:110)
	at net.nutch.indexer.IndexSegment.main(IndexSegment.java:241)

Any suggestions?

Best regards!
cyz


