nutch-dev mailing list archives

From Paul Baclace <...@baclace.net>
Subject Re: Random number generators for NDFS block numbers
Date Tue, 27 Sep 2005 05:34:59 GMT
Doug Cutting wrote:
> It just occurred to me that perhaps we could simply use sequential block 
> numbering.  All block ids are generated centrally on the namenode.  

I'm not sure what the advantage of sequential block numbers would be,
since long-period PRNG block numbering does not even need to store
its state; it can just pick a new starting place.
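
To make the "no stored state" point concrete, here is a minimal sketch,
not the NDFS code. The class RandomBlockIdSketch, the method allocate,
and the in-use set are all hypothetical names, and java.util.Random is
only a stand-in for whatever long-period PRNG would actually be used.
The retry on collision is my assumption about how an id already in use
would be handled.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: draw 64-bit block ids from a PRNG.
// Nothing about the generator has to be persisted; after a restart the
// caller can simply construct a freshly seeded Random.
class RandomBlockIdSketch {
    // Returns an id not present in inUse (assumed to hold the ids of
    // blocks the namenode already knows about).
    static long allocate(Random prng, Set<Long> inUse) {
        long id;
        do {
            id = prng.nextLong();          // candidate 64-bit id
        } while (inUse.contains(id));      // retry on the rare collision
        return id;
    }
}
```

With 64-bit ids the retry loop should almost never run more than once.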

Sequential block numbering does have the downside that picking a
datanode based on (BlockNum % DataNodeCount) would devolve into
round robin.  Any attempt to pass the sequence through a hash
ends up becoming a random number generator.
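
A small sketch of the round-robin effect, assuming the placement rule
is literally (BlockNum % DataNodeCount). The class and method names are
made up for illustration.

```java
// Hypothetical sketch of modulo-based datanode selection.
class PlacementSketch {
    // Maps a block number to a datanode index in [0, dataNodeCount).
    static int pickDataNode(long blockNum, int dataNodeCount) {
        return (int) Math.floorMod(blockNum, (long) dataNodeCount);
    }

    // Datanode indices chosen for `count` consecutive block numbers.
    static int[] sequentialPlacement(long firstBlock, int count, int dataNodeCount) {
        int[] nodes = new int[count];
        for (int i = 0; i < count; i++) {
            nodes[i] = pickDataNode(firstBlock + i, dataNodeCount);
        }
        return nodes;
    }
}
```

For example, sequentialPlacement(10, 6, 4) walks the nodes in the order
2, 3, 0, 1, 2, 3, which is plain round robin.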

Sequential numbering provides contiguous numbers, but after garbage
collection that contiguity would be lost, so no advantage there.

When human beings eyeball block numbers, numbers that differ only
slightly are more likely to be misread than numbers that are totally
different.

If block numbering is sequential, then there is a temptation to use
32 bits instead of 64, but 32 bits leads to wrap-around and uh oh.
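
The wrap-around is easy to show with a toy counter. The names below are
hypothetical and this is only a sketch of the arithmetic, not of any
NDFS allocator.

```java
// Hypothetical sketch: a sequential counter in 32 bits versus 64 bits.
class WrapAroundSketch {
    // Java int arithmetic wraps silently on overflow.
    static int nextId32(int current) {
        return current + 1;
    }

    // Math.addExact throws ArithmeticException instead of wrapping,
    // and a 64-bit counter is far from overflow at this value anyway.
    static long nextId64(long current) {
        return Math.addExact(current, 1L);
    }
}
```

nextId32(Integer.MAX_VALUE) yields Integer.MIN_VALUE, so the sequence
jumps to the most negative id and would eventually come back around to
ids already handed out. nextId64 at the same value just returns
2147483648.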

> Blocks are not logged until the file is closed, so there could be a 
> problem on restart if datanodes report blocks for files that were never 
> closed.  These would collide with yet-unallocated block numbers, 
> potentially corrupting the filesystem. 

I suppose a large margin of block ids could be skipped if there
was doubt about the previous shutdown, but the random block numbers
have many advantages.
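
For what it's worth, the margin idea might look roughly like this. The
margin size, the cleanShutdown flag, and the names are all assumptions
for illustration; nothing here comes from the actual namenode code.

```java
// Hypothetical sketch: skip a margin of sequential ids after an
// unclean shutdown, so ids handed out for never-closed (unlogged)
// files are not reissued.
class RestartMarginSketch {
    // Assumed upper bound on ids allocated but not yet logged.
    static final long SAFETY_MARGIN = 1L << 20;

    static long nextBlockIdAfterRestart(long lastLoggedId, boolean cleanShutdown) {
        long skip = cleanShutdown ? 0L : SAFETY_MARGIN;
        return lastLoggedId + 1 + skip;
    }
}
```

This only helps if the margin really does exceed the number of
unlogged allocations.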

Paul
