spark-user mailing list archives

From Thodoris Zois <z...@ics.forth.gr>
Subject Re: Read or save specific blocks of a file
Date Thu, 03 May 2018 15:46:00 GMT
Hello Madhav,
What I did is pretty straightforward. Let's say that your HDFS block
size is 128 MB and you store a 256 MB file in HDFS, named Test.csv.
First run the command: `hdfs fsck Test.csv -locations -blocks -files`.
It returns some very useful information, including the list of
blocks. Let's say that you want to read the first block (block 0).
On the right side of the line that corresponds to block 0 you can find
the IP of the machine that holds this specific block in its local file
system, as well as the blockName (the block pool ID, e.g.
BP-1737920335-xxx.xxx.x.x-1510660262864) and the blockID (e.g.
blk_1073760915_20091) that will help you recognize it later. So what
you need from fsck is the blockName, the blockID and the IP of the
machine that has the specific block that you are interested in.
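
To pull those three values out of the fsck report without reading it by
eye, something like the following Python sketch could work. It is only
an illustration: the sample line layout is an assumption based on
typical fsck output (it differs between Hadoop versions), the datanode
address 10.0.0.5 is a made-up placeholder, and `parse_fsck_blocks` is a
hypothetical helper name, not part of any Hadoop tool.

```python
import re
from typing import List, NamedTuple


class BlockInfo(NamedTuple):
    index: int         # position of the block within the file (0 = first)
    block_pool: str    # the "blockName", i.e. the BP-... block pool ID
    block_id: str      # e.g. blk_1073760915_20091
    length: int        # block length in bytes
    hosts: List[str]   # IPs of the datanodes that hold a replica


# Assumed layout of a block line: "<n>. <BP-...>:<blk_..._...> len=<bytes> ... [locations]"
BLOCK_LINE = re.compile(r"^(\d+)\.\s+(BP-[^:\s]+):(blk_\d+_\d+)\s+len=(\d+)")
# Datanode locations are assumed to appear as "<ip>:<port>"
HOST = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")


def parse_fsck_blocks(fsck_output: str) -> List[BlockInfo]:
    """Extract block pool, block ID and datanode IPs from fsck text output."""
    blocks = []
    for line in fsck_output.splitlines():
        stripped = line.strip()
        match = BLOCK_LINE.match(stripped)
        if match is None:
            continue  # not a block line (header, summary, ...)
        # Only look for hosts after the block ID, since the block pool ID
        # itself contains an IP address.
        hosts = HOST.findall(stripped[match.end():])
        blocks.append(BlockInfo(int(match.group(1)), match.group(2),
                                match.group(3), int(match.group(4)), hosts))
    return blocks


# Hypothetical sample in the assumed format; 10.0.0.5 is a placeholder.
SAMPLE = (
    "0. BP-1737920335-xxx.xxx.x.x-1510660262864:blk_1073760915_20091 "
    "len=134217728 Live_repl=1 "
    "[DatanodeInfoWithStorage[10.0.0.5:50010,DS-example,DISK]]\n"
)

for block in parse_fsck_blocks(SAMPLE):
    print(block.index, block.block_pool, block.block_id, block.hosts)
```

In practice the input would be the captured text of the fsck command
above; if your Hadoop version prints the lines differently, the two
regular expressions are the part to adjust.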
Once you have these, you have everything you need. All you have to do
is connect to that IP and execute:
`find /data/hdfs-data/datanode/current/blockName/current/finalized/subdir0/ -name blockID`.
That command returns the full path where you can find the contents of
your file Test.csv that correspond to one block in HDFS.
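
The same lookup can be sketched in Python, to be run on the datanode
itself. Several things here are assumptions rather than part of the
recipe above: `find_block_file` is a hypothetical helper, the search
starts at `finalized/` instead of `finalized/subdir0/`, and it also
tries the block ID without its trailing generation stamp
(`blk_1073760915`), because the data file may be stored under that
shorter name. The `/data/hdfs-data/datanode` prefix is specific to this
setup and depends on how the datanode data directory is configured.

```python
import os
from typing import Optional


def find_block_file(datanode_dir: str, block_pool: str,
                    block_id: str) -> Optional[str]:
    """Return the local path of a finalized block file, or None if absent."""
    finalized = os.path.join(datanode_dir, "current", block_pool,
                             "current", "finalized")
    # Assumption: the data file may be named without the generation stamp,
    # so accept both "blk_<id>_<stamp>" and "blk_<id>".
    short_id = "_".join(block_id.split("_")[:2])
    wanted = {block_id, short_id}
    # Walk every subdirectory under finalized/, like `find` does.
    for root, _dirs, files in os.walk(finalized):
        for name in files:
            if name in wanted:
                return os.path.join(root, name)
    return None


# Example call (paths as in the message above):
# find_block_file("/data/hdfs-data/datanode",
#                 "BP-1737920335-xxx.xxx.x.x-1510660262864",
#                 "blk_1073760915_20091")
```

The function only reads directory listings, so it does not touch the
block data.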
What I do after I get the full path is copy the file, remove the
last line (because there is a good chance that the last line continues
in the next block) and store it in HDFS again under the desired
name. Then I can access one block of the file Test.csv from HDFS. That's
all; if you need any further information, do not hesitate to contact me.
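
As a rough sketch of the copy-and-trim step in Python: the helper names
below are hypothetical, and the behaviour is a small variation on what
I described, since it cuts the copy after the last newline instead of
always dropping the last line. If the block happens to end exactly on a
line boundary, nothing is removed.

```python
import os


def last_newline_offset(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the offset just past the last newline byte, or 0 if there is none."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        # Scan backwards in chunks so a 128 MB block is never loaded at once.
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            idx = chunk.rfind(b"\n")
            if idx != -1:
                return start + idx + 1
            pos = start
    return 0


def copy_without_partial_line(src: str, dst: str,
                              chunk_size: int = 1 << 20) -> int:
    """Copy src to dst, dropping any bytes after the last newline.

    Returns the number of bytes written.
    """
    keep = last_newline_offset(src, chunk_size)
    remaining = keep
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            chunk = fin.read(min(chunk_size, remaining))
            if not chunk:
                break
            fout.write(chunk)
            remaining -= len(chunk)
    return keep
```

The trimmed copy can then be stored with `hdfs dfs -put` under whatever
name you like.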
- Thodoris

On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
> Thodoris,
> 
> 
> I would certainly be interested in knowing how you were able to
> identify individual blocks and read from them. My understanding was
> that the HDFS protocol abstracts this from consumers to prevent
> potential data corruption issues. I would appreciate it if you could
> share some details of your approach.
> 
> 
> Thanks!
> madhav
> On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois <zois@ics.forth.gr>
> wrote:
> > That’s what I did :) If you need further information I can post my
> > solution.
> > 
> > - Thodoris
> > On 30 Apr 2018, at 22:23, David Quiroga <quirogadf4work@gmail.com>
> > wrote:
> > 
> > > > There might be a better way... but I wonder if it might be
> > > > possible to access the node where the block is stored and read it
> > > > from the local file system rather than from HDFS.
> > > > On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois
> > > > <zois@ics.forth.gr> wrote:
> > > > > Hello list,
> > > > > 
> > > > > I have a file on HDFS that is divided into 10 blocks
> > > > > (partitions).
> > > > > 
> > > > > Is there any way to retrieve data from a specific block (e.g.
> > > > > using the blockID)?
> > > > > 
> > > > > Besides that, is there any option to write the contents of each
> > > > > block (or of one block) into separate files?
> > > > > 
> > > > > Thank you very much,
> > > > > Thodoris
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> > > > > For additional commands, e-mail: user-help@hadoop.apache.org