spark-user mailing list archives

From ayan guha <>
Subject Re: Read or save specific blocks of a file
Date Thu, 03 May 2018 18:41:18 GMT
Is this a recommended way of reading data in the long run? I think it might
be better to write, or look for, an InputFormat that supports this need.

By the way, a block is designed to be an HDFS-internal representation that
enables certain features. It would be interesting to understand the use case
where a client application really needs to know about it; without that
context, it sounds like a questionable design.


On Fri, 4 May 2018 at 1:46 am, Thodoris Zois <> wrote:

> Hello Madhav,
> What I did is pretty straightforward. Let's say that your HDFS block size
> is 128 MB and you store a 256 MB file named Test.csv in HDFS.
> First use the command: `hdfs fsck Test.csv -locations -blocks -files`. It
> will return some very useful information, including the list of blocks.
> So let's say that you want to read the first block (block 0). On the right
> side of the line that corresponds to block 0 you can find the IP of the
> machine that holds this specific block on its local file system, as well
> as the blockName and blockID (e.g.: blk_1073760915_20091) that will help
> you recognize it later. So what you need from fsck is the blockName, the
> blockID, and the IP of the machine that has the specific block you are
> interested in.
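As a sketch, the blockID and the DataNode IP can be pulled out of an fsck block line with a small script. The sample line below only approximates the real fsck output format (built around the blk_1073760915_20091 example above; the block-pool prefix, `len=`, and `DatanodeInfoWithStorage` fields are assumptions about that format), so treat the regexes as a starting point and check them against your own output:

```python
import re

# Assumed (approximate) shape of one block line from
# `hdfs fsck Test.csv -locations -blocks -files` -- verify against real output.
SAMPLE_LINE = (
    "0. BP-929597290-10.0.0.1-1423070090000:blk_1073760915_20091 "
    "len=134217728 Live_repl=1 "
    "[DatanodeInfoWithStorage[10.31.0.2:9866,DS-1234,DISK]]"
)

def parse_block_line(line):
    """Extract (blockID, [DataNode IPs]) from one fsck block line."""
    block_id = re.search(r"blk_\d+_\d+", line).group(0)
    ips = re.findall(r"DatanodeInfoWithStorage\[([\d.]+):", line)
    return block_id, ips
```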
> Once you have these, you have everything you need. All you have to do is
> connect to that IP and execute: `find
> /data/hdfs-data/datanode/current/blockName/current/finalized/subdir0/ -name
> blockID`. That command returns the full path where you can find the
> contents of your file Test.csv that correspond to one block in HDFS.
> What I do after I get the full path is to copy the file, remove the last
> line (because there is a good chance that the last line continues in the
> next block), and store it again in HDFS with the desired name. Then I can
> access one block of the file Test.csv from HDFS. That's all; if you need
> any further information, do not hesitate to contact me.
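The "remove the last line" step above can be sketched as a pure byte operation (`drop_partial_last_line` is a hypothetical helper name; it assumes newline-terminated records, as in a CSV):

```python
def drop_partial_last_line(data: bytes) -> bytes:
    """Keep everything up to and including the final newline.

    The bytes after the last newline in a raw block file most likely belong
    to a record that continues in the next block, so they are dropped.
    """
    last_nl = data.rfind(b"\n")
    return data if last_nl == -1 else data[:last_nl + 1]
```

Note this mirrors what a text InputFormat does automatically at split boundaries (each split reads past its end to finish the last record, and the next split skips its first partial line).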
> - Thodoris
> On Thu, 2018-05-03 at 14:47 +0530, Madhav A wrote:
> Thodoris,
> I certainly would be interested in knowing how you were able to identify
> individual blocks and read from them. My understanding was that the HDFS
> protocol abstracts this from consumers to prevent potential data
> corruption issues. I would appreciate it if you could share some details
> of your approach.
> Thanks!
> madhav
> On Wed, May 2, 2018 at 3:34 AM, Thodoris Zois <> wrote:
> That’s what I did :) If you need further information I can post my
> solution.
> - Thodoris
> On 30 Apr 2018, at 22:23, David Quiroga <> wrote:
> There might be a better way... but I wonder if it might be possible to
> access the node where the block is stored and read it from the local file
> system rather than from HDFS.
> On Mon, Apr 23, 2018 at 11:05 AM, Thodoris Zois <> wrote:
> Hello list,
> I have a file on HDFS that is divided into 10 blocks (partitions).
> Is there any way to retrieve data from a specific block? (e.g: using
> the blockID).
> Except that, is there any option to write the contents of each block
> (or of one block) into separate files?
> Thank you very much,
> Thodoris
> --
Best Regards,
Ayan Guha
