hadoop-common-issues mailing list archives

From "Johannes Herr (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-10921) MapFile.fix fails silently when file is block compressed
Date Fri, 01 Aug 2014 15:08:38 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johannes Herr updated HADOOP-10921:
-----------------------------------

    Attachment: FixMapFileTest.java

> MapFile.fix fails silently when file is block compressed
> --------------------------------------------------------
>
>                 Key: HADOOP-10921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10921
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Johannes Herr
>         Attachments: FixMapFileTest.java
>
>
> MapFile provides a method 'fix' to reconstruct missing 'index' files. If the 'data' file
> is block compressed, the method will compute offsets that are too large, which leads to
> keys not being found in the MapFile. (See the attached test case.)
> Tested against 0.20.2 but the trunk version looks like it has the same problem.
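> For illustration, here is a minimal sketch of the failure mode (the attached
> FixMapFileTest.java is the actual test case; the path, class name and key/value types
> below are made up for the example):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.IntWritable;
>     import org.apache.hadoop.io.MapFile;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>
>     public class FixRepro {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         Path dir = new Path("/tmp/fixrepro.map"); // illustrative path
>
>         // Write a BLOCK-compressed MapFile large enough to span several
>         // compressed blocks.
>         MapFile.Writer writer = new MapFile.Writer(conf, fs, dir.toString(),
>             IntWritable.class, Text.class, SequenceFile.CompressionType.BLOCK);
>         for (int i = 0; i < 100000; i++) {
>           writer.append(new IntWritable(i), new Text("value-" + i));
>         }
>         writer.close();
>
>         // Drop the index and let MapFile.fix rebuild it.
>         fs.delete(new Path(dir, MapFile.INDEX_FILE_NAME), false);
>         MapFile.fix(fs, dir, IntWritable.class, Text.class, false, conf);
>
>         // Many lookups now return null although the keys are in the file,
>         // because the rebuilt index offsets point past the right block.
>         MapFile.Reader reader = new MapFile.Reader(fs, dir.toString(), conf);
>         System.out.println(reader.get(new IntWritable(54321), new Text()));
>         reader.close();
>       }
>     }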
> The cause of the problem is that 'dataReader.getPosition()' is used to find the offset to
> write for the next entry to be indexed. When the file is block compressed, however,
> 'dataReader.getPosition()' seems to return the position of the next compressed block, not
> of the block that contains the last entry. This position will thus be too large in most
> cases, and a seek with this offset will incorrectly report the key as not present.
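> The behavior is easy to observe directly. A small sketch (reusing the fs/dir/conf setup
> and imports from the example above):
>
>     // Print getPosition() after each record of a BLOCK-compressed
>     // SequenceFile: the value stays constant while records are served
>     // from the current decompressed block and only advances once the
>     // next compressed block is read, so it never identifies the block
>     // that holds the record just returned.
>     SequenceFile.Reader r = new SequenceFile.Reader(fs,
>         new Path(dir, MapFile.DATA_FILE_NAME), conf);
>     IntWritable k = new IntWritable();
>     Text v = new Text();
>     while (r.next(k, v)) {
>       System.out.println(k.get() + " -> " + r.getPosition());
>     }
>     r.close();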
> I think it's not obvious how to fix this, since the SequenceFile reader does not expose
> the offset of the currently buffered entries. I have experimented with watching the offset
> change, and that seems to work mostly, but it is quite ugly and not exact in edge cases.
> The method should probably throw an exception when the 'data' file is block compressed,
> instead of silently creating invalid files. A workaround for block-compressed files is to
> read the sequence file, write the entries to a new MapFile, and then replace the old file;
> a sketch follows. This also avoids the problems mentioned below.
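> A minimal sketch of that rewrite workaround (paths and key/value types are again
> assumptions for illustration; imports and fs/conf as in the first example):
>
>     Path broken = new Path("/data/broken.map");
>     Path rebuilt = new Path("/data/rebuilt.map");
>
>     // Stream the intact 'data' file and append every entry to a fresh
>     // MapFile; the Writer builds a correct index as it goes, since the
>     // keys in the data file are already sorted.
>     SequenceFile.Reader in = new SequenceFile.Reader(fs,
>         new Path(broken, MapFile.DATA_FILE_NAME), conf);
>     MapFile.Writer out = new MapFile.Writer(conf, fs, rebuilt.toString(),
>         IntWritable.class, Text.class, SequenceFile.CompressionType.BLOCK);
>     IntWritable key = new IntWritable();
>     Text value = new Text();
>     while (in.next(key, value)) {
>       out.append(key, value);
>     }
>     in.close();
>     out.close();
>
>     // Swap the rebuilt directory into place.
>     fs.delete(broken, true);
>     fs.rename(rebuilt, broken);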
> A few side notes: 
> 1. The 'index' files created by the fix method are not block compressed (whereas the
> 'index' files created by the MapFile Writer always are, since the 'index' file is read
> completely anyway).
> 2. The fix method does not index the first entry; the Writer does.
> 3. The header offset is not used.



