hadoop-mapreduce-dev mailing list archives

From "Binglin Chang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-3086) Supporting range scan using TFile, TotalOrderPartitioner and partition index
Date Sun, 25 Sep 2011 08:19:26 GMT
Supporting range scan using TFile, TotalOrderPartitioner and partition index

                 Key: MAPREDUCE-3086
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3086
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Binglin Chang
             Fix For: 0.23.0

Hive and HBase already offer similar or more powerful functionality, but they are overkill or
inconvenient in some cases, so it seems reasonable to add a few lightweight utility classes that
support only range scans. The utility classes include:
# InputFormat supporting range scan: Indexed(Text|Binary)InputFormat
  The input directory for IndexedInputFormat should contain one partition index and many TFiles.
Each TFile stores a certain range of keys that does not overlap with any other TFile; the
boundaries are stored in the partition index.
  Add 4 jobconfs, mapred.indexed(text|binary)inputformat.key.(start|end), to specify the scan range.
  For a MapReduce job using IndexedInputFormat, IndexedInputFormat.getSplits uses the partition
index to filter out TFiles that are outside the scan range.
  IndexedInputFormat does not support multiple input directories or splitting within a single
file; these can be added in the future.
# Tool to convert data in other formats into the IndexedInputFormat layout: TotalOrderIndexBuilder
  If the input data is already total-order partitioned and in TFile format, just add a partition
index to the input directory.
  Otherwise, run InputSampler to generate the partition index, then run a MapReduce job with
TotalOrderPartitioner to generate TFile-backed data, and finally move the partition index to the
output directory.
# Client tool to scan/search indexed data directory
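
To illustrate the filtering step described for getSplits, here is a minimal sketch in plain Java
with no Hadoop dependency. It is not taken from any patch attached to this issue: the class and
method names (RangeScanSketch, partitionOf, partitionsInRange) are hypothetical, keys are assumed
to be Strings compared in natural order, and the end key is assumed to be inclusive. It also
assumes the partition index holds N-1 sorted split points for N files, with file i covering keys
from split point i-1 (inclusive) up to split point i (exclusive).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RangeScanSketch {

    // Index of the partition (file) that would contain the given key.
    // A key equal to split point i is assigned to partition i + 1.
    static int partitionOf(String[] splitPoints, String key) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    // Indices of all partitions that may hold keys in [start, end].
    // A null start or end means the scan is unbounded on that side.
    static List<Integer> partitionsInRange(String[] splitPoints, String start, String end) {
        int first = (start == null) ? 0 : partitionOf(splitPoints, start);
        int last = (end == null) ? splitPoints.length : partitionOf(splitPoints, end);
        List<Integer> selected = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            selected.add(i);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Three split points describe four files: [..g), [g..n), [n..t), [t..]
        String[] splits = {"g", "n", "t"};
        System.out.println(partitionsInRange(splits, "h", "p")); // prints [1, 2]
    }
}
```

Only the files whose indices are returned would need to be opened for the scan; the remaining
files can be skipped without reading them.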
