mahout-user mailing list archives

From Ryan Josal <r...@josal.com>
Subject Re: Run more than one mapper for TestForest?
Date Mon, 29 Jul 2013 00:37:16 GMT
Late reply, but for what it's still worth: since I've seen a couple of other threads here about jobs getting too few mappers, I added a parameter to set a minimum number of map tasks.  Some of my Mahout jobs needed more mappers but weren't given many because of the small input file size.

        addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));


        // vectorFileSizeBytes is the total input size in bytes (computed elsewhere);
        // hadoopConf is the job's Configuration.
        int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
        int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
        log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
        if (minMapTasks > mapTasksThatWouldRun) {
            // Shrink the max split size so the input yields at least minMapTasks splits.
            String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
            log.info("Forcing mapred.max.split.size to " + splitSizeBytes
                    + " to ensure minimum map tasks = " + minMapTasks);
            hadoopConf.set("mapred.max.split.size", splitSizeBytes);
        }

    // Hadoop computes this internally when building splits, but that method isn't
    // public, so recompute it here: split = max(minSize, min(maxSize, blockSize)).
    private long getSplitSize() {
        long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
        long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
        long minSize = hadoopConf.getLong("mapred.min.split.size", 1);
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
                minSize, blockSize, maxSize, splitSize));
        return splitSize;
    }
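
To make the arithmetic concrete: with a 1 GB input and the default 64 MB block size, getSplitSize() returns 64 MB, so about 17 map tasks would run.  Asking for minMapTasks = 100 forces mapred.max.split.size down to roughly 10.7 MB (1073741824 / 100), and the input is then cut into about 100 splits.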

It seems like there should be a more straightforward way to do this, but it works for me and
I've used it on a lot of jobs to set a minimum number of mappers.
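If you'd rather keep it out of the option-parsing code, something like this sketch should work too; forceMinMappers is just a placeholder name, and I'm assuming you call it on the job's Configuration before the job is submitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Cap the split size so the input is cut into at least desiredMappers pieces.
    public static void forceMinMappers(Configuration conf, Path input, int desiredMappers)
            throws java.io.IOException {
        long inputBytes = FileSystem.get(conf).getContentSummary(input).getLength();
        long maxSplit = Math.max(1L, inputBytes / desiredMappers);
        conf.setLong("mapred.max.split.size", maxSplit);
    }

And if the driver runs through ToolRunner (I believe the df drivers do), you should be able to get the same effect with no code at all by passing -Dmapred.max.split.size=<bytes> on the command line.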

Ryan

On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:

> I'm attempting to run org.apache.mahout.classifier.df.mapreduce.TestForest
> on a CSV with 200,000 rows that have 500,000 features per row.
> However, TestForest is running extremely slowly, likely because only 1
> mapper was assigned to the job.  This seems strange because
> the org.apache.mahout.classifier.df.mapreduce.BuildForest step on the same
> data used 1772 mappers and took about 6 minutes.  (BTW: I know I
> *shouldn't* use the same data set for the training and the testing steps;
> this is purely a technical experiment to see if Mahout's Random Forest can
> handle the data sizes we typically deal with).
> 
> Any idea on how to get org.apache.mahout.classifier.df.mapreduce.TestForest
> to use more mappers?  Glancing at the code (and thinking about what is
> happening intuitively), it should be ripe for parallelization.
> 
> Thanks,
>        Adam

