mahout-user mailing list archives

From Alan Gardner <gard...@pythian.com>
Subject Re: Increase the number of mappers/split file? for matrixmult
Date Thu, 20 Jun 2013 13:15:09 GMT
You need to set the size of the input splits; by default,
FileInputFormat splits on block boundaries. You can override this with
mapred.max.split.size: if you want 10 map tasks from a 100MB file, pass
the flag -Dmapred.max.split.size=10485760 (10MB in bytes).
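
To make the arithmetic concrete, here is a rough Python sketch of how a
maximum split size turns into a split count. It assumes the rule used by
the newer-API FileInputFormat (split size = max(min size, min(max size,
block size)), plus a 10% slop on the last split); the 64MB block size and
the helper names are assumptions for illustration, not values from this
thread. Whether the flag takes effect for a given job depends on which
InputFormat that job actually uses.

```python
MB = 1024 * 1024

def split_size(block_size, min_size, max_size):
    # Assumed rule: clamp the block size between the min and max split size.
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, size, slop=1.1):
    # Cut full-size splits while more than `slop` split sizes remain,
    # then put whatever is left into one final split.
    splits = 0
    remaining = file_size
    while remaining / size > slop:
        splits += 1
        remaining -= size
    if remaining > 0:
        splits += 1
    return splits

file_size = 100 * MB      # the 100MB file from the example above
block_size = 64 * MB      # assumed HDFS block size
max_split = file_size // 10

print(max_split)  # 10485760
print(count_splits(file_size, split_size(block_size, 1, max_split)))  # 10
# With no cap, the split size falls back to the (assumed) block size.
print(count_splits(file_size, split_size(block_size, 1, 2**63 - 1)))  # 2
```

Under these assumptions, capping the split size at one tenth of the file
size yields 10 splits, and therefore 10 map tasks.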

Physically splitting the files is a bad idea long-term because every file
and block takes up memory in the NameNode. If you have a bunch of small
files (or a bunch of files split into small blocks), you may run out of
memory in the NameNode before you run out of disk space on your cluster.
Of course, federation is supposed to take care of this, but it's still
best practice to keep your files large.
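
As a back-of-the-envelope illustration of the NameNode point, the sketch
below compares the metadata cost of storing the same 1TB as large files
versus small files. The figure of roughly 150 bytes per NameNode object
(one object per file, one per block) is a commonly quoted rule of thumb
and is used here as an assumption, not a measurement; the function name
and file sizes are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value
MB = 1024 * 1024
TB = 1024 * 1024 * MB

def namenode_bytes(num_files, blocks_per_file):
    # One in-memory object per file plus one per block of that file.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Same 1TB of data, stored two ways (each file fits in a single block).
large_files = TB // (128 * MB)   # 8192 files of 128MB
small_files = TB // MB           # 1048576 files of 1MB

print(namenode_bytes(large_files, 1))  # 2457600
print(namenode_bytes(small_files, 1))  # 314572800
```

Under these assumptions the small-file layout needs 128 times as much
NameNode memory for the same amount of data.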


On Thu, Jun 20, 2013 at 3:30 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:

> Hi!
>
> I don't know the particular details of this job, but usually the number of
> mappers being launched is a Hadoop problem. And Hadoop looks at the number
> of input splits as its main hint.
> So, if your matrices are split in multiple smaller files, you'll likely get
> multiple mappers.
>
> Since I assume your matrices are SequenceFiles, maybe try out this:
>
> https://github.com/apache/mahout/blob/trunk/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
>
> This tool is called "resplit" and it should work for any Writables.
>
> https://github.com/apache/mahout/blob/trunk/src/conf/driver.classes.default.props
>
> See if resplitting works. :)
>
>
> On Thu, Jun 20, 2013 at 9:18 AM, Rafa Alfaro <ralfaro2002@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm trying to run the matrix multiplication of two relatively small
> > matrices (4219*200)(200*54622), but it is taking too long because only
> > a single mapper is launched. I'm running this on a 10 node cluster.
> >
> > I have tried changing the MAHOUT_OPTS in the mahout file:
> >
> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=18"
> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=9"
> >
> > Also passing the options directly on the command:
> >
> > mahout matrixmult -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=9
> > --numRowsA 200 --numColsA 4819 --numRowsB 200 --numColsB 54622
> > --inputPathA matrixA --inputPathB matrixB
> >
> > But no luck with this either.
> >
> > My Hadoop mapred-site.xml looks like this:
> >
> > <configuration>
> >   <property>
> >     <name>mapred.job.tracker</name>
> >     <value>serverX:54311</value>
> >     <final>true</final>
> >   </property>
> >   <property>
> >     <name>mapred.child.ulimit</name>
> >     <value>unlimited</value>
> >   </property>
> >   <property>
> >     <name>mapred.tasktracker.map.tasks.maximum</name>
> >     <value>2</value>
> >     <final>true</final>
> >   </property>
> >   <property>
> >     <name>mapred.tasktracker.reduce.tasks.maximum</name>
> >     <value>2</value>
> >     <final>true</final>
> >   </property>
> >   <property>
> >     <name>mapred.child.java.opts</name>
> >     <value>-Xmx2000m</value>
> >   </property>
> > </configuration>
> >
> > Am I missing something on the configuration?
> >
> > Right now, with 1 mapper, it is taking 4 min on average to advance 1%
> > of the mapper task.
> >
> > Thank you,
> > Rafael Alfaro
> >
>



-- 
Alan Gardner
Solutions Architect - CTO Office

gardner@pythian.com | LinkedIn:
http://www.linkedin.com/profile/view?id=65508699 |
@alanctgardner<https://twitter.com/alanctgardner>
Tel: +1 613 565 8696 x1218
Mobile: +1 613 897 5655
