hadoop-mapreduce-dev mailing list archives

From "AbdulRahman AlHamali (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6453) Repeatable Input File Format
Date Mon, 17 Aug 2015 16:59:46 GMT
AbdulRahman AlHamali created MAPREDUCE-6453:

             Summary: Repeatable Input File Format
                 Key: MAPREDUCE-6453
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6453
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: AbdulRahman AlHamali
            Assignee: AbdulRahman AlHamali
            Priority: Minor

We are interested in training deep learning architectures on Hadoop clusters. We developed
an algorithm that carries out this training in a MapReduce fashion. However, one problem
remains that we would like to address.

In deep learning, the training data is usually passed over multiple times (10 or even more).
However, we could not find a way to go through the input training file multiple times without
reducing first, then mapping and reducing again, and so on. So, to carry out the experiments,
we were forced to physically repeat the files 10 or 20 times. This is clearly not the best
solution: first, the file size becomes much larger, and second, it is not a neat way to carry
out the job.

Thus, we aim to create an interface that input file formats can implement, giving them the
ability to repeat a file n times before eventually reducing. This would solve the problem and
make Hadoop more suitable for training deep learning algorithms, and for other problems that
require going over the data multiple times before reducing.
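
The core idea of repeating the input n times before reducing can be sketched in plain Java,
independent of Hadoop. This is only an illustration under stated assumptions: the class
RepeatedInput and its repeat method are hypothetical names, not part of Hadoop and not the
interface proposed in this issue, and the record source is modelled as a Supplier that
re-opens an Iterator for each pass instead of as a real input format.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical helper (not a Hadoop class): replays a re-openable record
// source a fixed number of times, so the data never has to be physically
// duplicated on disk.
final class RepeatedInput {
    private RepeatedInput() {}

    // open   : re-opens the underlying source and returns a fresh iterator
    // passes : how many times the whole source should be replayed
    static <T> Iterator<T> repeat(Supplier<Iterator<T>> open, int passes) {
        if (passes < 0) {
            throw new IllegalArgumentException("passes must be >= 0");
        }
        return new Iterator<T>() {
            private int remaining = passes;      // passes not yet started
            private Iterator<T> current = null;  // iterator for the active pass

            // Move to the next pass that still has records.
            // Returns false once every pass has been consumed.
            private boolean advance() {
                while (current == null || !current.hasNext()) {
                    if (remaining == 0) {
                        return false;
                    }
                    remaining--;
                    current = open.get();
                }
                return true;
            }

            @Override
            public boolean hasNext() {
                return advance();
            }

            @Override
            public T next() {
                if (!advance()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }
}
```

For example, RepeatedInput.repeat(records::iterator, 10) would hand a mapper the same records
ten times in a row while storing them only once. A real implementation would presumably need
to apply the same re-opening logic inside an input format's record reader.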

This message was sent by Atlassian JIRA
