Ah, I see it, thanks! Now for more coffee.

On Wed, Feb 13, 2013 at 8:49 AM, Dave Beech wrote:

> I haven't tried the code yet but I think it looks correct.
> MultiSequenceFileRecordReader will get created via reflection and
> needs the (CombineFileSplit split, TaskAttemptContext context, Integer
> index) sig as its constructor.
>
> On 13 February 2013 16:40, Josh Wills wrote:
> > Ha! Quite possibly. Let's JIRA it up.
> >
> > Victor, I haven't had much coffee yet, but it looks like there is a bug in
> > the gist-- the MultiSequenceFileInputFormat refers to a new
> > CombineFileRecordReader, which has a different constructor signature from
> > the MultiSequenceFileRecordReader in the patch. What did I miss?
> >
> > J
> >
> >
> > On Wed, Feb 13, 2013 at 8:34 AM, Dave Beech wrote:
> >
> >> Love it enough to write it for us? ;) I'll stick it in JIRA just in
> >> case. Or if not, maybe one day I'll have a free couple of hours and
> >> feel like doing it myself!
> >>
> >> Cheers,
> >> Dave
> >>
> >> On 13 February 2013 16:18, Josh Wills wrote:
> >> > Yep, I would love that.
> >> >
> >> >
> >> > On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech wrote:
> >> >
> >> >> Actually, while we're on the subject of small files and
> >> >> CombineFileInputFormat...
> >> >>
> >> >> I believe Hive has a feature whereby CombineFileInputFormat is used
> >> >> internally if it's required to read many small files to make the
> >> >> resulting mapreduce jobs more efficient. Would it be worth looking
> >> >> into whether Crunch could support this, too?
> >> >>
> >> >>
> >> >> On 13 February 2013 15:27, Dave Beech wrote:
> >> >> > thanks!
> >> >> >
> >> >> > On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >> https://gist.github.com/viacoban/4945325
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech wrote:
> >> >> >>
> >> >> >>> A gist would be great - thanks very much
> >> >> >>>
> >> >> >>> Dave
> >> >> >>>
> >> >> >>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> > Dave,
> >> >> >>> >
> >> >> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >> >> >>> >
> >> >> >>> > --victor
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >> >> >>> >
> >> >> >>> >> Hi Victor,
> >> >> >>> >> Any chance you could share your implementation of a Source that reads
> >> >> >>> >> from multiple paths? I've wanted this for a while but haven't found
> >> >> >>> >> time to go ahead and write one myself!
> >> >> >>> >> Thanks,
> >> >> >>> >> Dave
> >> >> >>> >>
> >> >> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> > Thanks J
> >> >> >>> >> >
> >> >> >>> >> > I could not extend the FileSourceImpl since it works with only one input
> >> >> >>> >> > path,
> >> >> >>> >> > but I implemented the Source interface directly and it appears to do the
> >> >> >>> >> > job, thx for the pointer
> >> >> >>> >> >
> >> >> >>> >> > -- victor
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >
> >> >> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From.
> >> >> >>> >> >> You can also write a custom extension of o.a.c.io.impl.FileSourceImpl
> >> >> >>> >> >> if it's one you're going to be using a lot, or if there is custom
> >> >> >>> >> >> configuration information required to use the InputFormat.
> >> >> >>> >> >>
> >> >> >>> >> >> J
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >> > That's exactly what I have in the code not using Crunch API:
> >> >> >>> >> >> > public class MultiSequenceFileInputFormat extends
> >> >> >>> >> >> > CombineFileInputFormat {
> >> >> >>> >> >> > ...
> >> >> >>> >> >> > }
> >> >> >>> >> >> >
> >> >> >>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
> >> >> >>> >> >> >
> >> >> >>> >> >> >
> >> >> >>> >> >> >
> >> >> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >> >
> >> >> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> >> >> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > J
> >> >> >>> >> >> > >
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > > Thanks Josh,
> >> >> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have several
> >> >> >>> >> >> > > > hundreds of input files?
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >> >> >>> >> >> > > > > PTable<K, V> second = ...;
> >> >> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > etc.
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence files as pipeline
> >> >> >>> >> >> > > > > > source?
> >> >> >>> >> >> > > > > > something similar to standard MultipleInputs
> >> >> >>> >> >> > > > > >
> >> >> >>> >> >> > > > > > Thanks,
> >> >> >>> >> >> > > > > > victor
> >> >
> >> > --
> >> > Director of Data Science
> >> > Cloudera
> >> > Twitter: @josh_wills
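
For anyone reading this thread later: the point about the record reader being created via reflection can be illustrated without Hadoop on the classpath. The following is only a minimal sketch, not the code from the gist. SplitStub, ContextStub, GoodReader, BadReader and ReflectiveReaderDemo are hypothetical stand-ins invented for this example; the only thing taken from the thread is the (split, context, Integer index) constructor shape that Dave describes. It shows that a reflective lookup for that three-argument constructor succeeds for a reader that declares it and fails for a reader with any other signature.

```java
import java.lang.reflect.Constructor;

public class ReflectiveReaderDemo {

    // Hypothetical stand-in for CombineFileSplit: just a list of file paths.
    static class SplitStub {
        final String[] paths;
        SplitStub(String... paths) { this.paths = paths; }
    }

    // Hypothetical stand-in for TaskAttemptContext.
    static class ContextStub { }

    // Declares the (split, context, Integer index) constructor from the thread.
    static class GoodReader {
        final String path;
        public GoodReader(SplitStub split, ContextStub context, Integer index) {
            this.path = split.paths[index];
        }
    }

    // Does similar work but omits the index argument, so the lookup below misses it.
    static class BadReader {
        final String path;
        public BadReader(SplitStub split, ContextStub context) {
            this.path = split.paths[0];
        }
    }

    // Looks up the three-argument constructor by its exact parameter types
    // and instantiates the reader for the file at the given index.
    static Object newReader(Class<?> readerClass, SplitStub split,
                            ContextStub context, int index)
            throws ReflectiveOperationException {
        Constructor<?> ctor = readerClass.getDeclaredConstructor(
                SplitStub.class, ContextStub.class, Integer.class);
        return ctor.newInstance(split, context, Integer.valueOf(index));
    }

    // Returns true if the reader class can be created through the lookup above.
    static boolean canInstantiate(Class<?> readerClass) {
        try {
            newReader(readerClass, new SplitStub("a.seq", "b.seq"), new ContextStub(), 1);
            return true;
        } catch (ReflectiveOperationException e) {
            // NoSuchMethodException lands here when the signature does not match.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("GoodReader: " + canInstantiate(GoodReader.class));
        System.out.println("BadReader: " + canInstantiate(BadReader.class));
    }
}
```

Running main reports true for GoodReader and false for BadReader. In a real job the mismatch would only surface at task runtime, which is presumably why the two constructor signatures looked like a bug at first glance.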