From: Dave Beech
To: crunch-dev@incubator.apache.org
Reply-To: crunch-dev@incubator.apache.org
Date: Wed, 13 Feb 2013 16:49:21 +0000
Subject: Re: multiple input files as pipeline source?
Content-Type: text/plain; charset=ISO-8859-1

I haven't tried the code yet, but I think it looks correct.
MultiSequenceFileRecordReader gets created via reflection, so it needs a
constructor with the (CombineFileSplit split, TaskAttemptContext context,
Integer index) signature.

On 13 February 2013 16:40, Josh Wills wrote:
> Ha! Quite possibly. Let's JIRA it up.
>
> Victor, I haven't had much coffee yet, but it looks like there is a bug in
> the gist-- the MultiSequenceFileInputFormat refers to a new
> CombineFileRecordReader, which has a different constructor signature from
> the MultiSequenceFileRecordReader in the patch. What did I miss?
>
> J
>
>
> On Wed, Feb 13, 2013 at 8:34 AM, Dave Beech wrote:
>
>> Love it enough to write it for us? ;) I'll stick it in JIRA just in
>> case. Or if not, maybe one day I'll have a free couple of hours and
>> feel like doing it myself!
>>
>> Cheers,
>> Dave
>>
>> On 13 February 2013 16:18, Josh Wills wrote:
>> > Yep, I would love that.
>> >
>> >
>> > On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech wrote:
>> >
>> >> Actually, while we're on the subject of small files and
>> >> CombineFileInputFormat...
>> >>
>> >> I believe Hive has a feature whereby CombineFileInputFormat is used
>> >> internally if it's required to read many small files to make the
>> >> resulting mapreduce jobs more efficient. Would it be worth looking
>> >> into whether Crunch could support this, too?
>> >>
>> >>
>> >> On 13 February 2013 15:27, Dave Beech wrote:
>> >> > thanks!
>> >> >
>> >> > On 13 February 2013 15:22, Victor Iacoban wrote:
>> >> >> https://gist.github.com/viacoban/4945325
>> >> >>
>> >> >>
>> >> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech wrote:
>> >> >>
>> >> >>> A gist would be great - thanks very much
>> >> >>>
>> >> >>> Dave
>> >> >>>
>> >> >>> On 13 February 2013 14:52, Victor Iacoban wrote:
>> >> >>> > Dave,
>> >> >>> >
>> >> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >> >>> >
>> >> >>> > --victor
>> >> >>> >
>> >> >>> >
>> >> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech wrote:
>> >> >>> >
>> >> >>> >> Hi Victor,
>> >> >>> >> Any chance you could share your implementation of a Source that reads
>> >> >>> >> from multiple paths? I've wanted this for a while but haven't found
>> >> >>> >> time to go ahead and write one myself!
>> >> >>> >> Thanks,
>> >> >>> >> Dave
>> >> >>> >>
>> >> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> > Thanks J
>> >> >>> >> >
>> >> >>> >> > I could not extend the FileSourceImpl since it works with only one
>> >> >>> >> > input path,
>> >> >>> >> > but I implemented the Source interface directly and it appears to do
>> >> >>> >> > the job, thx for the pointer
>> >> >>> >> >
>> >> >>> >> > -- victor
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >
>> >> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From.
>> >> >>> >> >> You can also
>> >> >>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one
>> >> >>> >> >> you're going to be using a lot, or if there is custom configuration
>> >> >>> >> >> information required to use the InputFormat.
>> >> >>> >> >>
>> >> >>> >> >> J
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >>
>> >> >>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >> >>> >> >> > public class MultiSequenceFileInputFormat extends
>> >> >>> >> >> > CombineFileInputFormat {
>> >> >>> >> >> > ...
>> >> >>> >> >> > }
>> >> >>> >> >> >
>> >> >>> >> >> > Are you saying there is a way to use my custom input format with
>> >> >>> >> >> > Crunch?
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >> >
>> >> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
>> >> >>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, ala
>> >> >>> >> >> > >
>> >> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >> >>> >> >> > >
>> >> >>> >> >> > > J
>> >> >>> >> >> > >
>> >> >>> >> >> > >
>> >> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >> > >
>> >> >>> >> >> > > > Thanks Josh,
>> >> >>> >> >> > > > Is there any performance penalty in unions, assuming that I
>> >> >>> >> >> > > > have several hundred input files?
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >
>> >> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >> > > >
>> >> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >> >>> >> >> > > > > PTable<K, V> second = ...;
>> >> >>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > etc.
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence
>> >> >>> >> >> > > > > > files as pipeline source?
>> >> >>> >> >> > > > > > something similar to standard MultipleInputs
>> >> >>> >> >> > > > > >
>> >> >>> >> >> > > > > > Thanks,
>> >> >>> >> >> > > > > > victor
>> >> >>> >> >> > > > > >
>> >
>> >
>> > --
>> > Director of Data Science
>> > Cloudera
>> > Twitter: @josh_wills
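
A note on the constructor point made at the top of this message: the
reply says the record reader is created via reflection and needs a
(CombineFileSplit, TaskAttemptContext, Integer) constructor. The sketch
below tries to show why the exact parameter types matter for a reflective
lookup. It is only an illustration: FakeSplit and FakeContext are
hypothetical stand-ins for the Hadoop types so that it compiles without
Hadoop on the classpath, and GoodReader/BadReader are made-up names, not
the classes from the gist.

```java
import java.lang.reflect.Constructor;

public class ReflectiveReaderSketch {

    // Hypothetical stand-ins for CombineFileSplit and TaskAttemptContext.
    static class FakeSplit {}
    static class FakeContext {}

    // Declares the boxed-Integer constructor shape described in the message.
    static class GoodReader {
        final int index;

        public GoodReader(FakeSplit split, FakeContext context, Integer index) {
            this.index = index;
        }
    }

    // Same parameters, but a primitive int: reflection treats this as a
    // different signature from (.., .., Integer).
    static class BadReader {
        public BadReader(FakeSplit split, FakeContext context, int index) {}
    }

    // Looks up the exact (split, context, Integer) constructor by its
    // parameter types, the way a reflective factory would.
    static boolean hasExpectedConstructor(Class<?> readerClass) {
        try {
            readerClass.getDeclaredConstructor(
                FakeSplit.class, FakeContext.class, Integer.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Instantiates GoodReader through the reflectively resolved constructor.
    static GoodReader instantiate(int index) throws ReflectiveOperationException {
        Constructor<GoodReader> ctor = GoodReader.class.getDeclaredConstructor(
            FakeSplit.class, FakeContext.class, Integer.class);
        return ctor.newInstance(new FakeSplit(), new FakeContext(), index);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasExpectedConstructor(GoodReader.class));
        System.out.println(hasExpectedConstructor(BadReader.class));
        System.out.println(instantiate(3).index);
    }
}
```

The lookup matches parameter types exactly, so a reader that declares
`int index` instead of `Integer index` would not be found this way, which
is one way a constructor mismatch like the one discussed above could show
up at runtime rather than at compile time.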