spark-dev mailing list archives

From: Aaron Davidson <ilike...@gmail.com>
Subject: Re: hadoop input/output format advanced control
Date: Thu, 26 Mar 2015 04:36:57 GMT

Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
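
A minimal Scala sketch of the cloning approach discussed in this thread: copy the Configuration while holding a lock, then set extra properties on the copy only. The names `ConfCloning`, `cloneLock`, and `cloneWith` are hypothetical. The lock here is an application-level stand-in, on the assumption that HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK may not be accessible from user code in your Spark version.

```scala
import org.apache.hadoop.conf.Configuration

object ConfCloning {
  // Hypothetical application-level lock. It only protects copies made through
  // this helper; it is NOT Spark's internal lock.
  private val cloneLock = new Object

  /** Copy `base` under the lock, then apply `overrides` to the copy only. */
  def cloneWith(base: Configuration, overrides: Map[String, String]): Configuration = {
    // The copy constructor reads the internals of `base`, so serialize it.
    val copy = cloneLock.synchronized { new Configuration(base) }
    overrides.foreach { case (key, value) => copy.set(key, value) }
    copy
  }
}
```

The original Configuration is left untouched, so settings applied this way do not leak into other datasets that share it.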

On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> Great - that's even easier. Maybe we could have a simple example in the
> doc.
>
> On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza <sandy.ryza@cloudera.com>
> wrote:
> > Regarding Patrick's question, you can just do "new Configuration(oldConf)"
> > to get a cloned Configuration object and add any new properties to it.
> >
> > -Sandy
> >
> > On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid <irashid@cloudera.com>
> > wrote:
> >
> >> Hi Nick,
> >>
> >> I don't remember the exact details of these scenarios, but I think the
> >> user wanted a lot more control over how the files got grouped into
> >> partitions, to group the files together by some arbitrary function.  I
> >> didn't think that was possible w/ CombineFileInputFormat, but maybe there
> >> is a way?
> >>
> >> thanks
> >>
> >> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath
> >> <nick.pentreath@gmail.com> wrote:
> >>
> >> > Imran, on your point about reading multiple files together in a
> >> > partition, is it not simpler to copy the Hadoop conf and set per-RDD
> >> > settings for min split to control the input size per partition,
> >> > together with something like CombineFileInputFormat?
> >> >
> >> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <irashid@cloudera.com>
> >> > wrote:
> >> >
> >> > > I think this would be a great addition, I totally agree that you need
> >> > > to be able to set these at a finer context than just the SparkContext.
> >> > >
> >> > > Just to play devil's advocate, though -- the alternative is for you to
> >> > > just subclass HadoopRDD yourself, or make a totally new RDD, and then
> >> > > you could expose whatever you need.  Why is this solution better?  IMO
> >> > > the criteria are:
> >> > > (a) common operations
> >> > > (b) error-prone / difficult to implement
> >> > > (c) non-obvious, but important for performance
> >> > >
> >> > > I think this case fits (a) & (c), so I think it's still worthwhile. But
> >> > > it's also worth asking whether or not it's too difficult for a user to
> >> > > extend HadoopRDD right now.  There have been several cases in the past
> >> > > week where we've suggested that a user should read from hdfs themselves
> >> > > (e.g., to read multiple files together in one partition) -- with*out*
> >> > > reusing the code in HadoopRDD, though they would lose things like the
> >> > > metric tracking & preferred locations you get from HadoopRDD.  Does
> >> > > HadoopRDD need some refactoring to make that easier to do?  Or do we
> >> > > just need a good example?
> >> > >
> >> > > Imran
> >> > >
> >> > > (sorry for hijacking your thread, Koert)
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <koert@tresata.com>
> >> > > wrote:
> >> > >
> >> > > > see email below. reynold suggested i send it to dev instead of user
> >> > > >
> >> > > > ---------- Forwarded message ----------
> >> > > > From: Koert Kuipers <koert@tresata.com>
> >> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> > > > Subject: hadoop input/output format advanced control
> >> > > > To: "user@spark.apache.org" <user@spark.apache.org>
> >> > > >
> >> > > >
> >> > > > currently it's pretty hard to control the Hadoop Input/Output
> >> > > > formats used in Spark. The convention seems to be to add extra
> >> > > > parameters to all methods and then somewhere deep inside the code
> >> > > > (for example in PairRDDFunctions.saveAsHadoopFile) all these
> >> > > > parameters get translated into settings on the Hadoop Configuration
> >> > > > object.
> >> > > >
> >> > > > for example for compression i see "codec: Option[Class[_ <:
> >> > > > CompressionCodec]] = None" added to a bunch of methods.
> >> > > >
> >> > > > how scalable is this solution really?
> >> > > >
> >> > > > for example i need to read from a hadoop dataset and i don't want
> >> > > > the input (part) files to get split up. the way to do this is to set
> >> > > > "mapred.min.split.size". now i don't want to set this at the level of
> >> > > > the SparkContext (which can be done), since i don't want it to apply
> >> > > > to input formats in general. i want it to apply to just this one
> >> > > > specific input dataset i need to read. which leaves me with no
> >> > > > options currently. i could go add yet another input parameter to all
> >> > > > the methods (SparkContext.textFile, SparkContext.hadoopFile,
> >> > > > SparkContext.objectFile, etc.). but that seems ineffective.
> >> > > >
> >> > > > why can we not expose a Map[String, String] or some other generic
> >> > > > way to manipulate settings for hadoop input/output formats? it would
> >> > > > require adding one more parameter to all methods to deal with hadoop
> >> > > > input/output formats, but after that it's done. one parameter to rule
> >> > > > them all....
> >> > > >
> >> > > > then i could do:
> >> > > > val x = sc.textFile("/some/path", formatSettings =
> >> > > > Map("mapred.min.split.size" -> "12345"))
> >> > > >
> >> > > > or
> >> > > > rdd.saveAsTextFile("/some/path", formatSettings =
> >> > > > Map("mapred.output.compress" -> "true",
> >> > > > "mapred.output.compression.codec" -> "somecodec"))
> >> > > >
> >> > >
> >> >
> >>
>
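
The `formatSettings` parameter proposed in the original message does not exist in Spark. As a rough sketch of how per-dataset settings could be approximated with existing calls, the helper below copies the context's Hadoop Configuration, applies a settings map to the copy, and passes it to `SparkContext.newAPIHadoopFile`. `PerDatasetSettings` and `textFileWith` are hypothetical names. Whether "mapred.min.split.size" is honored by the new-API input format depends on the Hadoop version's handling of deprecated keys, which is assumed here.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PerDatasetSettings {
  /** Read a text dataset with Hadoop settings that apply to this RDD only. */
  def textFileWith(
      sc: SparkContext,
      path: String,
      formatSettings: Map[String, String]): RDD[String] = {
    // Work on a copy so the SparkContext-wide configuration is not modified.
    val conf = new Configuration(sc.hadoopConfiguration)
    formatSettings.foreach { case (key, value) => conf.set(key, value) }

    sc.newAPIHadoopFile(
        path,
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      .map { case (_, line) => line.toString } // drop byte offsets, keep lines
  }
}
```

A call might look like `PerDatasetSettings.textFileWith(sc, "/some/path", Map("mapred.min.split.size" -> "12345"))`. Other datasets read through the same SparkContext keep the default settings.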
