spark-dev mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: hadoop input/output format advanced control
Date Tue, 24 Mar 2015 22:40:43 GMT
i would like to use objectFile with some tweaks to the hadoop conf.
currently there is no way to do that, except recreating objectFile myself.
and i have no access to some of the code objectFile uses, since it's
private to spark.
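
for reference, recreating objectFile with a per-RDD conf looks roughly like
this (a minimal sketch: plain java serialization stands in for the private
spark Utils code, and the conf key and value are just examples):

import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.{JobConf, SequenceFileInputFormat}

// copy the context-level conf so the tweak applies only to this RDD
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.min.split.size", "1073741824")

// objectFile data is a sequence file of (NullWritable, BytesWritable),
// each value holding a java-serialized Array of records
val objects = sc.hadoopRDD(
    jobConf,
    classOf[SequenceFileInputFormat[NullWritable, BytesWritable]],
    classOf[NullWritable],
    classOf[BytesWritable])
  .flatMap { case (_, bytes) =>
    // copy out only the valid bytes before deserializing
    val valid = java.util.Arrays.copyOfRange(bytes.getBytes, 0, bytes.getLength)
    val in = new ObjectInputStream(new ByteArrayInputStream(valid))
    in.readObject().asInstanceOf[Array[AnyRef]]
  }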


On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> Yeah - to Nick's point, I think the way to do this is to pass in a
> custom conf when you create a Hadoop RDD (that's AFAIK why the conf
> field is there). Is there anything you can't do with that feature?
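>
> For example, a minimal sketch of that approach (the path and split size
> here are made up):
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>
> // per-RDD copy of the conf; the SparkContext-level conf is untouched
> val conf = new Configuration(sc.hadoopConfiguration)
> conf.set("mapred.min.split.size", "1073741824")
> val lines = sc.newAPIHadoopFile("/some/path",
>   classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
>   .map(_._2.toString)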
>
> On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
> <nick.pentreath@gmail.com> wrote:
> > Imran, on your point about reading multiple files together in a partition,
> > is it not simpler to use the approach of copying the Hadoop conf and
> > setting per-RDD settings for min split to control the input size per
> > partition, together with something like CombineFileInputFormat?
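> >
> > For example (a rough sketch; the directory and max split size are
> > placeholders):
> >
> > import org.apache.hadoop.conf.Configuration
> > import org.apache.hadoop.io.{LongWritable, Text}
> > import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
> >
> > val conf = new Configuration(sc.hadoopConfiguration)
> > // pack many small files into each split, capped at ~128MB per partition
> > conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")
> > val packed = sc.newAPIHadoopFile("/some/dir",
> >   classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text],
> >   conf)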
> >
> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid <irashid@cloudera.com>
> > wrote:
> >
> >> I think this would be a great addition; I totally agree that you need
> >> to be able to set these at a finer context than just the SparkContext.
> >>
> >> Just to play devil's advocate, though -- the alternative is for you to
> >> just subclass HadoopRDD yourself, or make a totally new RDD, and then
> >> you could expose whatever you need.  Why is this solution better?  IMO
> >> the criteria are:
> >> (a) common operations
> >> (b) error-prone / difficult to implement
> >> (c) non-obvious, but important for performance
> >>
> >> I think this case fits (a) & (c), so I think it's still worthwhile.  But
> >> it's also worth asking whether or not it's too difficult for a user to
> >> extend HadoopRDD right now.  There have been several cases in the past
> >> week where we've suggested that a user should read from hdfs themselves
> >> (e.g., to read multiple files together in one partition) -- with*out*
> >> reusing the code in HadoopRDD, though they would lose things like the
> >> metric tracking & preferred locations you get from HadoopRDD.  Does
> >> HadoopRDD need some refactoring to make that easier to do?  Or do we
> >> just need a good example?
> >>
> >> Imran
> >>
> >> (sorry for hijacking your thread, Koert)
> >>
> >>
> >>
> >> On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers <koert@tresata.com>
> >> wrote:
> >>
> >> > see email below. reynold suggested i send it to dev instead of user
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Koert Kuipers <koert@tresata.com>
> >> > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> > Subject: hadoop input/output format advanced control
> >> > To: "user@spark.apache.org" <user@spark.apache.org>
> >> >
> >> >
> >> > currently it's pretty hard to control the Hadoop Input/Output formats
> >> > used in Spark. the convention seems to be to add extra parameters to
> >> > all methods and then somewhere deep inside the code (for example in
> >> > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> >> > into settings on the Hadoop Configuration object.
> >> >
> >> > for example for compression i see "codec: Option[Class[_ <:
> >> > CompressionCodec]] = None" added to a bunch of methods.
> >> >
> >> > how scalable is this solution really?
> >> >
> >> > for example i need to read from a hadoop dataset and i don't want the
> >> > input (part) files to get split up. the way to do this is to set
> >> > "mapred.min.split.size". now i don't want to set this at the level of
> >> > the SparkContext (which can be done), since i don't want it to apply
> >> > to input formats in general. i want it to apply to just this one
> >> > specific input dataset i need to read. which leaves me with no options
> >> > currently. i could go add yet another input parameter to all the
> >> > methods (SparkContext.textFile, SparkContext.hadoopFile,
> >> > SparkContext.objectFile, etc.). but that seems ineffective.
> >> >
> >> > why can we not expose a Map[String, String] or some other generic way
> >> > to manipulate settings for hadoop input/output formats? it would
> >> > require adding one more parameter to all methods to deal with hadoop
> >> > input/output formats, but after that it's done. one parameter to rule
> >> > them all....
> >> >
> >> > then i could do:
> >> >
> >> > val x = sc.textFile("/some/path", formatSettings =
> >> >   Map("mapred.min.split.size" -> "12345"))
> >> >
> >> > or
> >> >
> >> > rdd.saveAsTextFile("/some/path", formatSettings =
> >> >   Map("mapred.output.compress" -> "true",
> >> >     "mapred.output.compression.codec" -> "somecodec"))
> >> >
> >>
