pig-user mailing list archives

From Michael Doo <michael....@verve.com>
Subject Re: Reading partitioned Parquet data into Pig
Date Fri, 31 Aug 2018 12:19:21 GMT

The Parquet Pig loader is fine if all the data is present, but if I've written out from Spark
using `df.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')`, the values of
those two columns are encoded into the output path and removed from the data files themselves:
s3://path/to/output/colA=valA/colB=valB/part-0001.parquet.
There are hacky workarounds, such as duplicating the columns in Spark before writing, which
fixes loading into Pig but means the duplicated columns re-appear when you read the data back
into Spark.


On 8/30/18, 10:15 AM, "Adam Szita" <szita@cloudera.com.INVALID> wrote:

    Hi Eyal,
    For just loading Parquet files the Parquet Pig loader is okay, although I
    don't think it lets you use the partition values in the dataset later.
    I know plain old PigStorage has a trick with the -tagFile option, but I'm
    not sure whether that would be enough in Michael's case, or whether the
    Parquet loader supports anything similar.
    On Thu, 30 Aug 2018 at 16:10, Eyal Allweil <eyal_allweil@yahoo.com.invalid> wrote:
    > Hi Michael,
    > You can also use the Parquet Pig loader (especially if you're not working
    > with Hive). Here's a link to the Maven repository for it.
    > https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
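    > A minimal load with it might look like this (a sketch; the jar name and
    > path are illustrative, and the loader reads the schema from the Parquet
    > files themselves, so the colA=valA directories won't show up as columns):

```pig
REGISTER parquet-pig-1.10.0.jar;

data = LOAD 's3://path/to/output'
    USING org.apache.parquet.pig.ParquetLoader();
DESCRIBE data;
```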
    > Regards,
    > Eyal
    >    On Tuesday, August 28, 2018, 2:40:36 PM GMT+3, Adam Szita
    > <szita@cloudera.com.INVALID> wrote:
    >  Hi Michael,
    > Yes you can use HCatLoader to do this.
    > The requirement is that you have a Hive table defined on top of your data
    > (probably pointing to s3://path/to/files) (and Hive MetaStore has all the
    > relevant meta/schema information).
    > If you do not have a Hive table yet, you can go ahead and define it in Hive
    > by manually specifying schema information, and after that partitions can be
    > added automatically via Hive's 'MSCK REPAIR TABLE' command.
    > Hope this helps,
    > Adam
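
    A sketch of the steps Adam describes (the table name and schema below are
    made up for illustration):

```pig
-- In Hive first:
--   CREATE EXTERNAL TABLE events (id STRING, value DOUBLE)
--   PARTITIONED BY (some_flag BOOLEAN)
--   STORED AS PARQUET
--   LOCATION 's3://path/to/files';
--   MSCK REPAIR TABLE events;   -- discovers the some_flag=... directories

-- Then in Pig (started with: pig -useHCatalog):
events = LOAD 'default.events' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- the partition column is now an ordinary field
flagged = FILTER events BY some_flag == true;
```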
    > On Mon, 27 Aug 2018 at 19:18, Michael Doo <michael.doo@verve.com> wrote:
    > > Hello,
    > >
    > > I’m trying to read partitioned Parquet data into Pig (so it’s stored in
    > > S3 like
    > >
    > s3://path/to/files/some_flag=true/part-00095-a2a6230b-9750-48e4-9cd0-b553ffc220de.c000.gz.parquet).
    > > I’d like to load it into Pig and add the partitions as columns. I’ve read
    > > some resources suggesting using the HCatLoader, but so far haven’t had
    > > success.
    > >
    > > Any advice would be welcome.
    > >
    > > ~ Michael
    > >
