spark-user mailing list archives

From Matt Deaver <mattrdea...@gmail.com>
Subject Re: Merging Schema while reading Parquet files
Date Tue, 21 Mar 2017 15:13:51 GMT
You could create a one-time job that rewrites the historical data to match the
updated format.
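A minimal sketch of such a one-time backfill, assuming PySpark; the paths and app name are hypothetical placeholders, and "col1" is the partition column named in the thread:

```python
# One-time backfill sketch: rewrite the historical, unpartitioned output of
# job "A" with the same partitioning the job now produces.
# Paths and the app name below are hypothetical placeholders.

def partition_dir(column, value):
    # Hive-style partition directory name that partitionBy() writes on disk,
    # e.g. "col1=xyz2" under day=2017-02-15/filling=5/.
    return "{}={}".format(column, value)

def backfill(spark, old_path, new_path):
    # Read the old, unpartitioned Parquet data and rewrite it
    # partitioned by "col1", matching the new layout of job "A".
    df = spark.read.parquet(old_path)
    df.write.mode("overwrite").partitionBy("col1").parquet(new_path)

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("backfill-col1").getOrCreate()
    backfill(spark,
             "file:///path1/name/sample_old/",   # hypothetical old location
             "file:///path1/name/sample/")       # hypothetical new location
    spark.stop()
```

Once the old data is rewritten this way, job "B" sees a single consistent partition layout and can read the base path directly.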

On Tue, Mar 21, 2017 at 8:53 AM, Aditya Borde <bordecorp@gmail.com> wrote:

> Hello,
>
> I'm currently blocked by this issue:
>
> I have a job "A" whose output is partitioned by one of its fields, "col1".
> Job "B" reads the output of job "A".
>
> Here is the problem: the output of job "A" was previously not partitioned
> by "col1" (this is a recent change), so none of the historical output of
> job "A" is partitioned by "col1".
> When I run job "B" over both the old and the new data, it fails with:
> "inconsistent partition column names"
>
> *The read path is something like "file://path1/name/sample/"* ---> but
> underneath it there are directories like *"day=2017-02-15/filling=5/xyz1"*
>
> The new output adds one more level to the directory structure: "
> */day=2017-02-15/filling=5/col1/xyz2"*
>
> "mergeSchema" does not help here, because my base path has multiple
> directories under which the files reside.
>
> Can someone suggest an effective solution here?
>
> Regards,
> Aditya Borde
>
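If rewriting the old data is not an option, another sketch is to read the two layouts separately and align them by hand before job "B" processes them. This assumes PySpark with DataFrame.unionByName (available from Spark 2.3), and the paths below are hypothetical placeholders:

```python
# Sketch: read the old (unpartitioned) and new (partitioned-by-"col1") data
# from separate paths, then align their columns and union them so that
# job "B" sees one DataFrame. Paths are hypothetical placeholders.

def missing_columns(expected, actual):
    # Columns present in the new layout but absent from the old one
    # (e.g. the partition column "col1").
    return [c for c in expected if c not in actual]

def read_both(spark, old_path, new_path):
    from pyspark.sql import functions as F
    new_df = spark.read.parquet(new_path)
    old_df = spark.read.parquet(old_path)
    # Add any partition column the old data lacks as nulls, then union
    # by name so that column order does not matter.
    for c in missing_columns(new_df.columns, old_df.columns):
        old_df = old_df.withColumn(c, F.lit(None).cast("string"))
    return new_df.unionByName(old_df)
```

This avoids touching the historical files, at the cost of the old rows carrying null for "col1".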



-- 
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/
