drill-dev mailing list archives

From jaltekruse <...@git.apache.org>
Subject [GitHub] drill pull request: DRILL-4203: fix dates written into parquet fil...
Date Wed, 27 Jan 2016 18:56:09 GMT
GitHub user jaltekruse opened a pull request:


    DRILL-4203: fix dates written into parquet files to conform to parquet format spec

    This branch updates the version number to 1.5.0. This is required because we need a
    hard release to signal that all parquet files written from now on are not corrupted.
    Without this change, the fixed files produced by the writer would still be considered
    corrupt, because all files generated by earlier commits with the version 1.5.0-SNAPSHOT
    are actually corrupted. This commit can be removed or amended when the changes are
    merged, but this patch should be followed immediately by a change of the version number.
    Otherwise we risk generating files with corrected date values but a version number that
    tells the reader to still shift the dates.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaltekruse/incubator-drill 4203-parquet-dates-bug-squash2

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #341
commit 3cbbe1c418ec8e802144f6cba1d88ede9de7f930
Author: Jason Altekruse <altekrusejason@gmail.com>
Date:   2015-12-31T16:22:04Z

    DRILL-4203: Fix date values written in parquet files created by Drill
    Drill was writing non-standard dates into parquet files for all releases
    before 1.5.0. Drill itself has read these values correctly, but external
    tools like Spark reading the files will see corrupted values for all
    dates that have been written by Drill.
    This change corrects the behavior of the Drill parquet writer to correctly
    store dates in the format given in the parquet specification.
    To maintain compatibility with old files, the parquet reader code has
    been updated to check for the old format and automatically shift the
    corrupted values into corrected ones.
    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values they
    had in previous releases. For compatibility with external tools, any old
    files with corrupted dates can be re-written using the CREATE TABLE AS
    command (the writer now produces only specification-compliant values,
    even when the data was read from older corrupt files).
    While the old behavior consistently shifted dates into a range unlikely
    to be used in a modern database (over 10,000 years in the future), the
    shifted values are still valid dates. Because such values may have been
    written into files intentionally, and we cannot be certain from the
    metadata whether Drill produced the files, an option is included to turn
    off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included
    for completeness.
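
As a rough illustration of the reader-side correction described in this commit message, the sketch below shifts a stored day count back into range and honors a switch that disables the correction. Everything named here is an assumption made for the example: the class and method names are hypothetical, the shift is assumed to be twice the Julian day number of the Unix epoch, and the year-5000 cutoff is an assumed heuristic for spotting shifted values. It is not the code in this pull request.

```java
import java.time.LocalDate;

// Hypothetical sketch; names and constants are assumptions, not Drill's actual code.
class DateShiftSketch {
    // Julian day number of 1970-01-01.
    static final int JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588;
    // Assumed offset that the old writer added to every stored date.
    static final int ASSUMED_SHIFT = 2 * JULIAN_DAY_OF_UNIX_EPOCH;
    // Assumed cutoff: day counts past year 5000 are treated as shifted.
    static final int THRESHOLD = (int) LocalDate.of(5000, 1, 1).toEpochDay();

    // Returns days since the Unix epoch, undoing the shift when it looks present.
    static int correct(int storedDays, boolean autoCorrect) {
        if (autoCorrect && storedDays > THRESHOLD) {
            return storedDays - ASSUMED_SHIFT;
        }
        return storedDays;
    }

    public static void main(String[] args) {
        int good = (int) LocalDate.of(2016, 1, 27).toEpochDay();
        int corrupt = good + ASSUMED_SHIFT;
        // Shifted value is moved back: prints 2016-01-27
        System.out.println(LocalDate.ofEpochDay(correct(corrupt, true)));
        // In-range value is left alone: prints 2016-01-27
        System.out.println(LocalDate.ofEpochDay(correct(good, true)));
        // With auto-correction off, the far-future value is returned unchanged.
        System.out.println(LocalDate.ofEpochDay(correct(corrupt, false)));
    }
}
```

With the assumed constants, the shift is about 13,000 years, which matches the "over 10,000 years in the future" range mentioned above. The `autoCorrect` flag stands in for the option that turns the correction off for files where such dates were written on purpose.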

commit 9a3f3b8a3d599d3e8981c7b987f229809db8eec4
Author: Jason Altekruse <altekrusejason@gmail.com>
Date:   2016-01-27T18:20:01Z

    Fix DrillVersionInfo to make it provide a valid version number even during
    the unit tests.
    This is now a build-time generated class, rather than one that looks on the
    classpath for META-INF files.
    This pattern for file generation with parameters passed from the POM files
    was borrowed from parquet-mr.

commit fb4bc2271c625dd25729575fc77f117b2c1d0a72
Author: Jason Altekruse <altekrusejason@gmail.com>
Date:   2016-01-26T04:19:24Z

    Changing version of Drill to 1.5.0
    This isn't actually the 1.5.0 release, but the primary condition used
    to identify if corrected dates are stored in a parquet file is the
    Drill version included in the metadata. This version number is retrieved
    from the META-INF in the drill jar. This version number change is needed
    to make some of the regression tests pass; otherwise the 1.5.0-SNAPSHOT
    version will make the tests assume that the files are corrupt (as all
    commits before this one were writing corrupt dates).
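
The version-based check described above can be sketched as follows. This is only an illustration under stated assumptions: the class, enum, and method names are hypothetical, and the version string is assumed to have the form `major.minor.patch` with an optional `-SNAPSHOT` suffix. Files with no recognizable Drill version would need other checks, which this sketch leaves out.

```java
// Hypothetical sketch; names are illustrative, not Drill's actual API.
class WriterVersionSketch {
    enum DateStatus { CORRUPT, CORRECT, UNKNOWN }

    // First version assumed to write specification-compliant dates.
    private static final int[] FIXED_VERSION = {1, 5, 0};

    // Classifies a file by the Drill version recorded in its metadata.
    static DateStatus classify(String drillVersion) {
        if (drillVersion == null) {
            return DateStatus.UNKNOWN;
        }
        String suffix = "-SNAPSHOT";
        boolean snapshot = drillVersion.endsWith(suffix);
        String numeric = snapshot
            ? drillVersion.substring(0, drillVersion.length() - suffix.length())
            : drillVersion;
        String[] parts = numeric.split("\\.");
        if (parts.length != 3) {
            return DateStatus.UNKNOWN;
        }
        try {
            for (int i = 0; i < 3; i++) {
                int cmp = Integer.compare(Integer.parseInt(parts[i]), FIXED_VERSION[i]);
                if (cmp < 0) return DateStatus.CORRUPT;  // older than 1.5.0
                if (cmp > 0) return DateStatus.CORRECT;  // newer than 1.5.0
            }
        } catch (NumberFormatException e) {
            return DateStatus.UNKNOWN;
        }
        // Exactly 1.5.0: snapshot builds predate the fix, the release does not.
        return snapshot ? DateStatus.CORRUPT : DateStatus.CORRECT;
    }

    public static void main(String[] args) {
        System.out.println(classify("1.5.0-SNAPSHOT")); // prints CORRUPT
        System.out.println(classify("1.5.0"));          // prints CORRECT
    }
}
```

This shows why the branch has to report 1.5.0 rather than 1.5.0-SNAPSHOT: under a check of this kind, files written by the fixed writer would otherwise be classified as corrupt and have their dates shifted a second time.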


