crunch-dev mailing list archives

From "Mikael Goldmann (JIRA)" <>
Subject [jira] [Created] (CRUNCH-601) Short PCollections is SparkPipeline get length null.
Date Tue, 12 Apr 2016 16:22:25 GMT
Mikael Goldmann created CRUNCH-601:

             Summary: Short PCollections is SparkPipeline get length null.
                 Key: CRUNCH-601
             Project: Crunch
          Issue Type: Bug
          Components: Spark
    Affects Versions: 0.13.0
         Environment: Running in local mode on a Mac as well as in an Ubuntu 14.04 Docker container
            Reporter: Mikael Goldmann
            Priority: Minor

I'll attach a file with a test that I would expect to pass but which fails.

It creates five PCollection<String> of lengths 0, 1, 2, 3, and 4, gets their lengths, runs
the pipeline, and prints the lengths. Finally, it asserts that all lengths are non-null.

I would expect it to print lengths 0, 1, 2, 3, 4 and pass.

Instead, it prints lengths null, null, null, 3, 4 and fails.

I think the underlying reason is the use of getSize() on an unmaterialized object, together
with the assumption that when the estimate returned by getSize() is 0, the PCollection is
guaranteed to be empty. That assumption is false in some cases.
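
To make the suspected failure mode concrete, here is a minimal, self-contained sketch in plain Java. It is a toy model and not Crunch code: the class SizeEstimateSketch, its methods, and the "divide by 3" estimate are all hypothetical, chosen only so that short, non-empty inputs get a size estimate of 0. It shows how a shortcut that trusts a zero estimate could yield null for lengths 0, 1, and 2 while still returning 3 and 4.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names or formulas come from Crunch.
final class SizeEstimateSketch {

    // Hypothetical coarse size estimate. Integer division rounds small,
    // non-empty inputs down to 0, so an estimate of 0 does not imply "empty".
    static long estimatedSize(List<String> elements) {
        return elements.size() / 3;
    }

    // Flawed shortcut (assumed, for illustration): a zero estimate is treated
    // as proof of emptiness, so no length is ever computed for such inputs.
    static Long lengthTrustingEstimate(List<String> elements) {
        if (estimatedSize(elements) == 0) {
            return null;
        }
        return (long) elements.size();
    }

    // Safer variant: ignore the estimate and count the actual elements.
    static Long lengthByCounting(List<String> elements) {
        return (long) elements.size();
    }

    // Helper that builds a list with n placeholder strings.
    static List<String> ofLength(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("element-" + i);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 4; n++) {
            List<String> input = ofLength(n);
            System.out.println(n + ": trusting estimate = " + lengthTrustingEstimate(input)
                    + ", counting = " + lengthByCounting(input));
        }
    }
}
```

In this sketch, the estimate-trusting variant returns null for inputs of length 0, 1, and 2, and the counting variant returns the true length every time. Whether the real cause in SparkPipeline has this shape is still only my guess.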
