commons-issues mailing list archives

From "Lars Bruun-Hansen (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (CSV-253) Handle absent values in input (null)
Date Tue, 29 Oct 2019 15:45:00 GMT

    [ https://issues.apache.org/jira/browse/CSV-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962105#comment-16962105
] 

Lars Bruun-Hansen edited comment on CSV-253 at 10/29/19 3:44 PM:
-----------------------------------------------------------------

[~ggregory]  Sorry, but the whole point of PR-51 is that {{nullString}} cannot handle the
issue at hand. The {{nullString}} feature fulfills a different purpose, so something else
is required.

Example:

The aim is to parse the following CSV:
{noformat}
"John",,""{noformat}
 

What happens when using the {{nullString}} feature to tackle the problem is summarized below:
||Setting||element1||element2||element3||
|<expected result>|"John"|null|""|
|with nullString = null|"John"|""|""|
|with nullString = ""|"John"|null|null|

As can be seen, no {{nullString}} setting achieves the desired result. This is essentially because
Apache CSV at the moment has no concept of what I call an _absent value_. To the Lexer, element2
and element3 have the same value. They don't!
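
The table can be reproduced with a small model. This is only a sketch: it assumes, as described above, that the lexer has already reduced both element2 and element3 to the empty string. {{applyNullString}} and {{NullStringModel}} are hypothetical names for illustration, not Commons CSV code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullStringModel {

    // Hypothetical model: map every lexed value that equals nullString to null.
    // A null nullString means the feature is off, so values pass through unchanged.
    static List<String> applyNullString(List<String> lexed, String nullString) {
        List<String> out = new ArrayList<>();
        for (String value : lexed) {
            out.add(nullString != null && nullString.equals(value) ? null : value);
        }
        return out;
    }

    public static void main(String[] args) {
        // What the lexer yields for "John",,"" in this model: the quotes are gone,
        // so the absent element2 and the quoted-empty element3 are both "".
        List<String> lexed = Arrays.asList("John", "", "");

        System.out.println(applyNullString(lexed, null)); // [John, , ]
        System.out.println(applyNullString(lexed, ""));   // [John, null, null]
    }
}
```

Because the mapping only sees the value and not whether it was quoted, any rule that turns element2 into null also turns element3 into null.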

With the PR the parser becomes aware of the difference between element2 and element3.
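
To illustrate the idea (this is not the code from the PR), here is a minimal single-line parser that remembers whether a field was quoted. The class {{AbsentAwareSketch}}, the method {{parseLine}} and its boolean parameter are assumptions made for this sketch; only the name {{absentIsNull}} comes from the proposal. It handles comma delimiters and double-quote escaping only.

```java
import java.util.ArrayList;
import java.util.List;

public class AbsentAwareSketch {

    // Parses one CSV line. A field with zero characters between delimiters is
    // "absent" and becomes null only when absentIsNull is true; a quoted empty
    // field ("") always stays an empty string.
    static List<String> parseLine(String line, boolean absentIsNull) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean sawQuote = false; // did the current field contain a quote?

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // a doubled quote is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                sawQuote = true;
            } else if (c == ',') {
                fields.add(finish(current, sawQuote, absentIsNull));
                current.setLength(0);
                sawQuote = false;
            } else {
                current.append(c);
            }
        }
        fields.add(finish(current, sawQuote, absentIsNull));
        return fields;
    }

    // An absent value is an empty field that never saw a quote character.
    private static String finish(StringBuilder current, boolean sawQuote, boolean absentIsNull) {
        boolean absent = current.length() == 0 && !sawQuote;
        return absent && absentIsNull ? null : current.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"John\",,\"\"", true));  // [John, null, ]
        System.out.println(parseLine("\"John\",,\"\"", false)); // [John, , ]
    }
}
```

With the flag off the sketch returns empty strings for both elements, which mirrors the opt-in, backwards-compatible behaviour proposed here.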

You can also see [this question|https://stackoverflow.com/questions/34734125/apache-common-csvparser-csvrecord-to-return-null-for-empty-fields]
on Stack Overflow. In one of the answers, the Apache CSV library is criticized for not being able
to handle this situation. Unfortunately, that criticism is correct.

 
h3. Why two settings?

Of course there's a certain conceptual overlap between the proposed new setting on formatter,
{{absentIsNull}}, and the existing {{nullString}}, and if the library were designed again
from scratch they could probably be conflated. But now we have the history, and the way
{{nullString}} works cannot be touched, as that would break backwards compatibility. I also believe
99.9% of users of the library would actually want to parse an absent value as null,
but I don't dare propose that as a new default, because it too would break backwards compatibility.
Hence, I propose a new setting on Formatter, and I propose that it be an opt-in feature.

 

 



> Handle absent values in input (null)
> ------------------------------------
>
>                 Key: CSV-253
>                 URL: https://issues.apache.org/jira/browse/CSV-253
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>            Reporter: Lars Bruun-Hansen
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The parser must be able to handle absent values in input and translate that into {{null}}
> as required. I see several tickets on this matter in the history, but none seem to have addressed
> the issue, at least not for parsing.
> For this problem, I see a need to introduce a new term:
> Definition: _Absent value_ is when there are zero characters between field delimiters.
> Specifically the aim is to be able to parse the following:
> {noformat}
>     "John",,"Doe"    // 2nd element is absent
>     ,"AA",123        // 1st element is absent
>     "John",90,       // 3rd element is absent
>     "",,90           // 2nd element is absent (1st element isn't)
> {noformat}
>  
> See also CSV-93 which I think never addressed the issue, probably because the reporter
> was happy with having the issue fixed for CSV output, not for parsing.
> A PR is coming...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
