commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benedikt Ritter <brit...@apache.org>
Subject Re: [CSV] CSV-58 is killing me
Date Thu, 02 May 2013 17:26:17 GMT
2013/5/2 sebb <sebbaz@gmail.com>

> On 2 May 2013 14:56, sebb <sebbaz@gmail.com> wrote:
>
> > On 2 May 2013 09:48, Benedikt Ritter <britter@apache.org> wrote:
> >
> >> 2013/5/1 sebb <sebbaz@gmail.com>
> >>
> >> > On 1 May 2013 08:53, Benedikt Ritter <britter@apache.org> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I have tried to solve CSV-58 - Escape handling needs rethinking [1]
> a
> >> few
> >> > > times over the past weeks now. But I can not see a way to get this
> >> > working.
> >> > > There are two problems with our current implementation:
> >> > >
> >> > > 1. a sequence of " test \a test" will be parsed to "test a test",
> >> > although
> >> > > there is no reason to escape 'a'. The expected behavior is to let
> the
> >> > > sequence \a unchanged.
> >> > >
> >> >
> >> > An alternative would be to throw an Exception.
> >> >
> >>
> >> Haven't thought about that alternative so far. It is was the java
> compiler
> >> would do if it finds "\a" in a String.
> >> If we decide to through an exception, the message should state very
> >> clearly
> >> what has gone wrong, including the position of the misplaced escape
> char.
> >> I'd like to hear more opinions on this.
> >>
> >>
> > Throwing an exception stops the parsing, and obviously only catches the
> > first error.
> > Leaving the sequences as is means potentially having to scan a large
> > output file to find them all.
> >
> > There's another option: invoke a callback when any invalid sequences are
> > detected.
> >
> > That would allow for various possibilities, depending on what callback
> was
> > provided.
> > - throw Exception with details
> > - ignore (with optional log)
> > - replace with some other sequence (with optional log)
> >
> > I don't think CSV should do any logging of its own, but the caller can do
> > so in the callback
> >
> > However we would need to decide what the default callback should do.
> > If we throw an Exception, then it is obvious that something is wrong.
> > If we ignore the sequence (leave as is) then there is no easy way for the
> > caller to know if there have been any errors.
> >
> >
> >> >
> >> >
> >> > > 2. If the escape char is different from backslash (for example
> '!'), a
> >> > > sequence of "test !r test" will be parsed to "test CR test"
> >> > >
> >> > >
> >> > Is that how other CSV parsers behave? i.e. do they always use \ for
> >> > escaping EOL?
> >> >
> >>
> >> Haven't tested this. I'll try to have a look this weekend.
> >>
> >> The problem with the current impl is a backslash as part of an escape
> >> sequence.
> >> This is what readEscape() currently does: It checks if the lexer has
> >> reached a character combination of (for example) \r and replaces this
> with
> >> CR. Otherwise it swallows the escape character and returns the next
> >> character.
> >>
> >> There are two problems with this:
> >> 1. it does so if it recognizes the escape character (which doesn't have
> to
> >> be a backslash). This is why !r gets replaced by CR.
> >> 2. it swallows the escape and returns the next character (default branch
> >> in
> >> the switch statement). This should only happen if the next character is
> >> one
> >> of the special characters from the format (e.g. the delimiter). This is
> >> why
> >> '\a' gets replaced by 'a'
> >>
> >
> > I think that's wrong, because one cannot distinguish "\a" from a".
> >
> > The readEscape() method currently returns an int; this cannot be used to
> > distinguish between a valid escape and some other character.
> > If we only want to support throwing an Exception for invalid escape
> > sequence, then it's trivial to fix this - just throw the Exception.
> >
> > However, to support "as is" passthrough of invalid sequences, the caller
> > needs to know when to restore the escape character to the output.
> >
> > One way to do this would be to change the return type to a char[] array.
> > This would either be a single char or the escape followed by a single
> char.
> > The return array could just be appended to the buffer. This should work
> > quite well with a call-back, as the callback could also return a char
> array.
> >
> > The only downside I can see is that append(char[]) may be slightly slower
> > than append(char).
> >
> >
> Found another way to do it: return -1 (EOF) which cannot otherwise be
> returned by the method.
> The last character is then accessible via in.getLastChar().
> i.e. if -1 is returned, output escape followed by lastChar.
>
>
> Needs some tweaks to readEscape, because \LF etc. stand for themselves,
> i.e. both \r and \CR => CR.
>
> What needs to be decided is whether both !r and !CR => CR if the escape
> character is ! rather than \, and if so, whether \CR still stands for CR.
>
> But I think the parsing can be tweaked without too much rewriting.
>

I think there are four cases to cover (I'm using ! as escape in this
example):
1. !r => nothing to escape, append '!r'
2. !\r => the CR character in its character representation. Replace '\r' by
CR and escape it. continue parsing of the token
3. !CR => esacpe CR and contine parsing of the token
4. !\CR => escape '\' if it is one of the characters used in the format, in
other words: remove the '!' and append the '\'. Then EOL => next record,
but only if CR is the record separator (?).

I've tried several times and failed. Do you have the time to implement your
proposal? I'll probably have some time this weekend to try once again :-)


>
>
> > An alternative is to throw an exception.
> >>
> >> I think it may be necessary to separate backslash handling from escape
> >> handling.
> >>
> >>
> >>
> > Possibly. However that may make parsing harder.
> >
> >
> >> >
> >> >
> >> > > I have tried different approaches to handle this issue but there are
> >> > always
> >> > > corner cases that I'm unable to cover. I'm getting the feeling that
> >> I'm
> >> > > just to dump to solve this, so any comments or suggestions would be
> >> > really
> >> > > appreciated!
> >> > >
> >> > >
> >> > Perhaps it would help to commit the unit tests you are working to
> solve.
> >> > The tests can be disabled.
> >> >
> >> > Or perhaps better use a class whose name is not one of the ones picked
> >> up
> >> > by Surefire.
> >> > This allows the test to be run independently by Eclipse (and Maven)
> but
> >> > won't mess with CI builds.
> >> >
> >>
> >> I have already committed three tests to CSVLexerTest and set them to be
> >> ignored. Maybe you can review those and tell me whether my assumptions
> are
> >> correct.
> >>
> >>
> >> >
> >> > Benedikt
> >> > > [1] https://issues.apache.org/jira/browse/CSV-58
> >> > >
> >> > > --
> >> > > http://people.apache.org/~britter/
> >> > > http://www.systemoutprintln.de/
> >> > > http://twitter.com/BenediktRitter
> >> > > http://github.com/britter
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> http://people.apache.org/~britter/
> >> http://www.systemoutprintln.de/
> >> http://twitter.com/BenediktRitter
> >> http://github.com/britter
> >>
> >
> >
>



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message