subversion-users mailing list archives

From Branko Čibej
Subject Re: SVN Blame Returns Corrupt Data
Date Fri, 11 Oct 2013 19:29:47 GMT
On 11.10.2013 19:25, Stefan Sperling wrote:
> On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
>> On 10/11/13 9:22 AM, Branko Čibej wrote:
>>> You'd have to extend Subversion's file type detection to detect UTF-16.
>>> See svn_io_detect_mimetype2 in line 3333 in this file:
>>> Subversion currently only looks at the first 1k Bytes of a file. It may
>>> be enough to check that this initial part of the file contains only
>>> valid UTF-16 (BE or LE) codes.
>> Even if all we looked for is the BOM it might be helpful enough.  I suspect the
>> development tools producing UTF-16 are including BOMs.  Windows seems to be
>> fond of including them, Notepad puts one even on UTF-8.
> Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
> processing them for diff/merge/blame, and convert output written to
> the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2424 as an end-of-line marker in
UTF-8 text.
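
To make the round trip concrete, here is a minimal sketch in Python rather than Subversion's C. It assumes the content has already been identified as UTF-16 and starts with a BOM. The helper names utf16_to_utf8 and utf8_to_utf16 are hypothetical and not part of any Subversion API. It does nothing about extra end-of-line markers.

```python
import codecs


def utf16_to_utf8(data: bytes):
    """Convert BOM-prefixed UTF-16 bytes to UTF-8.

    Returns (utf8_bytes, codec_name) so the caller can convert back
    to the same byte order later. Hypothetical helper, illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        codec = "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        codec = "utf-16-be"
    else:
        raise ValueError("no UTF-16 BOM found")
    # Strict decoding: invalid sequences raise UnicodeDecodeError.
    text = data[2:].decode(codec)
    return text.encode("utf-8"), codec


def utf8_to_utf16(data: bytes, codec: str) -> bytes:
    """Convert UTF-8 bytes back to UTF-16 in the given byte order, re-adding the BOM."""
    bom = codecs.BOM_UTF16_LE if codec == "utf-16-le" else codecs.BOM_UTF16_BE
    return bom + data.decode("utf-8").encode(codec)
```

The diff/merge/blame code would then only ever see the UTF-8 form, and anything written back to the working file would go through the second helper.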

Still, we'd actually have to correctly identify UTF-16 content first,
and handle invalid byte sequences.
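
As a rough illustration of that detection step, the sketch below (Python, not Subversion code) checks for a BOM and then verifies that the first 1k bytes decode as strict UTF-16. The name sniff_utf16 and the SNIFF_BYTES constant are assumptions made for the example. It ignores BOM-less UTF-16, and it ignores the fact that the UTF-32 LE BOM also begins with FF FE.

```python
import codecs

SNIFF_BYTES = 1024  # assumption: mirrors the "first 1k bytes" mentioned in the thread


def sniff_utf16(head: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None for the leading bytes of a file.

    Hypothetical helper: requires a BOM, then checks that the sniffed
    prefix contains only valid UTF-16 code unit sequences.
    """
    head = head[:SNIFF_BYTES]
    if head.startswith(codecs.BOM_UTF16_LE):
        codec = "utf-16-le"
    elif head.startswith(codecs.BOM_UTF16_BE):
        codec = "utf-16-be"
    else:
        return None
    body = head[2:]
    # Drop a trailing odd byte; the prefix may cut a code unit in half.
    body = body[: len(body) - len(body) % 2]
    decoder = codecs.getincrementaldecoder(codec)(errors="strict")
    try:
        # final=False tolerates a surrogate pair split at the end of the prefix.
        decoder.decode(body, final=False)
    except UnicodeDecodeError:
        return None
    return codec
```

A lone surrogate inside the prefix makes the function return None, so such a file would keep being treated the way it is today.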

> That would require some work because existing streams, strings, and files
> passed around in the code would need to be wrapped so that translation
> between the internal and the external encoding is seamless.
> But I don't see why such an approach couldn't be made to work in principle.
> It might even result in some spring cleaning in the code base and pave the
> way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

> What do you think? Is it worth adding this to our project ideas page?

It's already here:

-- Brane

Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
