jena-dev mailing list archives

From "Rob Vesse (JIRA)" <j...@apache.org>
Subject [jira] Created: (JENA-12) Turtle Files with a UTF-8 BOM fail to parse
Date Sat, 18 Dec 2010 13:06:00 GMT
Turtle Files with a UTF-8 BOM fail to parse
-------------------------------------------

                 Key: JENA-12
                 URL: https://issues.apache.org/jira/browse/JENA-12
             Project: Jena
          Issue Type: Bug
          Components: RIOT
         Environment: Windows 7, latest Sun Java Runtime, Jena 2.6.4
            Reporter: Rob Vesse


If a Turtle file starts with a UTF-8 BOM, Jena refuses to parse it and gives the following
error:

Exception in thread "main" com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1, column 2.  Encountered: "@" (64), after : "\ufeff"
    at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:44)
    at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:21)
    at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:101)
    at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:68)
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:226)
    at TurtleWithBOM.main(TurtleWithBOM.java:31)

The code I used to produce this error was as follows:

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;

import java.io.*;

public class TurtleWithBOM
{

    public static void main(String[] args)
    {

        // create an empty model
        Model model = ModelFactory.createDefaultModel();

        InputStream in = FileManager.get().open( "ttl-with-bom.ttl" );
        if (in == null)
        {
            throw new IllegalArgumentException( "File: ttl-with-bom.ttl not found");
        }

        // read the Turtle file
        model.read(in, "", "TTL");

        // write it to standard out
        model.write(System.out);
    }
}

A sample Turtle file used with the above code can be found attached to the original report
to the Jena Users mailing list here - http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201012.mbox/%3CEMEW3|b0e33a3dc6849ef75f49c8891480853dmBGBgv06rav08r|ecs.soton.ac.uk|c9ad8cb3882263b3dc55f8c2a5b1a40f@ecs.soton.ac.uk%3E

The data files come from my software, which is all written in .NET. When outputting UTF-8,
the default behaviour of .NET is to include the BOM at the start of the file. The BOM is
not required for UTF-8, but it is not forbidden either, so I think this should be fixed (if
possible) for future releases. I will be modifying my software so that my users can disable
output of the BOM if desired.
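
Until the parser handles this, a caller-side workaround is possible. The following is a minimal sketch, not Jena's fix and not part of the Jena API: BomSkipper and skipUtf8Bom are hypothetical names introduced here for illustration. It assumes the input is UTF-8 and simply drops the three BOM bytes (EF BB BF) when they appear at the very start of the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration only; it is not part of Jena.
final class BomSkipper {
    private BomSkipper() {}

    /**
     * Returns a stream positioned just after a leading UTF-8 BOM (EF BB BF).
     * If the stream does not start with a BOM, no bytes are lost.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // three bytes have been read or the stream ends.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so the parser sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

In the program above, this would be used as model.read(BomSkipper.skipUtf8Bom(in), "", "TTL"), with main declared to throw IOException.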

Judging by the error message, I expect the same problem also affects N3 files, since as far
as I can tell from the stack trace they use the same reader.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

