xmlgraphics-fop-users mailing list archives

From Fung Cheung <fche...@indeed.com.INVALID>
Subject Double byte Unicode Char incorrect when parsed
Date Sat, 15 Feb 2020 01:11:06 GMT
Hello,

We have been using FOP for HTML-to-PDF generation, and recently I noticed
that when Unicode characters are included, the PDF output has text-extraction issues.

How we use it (FOP 2.4):
- We start with an HTML string:

final InputSource source = new InputSource(
        new ByteArrayInputStream(htmlString.getBytes(StandardCharsets.UTF_8)));
        // explicit charset; bare getBytes() uses the platform default

FopFactoryBuilder builder = new FopFactoryBuilder(
        URI.create(resourceLoader.getResource(resourceBasePath).getURI().toString()),
        new ClasspathResolverURIAdapter());
builder.setConfiguration(configuration);

FopFactory factory = builder.build();
userAgent = factory.newFOUserAgent();
userAgent.setAuthor("Indeed");
userAgent.setCreator("Indeed");
userAgent.setTitle("Indeed");
userAgent.setKeywords("Indeed");

fop = factory.newFop(MimeConstants.MIME_PDF, userAgent, outputStream);

// Set up CSSToXSLFOFilter to transform the XHTML input into XSL-FO
final URL baseUrl = resourceLoader.getResource(resourceBasePath).getURL();
Loggers.debug(LOGGER, "Parsing HTML response using base URL '%s'", baseUrl);
final XMLReader xmlParser = Util.getParser(null, isValidatingParser);
final ProtectEventHandlerFilter eventHandlerFilter =
        new ProtectEventHandlerFilter(true, true, xmlParser);

final XMLReader filter = new CSSToXSLFOFilter(
        baseUrl,
        null,
        Collections.emptyMap(),
        eventHandlerFilter,
        cssToXslFoDebugEnabled);

filter.setEntityResolver(classPathEntityResolver);
filter.setContentHandler(fop.getDefaultHandler());
filter.parse(source);
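One thing worth ruling out on the input side: String.getBytes() with no argument uses the platform default charset, which can silently corrupt astral-plane characters before FOP ever sees them. A minimal standalone check (CharsetCheck is just an illustrative name, not part of our code):

```java
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    public static void main(String[] args) {
        // U+1F602 FACE WITH TEARS OF JOY, written as its UTF-16 surrogate pair
        String html = "<p>\uD83D\uDE02</p>";
        // An explicit charset round-trips reliably on every platform,
        // unlike the no-argument getBytes():
        byte[] utf8 = html.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(html)); // prints "true"
    }
}
```

In our case the glyphs render correctly, so the bytes presumably arrive intact; this only rules out one failure mode.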


This produces a PDF in which every character is displayed correctly; to a
human reader it looks right.

We also have a use case of reading the PDF programmatically. We are testing
this by selecting the text in Adobe Reader and copying and pasting it; the
copied output matches what extraction tools such as pdftotext and PDFBox produce.

However, when many Unicode characters are present, three things happen when we copy:
1) Some Unicode characters are copied out as other, seemingly random characters.
e.g. source: πŸ˜‚πŸ˜‚πŸ˜‚πŸ˜‚πŸ˜‚ πŸƒ‹πŸƒ‹πŸƒ‹πŸƒ‹πŸƒ‹ πŸƒ‹πŸƒ‹πŸƒ‹ jack 𝍐𝍐𝍐 3 chars
π„žπ„ž 2 music
majhog : πŸ€€ πŸ€€ πŸ€€
copy output: πŸ˜‚πŸ˜‚πŸ˜‚πŸ˜‚πŸ˜‚ 33333 333πŸƒ‹ jackπŸƒ‹ 555   charsπŸƒ‹ 𝍐 2 musicπŸƒ‹
majhog 8 3 3 3

2) Characters end up in the wrong location.
e.g. In the above example, the next page of the PDF does not contain πŸƒ‹, yet
the copied text shows "πŸƒ‹πŸƒ‹πŸƒ‹" somewhere on that next page.

3) Some fonts produce corrupted PDF output. We were trying out mathematical
fonts, e.g. "𝐏𝐫𝐨𝐟𝐒π₯𝐞
<https://www.fileformat.info/info/unicode/char/1d40f/fontsupport.htm>".
This was fixable by using the Symbola font with embedding-mode="full", which
produces a correct-looking PDF. However, copying "𝐏𝐫𝐨𝐟𝐒π₯𝐞" gives
"퐏퐫퐨퐟퐒ν₯퐞". Comparing the two, the bold capital P is U+1D40F
<https://codepoints.net/U+1D40F>, while the corresponding Korean character is
U+D40F <https://codepoints.net/U+D40F>: the leading "1" of the code point is
missing, i.e. it has been truncated to 16 bits.
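That missing "1" is exactly what truncating a code point to 16 bits looks like. A small standalone sketch (SurrogateDemo is just an illustrative name) showing both the truncation and the UTF-16 surrogate pair that a correct ToUnicode mapping would have to emit instead:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int cp = 0x1D40F; // MATHEMATICAL BOLD CAPITAL P

        // Narrowing the code point to a 16-bit char drops the leading "1",
        // landing on the Hangul syllable U+D40F:
        char truncated = (char) cp;
        System.out.printf("U+%04X%n", (int) truncated); // prints "U+D40F"

        // Java strings are UTF-16, so U+1D40F is stored as a surrogate pair;
        // text extraction only works if the mapping preserves the full value:
        char[] pair = Character.toChars(cp);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // prints "U+D835 U+DC0F"
    }
}
```

This matches the copy-paste symptom exactly, which is why we suspect the code point is being stored in a 16-bit field somewhere in the ToUnicode CMap generation.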

This has been frustrating and I have googled everywhere. It seems to be
related to how FOP writes the ToUnicode CMap for an embedded font. I
confirmed this by producing the same PDF with WeasyPrint (a Python library),
where all characters copy correctly.

Are we using FOP incorrectly? Are there tweaks we can make to fix this?

Thanks so much!
