lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-10934) create a link+anchor checker for the ref-guide PDF using PDFBox
Date Tue, 31 Oct 2017 00:38:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hoss Man updated SOLR-10934:
----------------------------
    Attachment: SOLR-10934.patch



Ok, I'm attaching a really rough and dirty patch that includes:

* A quick and dirty CheckPDFLinksAndAnchors inspired by the SO post mentioned and the original
PrintURLs.java demo from pdfbox
* a build.xml 'nocommit' target to run it against our PDF
* some "broken" changes to our ref-guide content to deliberately introduce a few errors...
*# anchor duplicated in multiple source pages
*# links to each of the different dup anchors
*# link to an anchor that doesn't exist in the specified source doc, but does exist in a different
doc
*# links to a source doc that doesn't exist
*# links to an anchor that doesn't exist (in a source doc that does)

The results aren't promising...

# FAIL: the dup anchors cause asciidoctor to print a WARNING (even w/o any link checking)
that i'd forgotten about, but as far as i can tell from my exploration of the {{PDDocumentCatalog}}
that duplicated information is lost in the underlying PDF (or if it does make it into the
PDF, PDFBox loses it when parsing the PDF, because the "Catalog" is just a Map)
# FAIL: the PDF Annotations for each of the dup links both wind up mapping to the page with
the first occurrence -- again: either because the catalog in the file can only track one location
for a given anchor, or because that's just how PDFBox deals with the precedence of dup dict
keys when reading the file
# FAIL: if an anchor doesn't exist in the specified source {{\*.adoc}} file, but does exist
somewhere else in the final PDF, then that's where asciidoctor points the generated link --
there's nothing weird about it i can detect from PDFBox
# GOOD: links to a source {{\*.adoc}} file that doesn't actually exist on disk are fairly
easy to detect -- asciidoctor's default behavior is to assume that these are links to other
docs that will be converted separately, so they show up as "relative URIs" which we can treat
as a failure (ie: if a link in a PDF is to a non-absolute URI, it must be a content error)
# GOOD: links to an anchor that doesn't exist are likewise easy to identify: the "annotation"
is preserved but has no destination, which we can treat as a failure.
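
For illustration only, the two "GOOD" rules above can be sketched as a small standalone Java helper. This is not the code in the attached patch: the class name {{PdfLinkTargetCheck}}, the method {{check}}, and the idea of passing in an already-extracted URI string and destination page number are all assumptions, and the PDFBox calls needed to pull those values out of each link annotation are deliberately left out. It only uses {{java.net.URI}} to decide whether a target is absolute.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not from the patch): classifies one already-extracted PDF link. */
class PdfLinkTargetCheck {

  /**
   * @param uri      target of a URI-style link, or null if the link is internal
   * @param destPage page number an internal link resolves to, or null if it has none
   * @return a failure message, or null if the link looks fine under the two rules
   */
  static String check(String uri, Integer destPage) {
    if (uri != null) {
      try {
        // Rule 1: a non-absolute URI in the PDF is assumed to be a content error.
        return new URI(uri).isAbsolute() ? null : "BOGUS URI: " + uri;
      } catch (URISyntaxException e) {
        // Extra case not discussed above: treat an unparseable target as a failure too.
        return "UNPARSEABLE URI: " + uri;
      }
    }
    // Rule 2: an internal link whose annotation has no destination is a failure.
    return destPage == null ? "link with no page dest" : null;
  }

  public static void main(String[] args) {
    // The relative target reported in the build output for page 753.
    System.out.println(check("nocommit_bogus_page.pdf#nocommit_bogus_x2", null));
    // An internal link with no resolved destination.
    System.out.println(check(null, null));
  }
}
```

Note that neither rule helps with the three FAIL cases: by the time the link is in the PDF, a link to a duplicated or misplaced anchor looks like any other valid internal link.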

The important bits of the output w/this patch are included below...

{noformat}
-build-raw-pdf:
[asciidoctor:convert] Render SolrRefGuide-all.adoc from /home/hossman/lucene/dev/solr/build/solr-ref-guide/content/pdf
to /home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp with backend=pdf
[asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1: invalid part, must
have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1: invalid part, must have
at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: WARNING: errata.adoc: line 30: id assigned to section already
in use: nocommit_dup_anchor_name
[asciidoctor:convert] asciidoctor: ERROR: SolrRefGuide-all.adoc: line 37: invalid part, must
have at least one section (e.g., chapter, appendix, etc.)
     [move] Moving 1 file to /home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp
...
nocommit:
     [java] Page 753:'Link to bogus page @ anchor that does not exist'=> BOGUS URI: nocommit_bogus_page.pdf#nocommit_bogus_x2
     [java] Page 753:'Link to about @ anchor that does not exist' => link with no page
dest

{noformat}

----

All in all these results are disappointing.

The "Single Page" output behavior of asciidoctor, combined with the "bugs" in asciidoctor's
handling of duplicated anchors in page includes, combined with the underlying structure of
the PDF, make it really hard to find the same types of failures we can find when parsing the
jekyll generated pages using our white-box knowledge of "there must be no dup anchors across
all pages"
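
To make the contrast concrete, here is a rough sketch of the kind of "no dup anchors across all pages" check that is possible when every page's anchors are still visible separately. It is not taken from CheckLinksAndAnchors.java; the class name {{DupAnchorCheck}}, the method {{findDuplicates}}, and the input shape (page name mapped to the anchor ids declared on it) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not from the patch): finds anchor ids declared more than once. */
class DupAnchorCheck {

  /** Returns each anchor id declared more than once, mapped to the pages declaring it. */
  static Map<String, List<String>> findDuplicates(Map<String, List<String>> anchorsByPage) {
    Map<String, List<String>> pagesByAnchor = new TreeMap<>();
    for (Map.Entry<String, List<String>> page : anchorsByPage.entrySet()) {
      for (String anchor : page.getValue()) {
        // Record every page on which this anchor id was declared.
        pagesByAnchor.computeIfAbsent(anchor, k -> new ArrayList<>()).add(page.getKey());
      }
    }
    // Keep only the ids that were declared at least twice.
    pagesByAnchor.values().removeIf(pages -> pages.size() < 2);
    return pagesByAnchor;
  }

  public static void main(String[] args) {
    // Page names and the second anchor id are made up for the example.
    Map<String, List<String>> anchorsByPage = new LinkedHashMap<>();
    anchorsByPage.put("about-this-guide.html", Arrays.asList("nocommit_dup_anchor_name"));
    anchorsByPage.put("errata.html", Arrays.asList("nocommit_dup_anchor_name", "errata"));
    System.out.println(findDuplicates(anchorsByPage));
  }
}
```

In the PDF case this input can't be rebuilt, since (as far as i can tell) only one location per anchor name survives in the catalog.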


> create a link+anchor checker for the ref-guide PDF using PDFBox
> ---------------------------------------------------------------
>
>                 Key: SOLR-10934
>                 URL: https://issues.apache.org/jira/browse/SOLR-10934
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation
>            Reporter: Hoss Man
>         Attachments: SOLR-10934.patch
>
>
> We currently have CheckLinksAndAnchors.java which is automatically run against the ref-guide
HTML as part of the build to use JSoup to find bad links/anchors that asciidoctor doesn't
complain about -- but not everyone does/can build the HTML version of the ref-guide since
it requires manually installing jekyll.
> The PDF build only requires things installed by ivy (via JRuby) and we already have some
PDFBox based code in ReducePDFSize.java that operates on this PDF every time it's run -- so
if we can find a way to do similar checks using the PDFBox API we could catch these broken
links faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

