issues with &nbsp; and other entities in staff EAD and PDF export #1779

Open
sdm7g opened this issue Jan 17, 2020 · 1 comment

sdm7g commented Jan 17, 2020

Some of these issues became apparent when testing patch #1745 (for issue #1720).

&nbsp; in some places will generate errors in PDF export, but not always.

Expected Behavior

[1] PDF export should succeed from the Staff app as well as the Public app.
[2] Other named character entities should be encoded properly in EAD exports by being converted into numeric character references, and only lone ampersands should be escaped.
(i.e. the pattern in the inner_xml function below needs to be more specific so that it doesn't match entities.)
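
A minimal sketch of what a "more specific pattern" could look like, assuming the goal is to escape an ampersand only when it does not already start a named, decimal, or hexadecimal reference. The helper name escape_lone_ampersands and the constant ENTITY_TAIL are hypothetical and are not taken from the ArchivesSpace code base:

```ruby
# Hypothetical helper, not ArchivesSpace code.
# Matches the part of a reference that follows the ampersand:
# a name (nbsp;), a decimal reference (#160;), or a hex reference (#xA0;).
ENTITY_TAIL = /(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);/

# Escape only ampersands that are NOT followed by something that
# looks like an entity or character reference.
def escape_lone_ampersands(text)
  text.gsub(/&(?!#{ENTITY_TAIL})/, '&amp;')
end

puts escape_lone_ampersands('Smith & Sons, p&atilde;p&eacute;rs')
```

The negative lookahead leaves existing references alone, so running the helper twice gives the same result as running it once. It cannot tell a real entity from text that merely looks like one (for example "&T;"), so it is a heuristic rather than a full fix.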

Current Behavior

Depending on where exactly the entities are, this function will replace the ampersand with an &amp; entity:
https://github.com/archivesspace/archivesspace/blob/master/backend/app/converters/lib/xml_sax.rb#L221

After inserting several instances of &nbsp; for testing, this EAD fragment is output in one place:
<physdesc id="aspace_b818d732f87bd233ec68c8d8d764afa3"><extent altrender="materialtype spaceoccupied">150 items</extent>&nbsp;<extent altrender="carrier">1 Hollinger box</extent>&nbsp;<dimensions>less than 1 linear foot</dimensions></physdesc>

This causes an error when producing the PDF, because the nbsp entity is undefined.
However, elsewhere it is serialized as:

<physdesc> <dimensions id="aspace_d4005b5f554e4603d55436ef91de7fc4">less than &amp;nbsp; 1 &amp;nbsp; linear foot</dimensions> </physdesc>

And adding some arbitrary named character entities to the title also produces escaped ampersands:
<unittitle>M&amp;atilde;ry &amp;Atilde;. Wilson p&amp;atilde;p&amp;eacute;rs</unittitle>
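
For illustration only: output of this shape is what a blanket ampersand escape produces when the input already contains entity references. The lambda below is a hypothetical stand-in and is not a claim about which ArchivesSpace code path does the escaping:

```ruby
# Hypothetical stand-in for an escaping step that treats every
# ampersand as literal text, including those that start an entity.
naive_escape = ->(text) { text.gsub('&', '&amp;') }

title = 'M&atilde;ry &Atilde;. Wilson p&atilde;p&eacute;rs'
escaped_title = naive_escape.call(title)

puts escaped_title
```

The result is well-formed XML, but it now displays the literal text "&atilde;" instead of the intended character.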

Note that all of those examples seem to display properly in the PUI (they are known entities in HTML), and after the #1745 fix they seem to work properly in the PUI PDF download.
They also display properly in the Staff view (HTML again) and only break when output as EAD/XML or PDF.

Possible Solution

  1. More specific pattern match in inner_xml
  2. Translate other entities with HTMLEntities.new.decode, as is done in #1745 (Fix nbsp issue with PUI PDF generation by transforming nbsp to ASCII equivalent 160)

But first we need to figure out why some ampersands are escaped and others are not.
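
As a rough sketch of the second option, named entities could be rewritten as numeric character references, which XML parsers accept without a DTD. The example uses only the standard library and a tiny hand-written table. The names NAMED_TO_CODEPOINT, XML_PREDEFINED, and named_to_numeric are hypothetical, and a real implementation would need the full HTML entity table (for instance through the htmlentities gem already used in #1745):

```ruby
# Illustrative subset only, not a complete entity table.
NAMED_TO_CODEPOINT = {
  'nbsp'   => 160, # U+00A0 no-break space
  'atilde' => 227, # U+00E3
  'Atilde' => 195, # U+00C3
  'eacute' => 233  # U+00E9
}.freeze

# The five entities that XML itself predefines must be left as they are.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

# Replace known named entities with decimal character references.
# Unknown names are left untouched so that nothing is silently lost.
def named_to_numeric(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    next match if XML_PREDEFINED.include?(name)

    code = NAMED_TO_CODEPOINT[name]
    code ? format('&#%d;', code) : match
  end
end

puts named_to_numeric('M&atilde;ry &Atilde;. Wilson p&atilde;p&eacute;rs')
```

This step would have to run before any ampersand escaping. Otherwise the entities have already become &amp;name; and no longer match.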

Steps to Reproduce (for bugs)

Context

Current behavior is inconsistent between Staff/PUI display and PDF/EAD serialization, and between Staff PDF and Public PDF exports.

Your Environment

  • Version used:
  • Environment name and version (e.g. Chrome 39, node.js 5.4):
  • Operating System and version (desktop or mobile):
  • Link to your project:

sdm7g commented Jan 17, 2020

My writeup of what was happening is correct, but I shouldn't have tried to diagnose the cause so late at night: the function I pointed to is in the EAD converter, not the exporter, and the exporter is obviously where the problem is. The description of the symptoms still stands.
