Handle w:instrText for DOCX to HTML conversion #3389

Closed
trapias opened this issue Jan 27, 2017 · 6 comments

Comments

@trapias

trapias commented Jan 27, 2017

Hello,
There appears to be a problem with hyperlinks when converting from DOCX to HTML, as described in this discussion on Google Groups:

  • if hyperlinks are written with <w:hyperlink>, Pandoc recognizes them and correctly converts them into an "a href" tag, but
  • if hyperlinks are written with <w:instrText>, it does not, and the hyperlinks are missing from the resulting HTML.

I do not know why DOCX documents are written with one tag or the other, but as John suggests in his answer, it looks like Pandoc does not recognize the latter form.

@jgm
Owner

jgm commented Jan 27, 2017

It looks like the form is just

<w:instrText>HYPERLINK "http://example.com"</w:instrText>

So this should be simple to support.
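
To illustrate the form above, here is a minimal Python sketch that pulls the URL out of such an instruction string. It is not Pandoc's implementation (Pandoc is written in Haskell); the function name `hyperlink_target` is hypothetical, and the sketch assumes the target is the first double-quoted argument directly after the `HYPERLINK` keyword. Instructions that put a switch before the quoted argument are deliberately not matched.

```python
import re

# Matches: optional whitespace, the HYPERLINK keyword, then a double-quoted target.
# Assumption: the target contains no embedded double quotes.
_HYPERLINK = re.compile(r'^\s*HYPERLINK\s+"([^"]*)"', re.IGNORECASE)


def hyperlink_target(instr_text):
    """Return the quoted target of a HYPERLINK instruction, or None.

    Hypothetical helper for illustration only; it ignores any switches
    that follow the target and rejects every other field type.
    """
    match = _HYPERLINK.match(instr_text)
    return match.group(1) if match else None
```

Calling `hyperlink_target('HYPERLINK "http://example.com"')` returns the URL, while an unrelated instruction such as `' DATE '` yields `None`.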

@jgm
Owner

jgm commented Feb 4, 2017

It's actually more complex than I thought. This whole contraption has to occur in a structure like this:

<w:r>
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  <w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>
<w:r>
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
  <w:t>12/31/2005</w:t>
</w:r>
<w:r>
  <w:fldChar w:fldCharType="end"/>
</w:r>

See http://officeopenxml.com/WPfields.php
Field types are documented here: http://officeopenxml.com/WPfieldInstructions.php
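
The begin / separate / end structure above can be read with a small state machine that walks the runs in order. The following Python sketch shows the idea; it is an illustration under stated assumptions, not how Pandoc's docx reader is written. The names `collect_fields` and `SAMPLE` are hypothetical, the runs are assumed to sit inside a single `w:p` element that declares the WordprocessingML namespace, and nested fields are not handled.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as bound to the "w:" prefix in docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def collect_fields(paragraph_xml):
    """Return a list of (instruction, displayed_text) pairs for simple fields.

    Assumption: fields are not nested; a new "begin" simply restarts collection.
    """
    root = ET.fromstring(paragraph_xml)
    fields = []
    state = "outside"          # one of: outside, instruction, result
    instr, shown = [], []
    for run in root.iter(f"{{{W}}}r"):
        for child in run:
            if child.tag == f"{{{W}}}fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    state, instr, shown = "instruction", [], []
                elif kind == "separate" and state == "instruction":
                    state = "result"
                elif kind == "end" and state != "outside":
                    fields.append(("".join(instr).strip(), "".join(shown)))
                    state = "outside"
            elif child.tag == f"{{{W}}}instrText" and state == "instruction":
                # Instruction text may be split across several runs.
                instr.append(child.text or "")
            elif child.tag == f"{{{W}}}t" and state == "result":
                # Text between "separate" and "end" is the cached field result.
                shown.append(child.text or "")
    return fields


# The example from this comment, wrapped in a paragraph for parsing.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText xml:space="preserve"> DATE </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="separate"/></w:r>'
    '<w:r><w:t>12/31/2005</w:t></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    '</w:p>'
)

print(collect_fields(SAMPLE))
```

For a hyperlink field, the first element of each pair would hold the `HYPERLINK "..."` instruction and the second the link text, which is the pairing a converter needs in order to emit an anchor.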

@jkr
Collaborator

jkr commented Jan 16, 2018

@jgm and @trapias -- just wanted to let you know that, a year later, I finally addressed this. Definitely not a good first issue -- it involved accumulating runs in state and introducing a new module with its own little parsec parser. But there's now a framework for handling further fldChar/instrText directives in docx documents.

@trapias
Author

trapias commented Jan 16, 2018

@jkr great, thank you! Will make a test ASAP 👊

@jkr
Collaborator

jkr commented Jan 16, 2018

Great -- if you do create a test document, can you post it? I'd prefer to replace the one that's up there now, but recent versions of Word don't seem to be able to produce links as fields.

@trapias
Author

trapias commented Jan 16, 2018

@jkr for sure. I can't promise I'll manage it within the next 10 to 15 days, but I will get back to you!
