Shared, language-free document-formatting concept ontology.
Documents in different container formats (Markdown, HTML, and — through the
issues that build on this substrate — PDF and DOCX) express the same
formatting concepts with different surface syntax. A Markdown **bold**, an
HTML <strong>bold</strong>, and a DOCX run with the <w:b/> property all
denote one language-free strong concept.
This module seeds that concept set into a LinkNetwork with per-format
syntax mappings, and provides data-driven resolution and reconstruction so
the same concept link round-trips across formats. The per-format mapping
is stored as a small template string whose {} placeholder marks the
formatted content and whose {name} placeholders mark named attributes
({href}, {src}, {lang}) or the heading level ({markers} for the
Markdown # run, {level} for the HTML digit).
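The placeholder scheme described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `render_template` and `resolve_content` are hypothetical helpers, and the template strings are assumed examples that only follow the `{}` / `{name}` convention stated here.

```rust
use std::collections::HashMap;

/// Hypothetical helper: fills `{}` with the formatted content and `{name}`
/// with the matching named attribute. Unknown placeholders are kept verbatim.
fn render_template(template: &str, content: &str, attrs: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        match after.find('}') {
            Some(close) => {
                let name = &after[..close];
                if name.is_empty() {
                    // `{}` marks the formatted content.
                    out.push_str(content);
                } else if let Some(value) = attrs.get(name) {
                    // `{name}` marks a named attribute or heading level.
                    out.push_str(value);
                } else {
                    out.push('{');
                    out.push_str(name);
                    out.push('}');
                }
                rest = &after[close + 1..];
            }
            None => {
                // Unbalanced brace: copy the remainder unchanged and stop.
                out.push_str(&rest[open..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical helper: resolves a fragment against a template that has only
/// a `{}` placeholder, returning the formatted content when the syntax matches.
fn resolve_content<'a>(template: &str, fragment: &'a str) -> Option<&'a str> {
    let (prefix, suffix) = template.split_once("{}")?;
    fragment.strip_prefix(prefix)?.strip_suffix(suffix)
}

fn main() {
    // Assumed per-format templates for the `strong` concept.
    let markdown_strong = "**{}**";
    let html_strong = "<strong>{}</strong>";

    // Resolve from Markdown, then reconstruct as HTML.
    let no_attrs: HashMap<&str, &str> = HashMap::new();
    if let Some(content) = resolve_content(markdown_strong, "**bold**") {
        println!("{}", render_template(html_strong, content, &no_attrs));
    }

    // Heading templates use `{markers}` (Markdown) and `{level}` (HTML).
    let mut heading: HashMap<&str, &str> = HashMap::new();
    heading.insert("markers", "##");
    heading.insert("level", "2");
    println!("{}", render_template("{markers} {}", "Title", &heading));
    println!("{}", render_template("<h{level}>{}</h{level}>", "Title", &heading));
}
```

Keeping the mapping as data rather than per-format code means a new format only needs a new template per concept, which is the round-trip property the paragraph above describes.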
Structs
- DocumentFormatInstance - A concept instance ready to be rendered into a target format.
- DocumentFormatMatch - A formatting fragment resolved to its language-free concept.
- DocumentFormattingSeedReport - Summary returned after seeding the document-formatting concept set.
- FormattingDocument - A language-free document in the concept layer.
Enums
- BlockNode - A block in the language-free concept layer.
- InlineNode - An inline span in the language-free concept layer.
Constants
- CROSS_FORMAT_CONCEPTS - The shared formatting concepts considered when reporting cross-format fidelity. Every format profile classifies each of these as either natively supported or carrying a documented lossy fallback.
- DOCUMENT_FORMATS - The ordered set of document formats the cross-format reconstruction layer supports, used as both source and target of reconstruct_text_as.
Functions
- canonical_document_format - Canonicalizes a format/language label to one of DOCUMENT_FORMATS.
- document_format_profile - Returns the capability profile for a document format, or None when the format is not one of the cross-format reconstruction targets.
- docx_package_is_recognized - Whether bytes is a stored-profile .docx package carrying at least one recognized block in its word/document.xml part.
- docx_profile_is_recognized - Whether text is OOXML document.xml carrying at least one recognized block.
- parse_docx_document - Parses OOXML word/document.xml markup in the documented profile back into the language-free concept layer.
- parse_docx_package - Parses the word/document.xml part of a stored-profile .docx package back into the language-free concept layer.
- parse_markup_document - Parses text written in language into the language-free concept layer.
- parse_pdf_document - Parses a PDF written in the documented text profile back into the language-free concept layer.
- pdf_profile_is_recognized - Whether text is a PDF in this profile carrying at least one recognized block.
- render_docx_document - Renders a language-free FormattingDocument into OOXML word/document.xml markup in the documented profile.
- render_docx_package - Renders a language-free FormattingDocument into a valid .docx (OPC ZIP) package in the documented stored-entry profile.
- render_pdf_document - Renders a language-free FormattingDocument into a valid, uncompressed PDF in the documented text profile.