Module document_formatting

Shared, language-free document-formatting concept ontology.

Documents in different container formats (Markdown, HTML, and, via the issues that build on this substrate, PDF and DOCX) express the same formatting concepts with different surface syntax. A Markdown **bold**, an HTML <strong>bold</strong>, and a DOCX run with the <w:b/> property all denote one language-free strong concept.

This module seeds that concept set into a LinkNetwork with per-format syntax mappings, and provides data-driven resolution and reconstruction so the same concept link round-trips across formats. Each per-format mapping is stored as a small template string: the {} placeholder marks the formatted content, and {name} placeholders mark either named attributes ({href}, {src}, {lang}) or the heading level ({markers} for the Markdown run of # characters, {level} for the digit in an HTML heading tag).
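The template convention described above can be illustrated with a minimal, self-contained sketch. It is not this module's implementation: the helper names fill_template and resolve_content and the template strings in main are assumptions made up for the example, and the module's actual seeded templates and function signatures may differ.

```rust
use std::collections::HashMap;

/// Hypothetical helper: fills a template string.
/// `{}` is replaced by `content`; `{name}` is replaced by `attrs["name"]`.
/// Returns None when a named placeholder has no value or a `{` is never closed.
fn fill_template(template: &str, content: &str, attrs: &HashMap<&str, String>) -> Option<String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        let close = after.find('}')?;
        let name = &after[..close];
        if name.is_empty() {
            out.push_str(content); // bare `{}` marks the formatted content
        } else {
            out.push_str(attrs.get(name)?); // `{name}` marks a named attribute
        }
        rest = &after[close + 1..];
    }
    out.push_str(rest);
    Some(out)
}

/// Hypothetical helper: recovers the content from a fragment, for templates
/// whose only placeholder is a single `{}` (for example `**{}**`).
fn resolve_content<'a>(template: &str, fragment: &'a str) -> Option<&'a str> {
    let (prefix, suffix) = template.split_once("{}")?;
    fragment.strip_prefix(prefix)?.strip_suffix(suffix)
}

fn main() {
    // Assumed templates for the "strong" concept in two formats.
    let md_strong = "**{}**";
    let html_strong = "<strong>{}</strong>";

    // Resolve a Markdown fragment, then reconstruct it as HTML.
    let no_attrs = HashMap::new();
    let content = resolve_content(md_strong, "**bold**").expect("fragment matches template");
    let html = fill_template(html_strong, content, &no_attrs).expect("template fills");
    println!("{html}");

    // Named attribute and heading-level placeholders.
    let mut attrs = HashMap::new();
    attrs.insert("href", "https://example.com".to_string());
    attrs.insert("level", "2".to_string());
    attrs.insert("markers", "##".to_string());
    println!("{:?}", fill_template("<a href=\"{href}\">{}</a>", "docs", &attrs));
    println!("{:?}", fill_template("<h{level}>{}</h{level}>", "Title", &attrs));
    println!("{:?}", fill_template("{markers} {}", "Title", &attrs));
}
```

Because the mapping is plain data, adding a format in this sketch means adding template strings rather than new parsing code, which is the sense in which resolution and reconstruction are data-driven.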

Structs§

DocumentFormatInstance: A concept instance ready to be rendered into a target format.
DocumentFormatMatch: A formatting fragment resolved to its language-free concept.
DocumentFormattingSeedReport: Summary returned after seeding the document-formatting concept set.
FormattingDocument: A language-free document in the concept layer.

Enums§

BlockNode: A block in the language-free concept layer.
InlineNode: An inline span in the language-free concept layer.

Constants§

CROSS_FORMAT_CONCEPTS: The shared formatting concepts considered when reporting cross-format fidelity. Every format profile classifies each of these as either natively supported or carrying a documented lossy fallback.
DOCUMENT_FORMATS: The ordered set of document formats the cross-format reconstruction layer supports, used as both source and target of reconstruct_text_as.

Functions§

canonical_document_format: Canonicalizes a format/language label to one of DOCUMENT_FORMATS.
document_format_profile: Returns the capability profile for a document format, or None when the format is not one of the cross-format reconstruction targets.
docx_package_is_recognized: Whether bytes is a stored-profile .docx package carrying at least one recognized block in its word/document.xml part.
docx_profile_is_recognized: Whether text is OOXML document.xml carrying at least one recognized block.
parse_docx_document: Parses OOXML word/document.xml markup in the documented profile back into the language-free concept layer.
parse_docx_package: Parses the word/document.xml part of a stored-profile .docx package back into the language-free concept layer.
parse_markup_document: Parses text written in language into the language-free concept layer.
parse_pdf_document: Parses a PDF written in the documented text profile back into the language-free concept layer.
pdf_profile_is_recognized: Whether text is a PDF in this profile carrying at least one recognized block.
render_docx_document: Renders a language-free FormattingDocument into OOXML word/document.xml markup in the documented profile.
render_docx_package: Renders a language-free FormattingDocument into a valid .docx (OPC ZIP) package in the documented stored-entry profile.
render_pdf_document: Renders a language-free FormattingDocument into a valid, uncompressed PDF in the documented text profile.