XML Tag API

The Imprint engine comes with a complete set of processors for the tags specified in the XML Template Specification. However, additional tags may be necessary for highly customized applications, so an API exists for defining and registering new tags. The API is defined in the imprint.core.tags module. Example usage can be found in the Writing Custom Tags tutorial.

Tag Descriptors

The tag API revolves around the TagDescriptor class. The class can be extended directly, or instantiated through a delegate object that fulfills the necessary duck-type API. Objects contain a set of attributes and two callbacks that define how to handle XML tags of a given type. All the elements are optional and have sensible default values.

Any registered object will be viewed through TagDescriptor.wrap, so it is not necessary to extend or instantiate TagDescriptor to create a working tag descriptor.
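
As a rough illustration of the duck-type API, a delegate can be an ordinary object that supplies only the pieces it cares about. The sketch below is self-contained and does not import Imprint. The NoteTag class, the attribute names kind and author, and the SimpleNamespace standing in for the engine state are all hypothetical, and the way notes are stored on the state is an assumption made for the example.

```python
from types import SimpleNamespace


class NoteTag:
    """Hypothetical delegate for a <note> tag; not part of Imprint."""

    # Attribute names are invented for this example.
    required = ('kind',)
    optional = {'author': 'anonymous'}

    def start(self, state, name, attr):
        # Remember that a note is open by adding a custom state attribute.
        state.open_note = attr['kind']

    def end(self, state, name, attr):
        # Record the finished note, then delete the marker rather than
        # setting it to None.
        state.notes.append((state.open_note, attr.get('author')))
        del state.open_note


# Stand-in for the engine state, used only to exercise the callbacks.
state = SimpleNamespace(notes=[])
tag = NoteTag()
attr = {'kind': 'warning', 'author': 'editor'}
tag.start(state, 'note', attr)
tag.end(state, 'note', attr)
print(state.notes)
```

An object of this kind could then be registered under a tag name, at which point the engine would view it through TagDescriptor.wrap.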

Errors

Tag descriptors may raise any type of error they deem necessary in their start and end methods. Most classes of errors will be logged and cause the application to abort. However, two special classes of errors will not cause a fatal crash:

  1. KnownError is used to flag known conditions that can be handled gracefully by the tag.
  2. OSError. Specifically, the FileNotFoundError and PermissionError subclasses are deemed to be “known errors”. If they represent a fatal condition, they should be wrapped in another exception type.

Any plugins with a dynamic Data Configuration will generally receive an alt-text placeholder where the content would normally go instead of completely aborting.

exception imprint.core.KnownError

A custom exception class that is used by the engine to indicate that a tag or plugin handler exited for a known reason.

In cases where this exception is logged, the message is printed without a stack trace.
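
The distinction between fatal and known errors can be pictured with a small stand-in. The sketch below does not use Imprint: the KnownError class defined here only mimics the real exception, and run_callback and its alt-text return value are hypothetical, since the engine's actual handling code is not shown in this document.

```python
class KnownError(Exception):
    """Local stand-in for imprint.core.KnownError."""


# Error classes described above as non-fatal.
NON_FATAL = (KnownError, FileNotFoundError, PermissionError)


def run_callback(callback, alt_text='[content unavailable]'):
    """Run a tag callback, substituting alt text for known errors."""
    try:
        return callback()
    except NON_FATAL as exc:
        # Known condition: report the message only, no stack trace.
        print(f'known error: {exc}')
        return alt_text
    # Any other exception propagates and would abort processing.


def missing_file():
    raise FileNotFoundError('data.csv not found')


def broken():
    raise ValueError('unexpected')


result = run_callback(missing_file)

fatal = False
try:
    run_callback(broken)
except ValueError:
    fatal = True
```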

Configuration

Tags have two types of configuration available to them. Static configuration for a given XML Template is provided through the tag attributes in the XML file. Dynamic configuration through the IDC File can be enabled to provide per-document fine-tuning.

XML Attributes

XML attributes are supplied to the start and end methods of a TagDescriptor as the third argument, after the engine state and the tag name. The inputs are presented to both methods as a vanilla dict. The dictionary is meant to be treated as read-only, but this is not enforced, so start can technically modify what end sees. The dictionary is filtered to exclude any attributes that are not listed in the required and optional elements of the TagDescriptor.
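
A minimal sketch of the filtering described above, assuming that optional attributes missing from the XML are filled in with their defaults and that a missing required attribute is reported as an error. The function filter_attributes is hypothetical and is not the engine's implementation.

```python
def filter_attributes(raw, required, optional):
    """Keep only declared attributes (hypothetical helper)."""
    missing = [key for key in required if key not in raw]
    if missing:
        raise KeyError(f'missing required attributes: {missing}')
    # Start from the optional defaults, then overlay what the XML supplied.
    attr = dict(optional)
    for key, value in raw.items():
        if key in required or key in optional:
            attr[key] = value
    return attr


# 'colour' is not declared, so it is dropped from the result.
raw = {'id': 'fig1', 'width': '3in', 'colour': 'red'}
attr = filter_attributes(
    raw, required=('id',), optional={'width': None, 'style': 'Figure'})
print(attr)
```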

Data Configuration

For some types of content, static configuration is not enough. To allow per-document configurations, a TagDescriptor must define a non-None data_config attribute. This attribute gives the name of the dictionary to extract from the IDC File.

start and end methods of a TagDescriptor with the data_config attribute set will receive an additional input argument containing the Data Configuration loaded from the IDC File.

The data configuration can override some of the static XML Attributes of a tag. For built-in tags, the XML Template Specification notes which attributes can be overridden. Built-in tags that support dynamic configuration are <figure>, <table> and <string>.

All built-in tags that support dynamic configuration also support a type of plugin, but this is not a requirement for custom tags.

References

A TagDescriptor is referenceable if it has a non-None reference. A reference made to a tag will be substituted by the appropriate reference text. By default reference tags have the target tag name with “-ref” appended: <figure-ref> references <figure>, <table-ref> references <table>. A notable exception is <segment-ref>, which references paragraphs (<par> tags), but only ones that have a heading style.

References are usually identified by a required id attribute. Segments can also be identified by the title of the segment, which is the aggressively trimmed collection of all the text in the paragraph. For example, the title of the following XML snippet would be 'Example Heading':

<par style="Heading 3">
    <run style="Default Paragraph Font">
        Example
        Heading
    </run>
</par>

<segment-ref> tags can therefore identify their target with either an id or a title attribute. User-defined tags can implement their own customized rules for identifying targets.
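
The title computation for the snippet above can be approximated as follows. This is a sketch under the assumption that "aggressively trimmed" means joining all nested text and collapsing every run of whitespace into a single space. The helper segment_title is hypothetical, and the standard library parser is used here only for illustration.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<par style="Heading 3">
    <run style="Default Paragraph Font">
        Example
        Heading
    </run>
</par>
"""


def segment_title(element):
    """Join all nested text and collapse runs of whitespace."""
    return ' '.join(''.join(element.itertext()).split())


title = segment_title(ET.fromstring(SNIPPET.strip()))
print(title)  # Example Heading
```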

Roles

For the purpose of creating references, any tag may impersonate, or play the role of, any other tag using a special role attribute. This attribute is implicitly optional for every tag. It is interpreted directly by the parsers in the Engine Layer to determine the type of reference that a tag will represent.

For example, a <table> tag (or any other tag for that matter), which has role="figure" must be referenced by a <figure-ref> tag, not a <table-ref> tag, in the XML Template. That table will be a figure for the purposes of the document in question.

Any arbitrary tag can be referenced the same way with the appropriate role. Usually, such a referenceable tag will be styled appropriately, and will have the headings, captions, etc. appropriate for its role rather than its nominal tag.

A specific case is arbitrary tags that have a <par> role. Such tags are automatically referenceable by <segment-ref>. Their entire contents will be treated as the title of the heading, so the par role must be used carefully.
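
The default naming rules for roles can be summarized in a few lines. The helpers computed_role and reference_tag below are hypothetical and only restate the rules given in this section. Custom reference descriptors may register other names, and the additional heading-style condition that applies to real <par> tags is not modeled.

```python
def computed_role(name, attr):
    """The role attribute, when present, overrides the tag name."""
    return attr.get('role', name)


def reference_tag(name, attr):
    """Name of the tag used to reference a target, by default."""
    role = computed_role(name, attr)
    # Paragraph-like targets are referenced through <segment-ref>.
    return 'segment-ref' if role == 'par' else role + '-ref'


plain_table = reference_tag('table', {})
table_as_figure = reference_tag('table', {'role': 'figure'})
custom_as_par = reference_tag('note', {'role': 'par'})
```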

Registering New Tags

Once a TagDescriptor or a delegate object has been constructed, there are two main ways to get Imprint to use the descriptor for actual tag processing.

Via Configuration

In the normal course of things, Imprint will not automatically import unspecified user-defined modules. To let it know where to find tag extensions, add them by name or by reference to the mapping under the tags keyword of the IPC File. This will automatically import all the necessary modules and register each custom descriptor under the requested tag name.

Programmatically

Under the hood, tags are registered with the Imprint core simply by adding them to tag_registry:

tag_registry[name] = descriptor

The registry is a special mapping that ensures that name is a string not representing an existing tag. While it is not possible to remove or overwrite existing tags, the same descriptor can be registered under multiple names.

This method is useful mostly to users wishing to write a custom driver program for the engine. Under normal circumstances, the configuration solution will be more suitable.

Engine State

Both callbacks of a TagDescriptor accept an EngineState object as their first argument, which supports stateful tag processing. The engine state provides a mutable container for arbitrary attributes. Each TagDescriptor can add, remove and modify attributes of the state object to communicate with itself, the engine, and other tags.

As a rule, objects should prefer to delete state attributes rather than setting them to None. This meshes well with the fact that EngineState provides a containment check. For example, to check if the parser is in the middle of a run of text, descriptors should check

if 'run' in state: ...
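
The reason for preferring deletion can be seen with a tiny stand-in for the state object. The State class below is hypothetical and only imitates the containment check. It is not the real EngineState.

```python
class State:
    """Minimal stand-in: arbitrary attributes plus an `in` check."""

    def __contains__(self, name):
        return hasattr(self, name)


state = State()

state.run = object()
inside = 'run' in state       # attribute exists

del state.run
outside = 'run' in state      # attribute was deleted

state.run = None
misleading = 'run' in state   # still present, although set to None
```

Setting the attribute to None leaves it visible to the containment check, which is why deleting it gives the clearer signal.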

The built-in tags and the engine use a set of attributes and methods to operate properly. Modifying these predefined attributes in a way other than explicitly documented will almost inevitably lead to unexpected behavior. Properties are used instead of simple attributes in a few cases to provide sanity checks for the supported modifications. Custom tags can add, remove and modify any additional attributes they choose. The full list of built-in attributes is available in the EngineState documentation.

The API

The imprint.core package contains the Imprint Engine Layer. The tags and state modules implement most of the functionality useful to end-users through the public XML Tag API. The parsers and utilities contain the Internal API.

The imprint.core.tags module implements the base XML Tag API, as well as all the predefined Built-in Tag Descriptors and Reference Descriptors.

The following members are used to construct and register new tags:

imprint.core.tags.tag_registry = {}

A limited mapping type that contains all the currently registered tag descriptors.

Registering a new descriptor is as easy as doing:

tag_registry[name] = descriptor

The registry is a restricted mapping type that supports adding new elements only if they are not already registered. Existing elements can not be deleted. Deletion operations will raise a TypeError, while overwriting existing keys will raise a KeyError. Aside from that, all operations supported by dict are allowed (including things like update).
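
The restrictions above can be imitated with a small add-only mapping. AddOnlyRegistry is a hypothetical stand-in written for this illustration and is not Imprint's registry type. The TypeError for non-string names is an assumption, since the error raised in that case is not stated here.

```python
from collections.abc import MutableMapping


class AddOnlyRegistry(MutableMapping):
    """Hypothetical mapping that allows additions only."""

    def __init__(self):
        self._items = {}

    def __getitem__(self, name):
        return self._items[name]

    def __setitem__(self, name, descriptor):
        if not isinstance(name, str):
            # Assumed error type for non-string names.
            raise TypeError('tag names must be strings')
        if name in self._items:
            raise KeyError(f'tag {name!r} is already registered')
        self._items[name] = descriptor

    def __delitem__(self, name):
        raise TypeError('registered tags can not be removed')

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


registry = AddOnlyRegistry()
descriptor = object()
registry['note'] = descriptor
# update goes through __setitem__, so the same checks apply.
registry.update({'aside': descriptor})
```

Building on MutableMapping rather than dict means that update and similar methods are routed through the checked __setitem__.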

Any tag that is referenceable by design (has a valid reference attribute) will have the ReferenceDescriptor’s registration hook invoked after the tag-proper is registered.

The built-in tags are registered when the current module is imported.

imprint.core.tags.referable_tags

A convenience property to compute a list of all the keys that have a non-None reference attribute in their values.

class imprint.core.tags.TagDescriptor(delegate)

The basis of the tag API.

Instances of this class contain the information required to process a custom tag. They must contain all of the attributes listed below, with the expected types. The elements in tag_registry may be delegate objects that supply only part of the attribute set. In that case, they are wrapped in a proxy as needed at runtime, never up-front. The reason for this is twofold:

  1. There may be stateful objects registered for multiple tags, and wrapping in a proxy will not allow the tags to share state. This would not be a problem, except it would be unexpected behavior.
  2. Some of the attributes may be dynamic properties (or other descriptors). Fixing the value once would completely defeat such behavior.

Creating an occasional wrapper around a delegate is not expected to be particularly expensive, even if it had to be done for every tag encountered in the XML file. On the other hand, it allows for some very flexible behaviors. At the same time, very few instances of wrapping should occur, since most tags will be implemented by extending this class and implementing it properly. The wrap method ensures that all extensions are passed through as-is.

All the Built-in Tag Descriptors are instances of children of this class.

content

A tri-state bool flag indicating whether the tag is allowed/expected to have textual content or not. The values are interpreted as follows:

None
The tag may not have any content. It must be of the form <tag/> or <tag><otherTag>...</otherTag></tag>. Anything else will raise a fatal error. If tags is set to False, only the former form is allowed.
False
The tag should not have content, but content will not raise an error. A warning will be raised instead.
True
The tag is expected to have content, but the content may be empty.

Any value is allowed in a delegate. If defined, the value will be converted to bool if it is not None. Defaults to None if not defined.

tags

A bool indicating whether or not nested tags are allowed within this one.

Any value is allowed in a delegate. If defined, the value will be converted to bool. Defaults to True if not defined.

required

A tuple of strings containing the name of required tag attributes. A tag encountered without all of these attributes will raise an error.

In a delegate, this may be a single string, an iterable of strings, None or simply omitted. Every element of an iterable must be a string, or a TypeError is raised immediately during construction. Defaults to an empty tuple if not defined.

optional

A dictionary mapping the names of optional attributes to their default values. Optional attributes are ones that are expected to be present in processing, but have sensible defaults that can be used, meaning that they do not have to be specified explicitly in the XML Template.

In a delegate, this may be any mapping type, an iterable of strings, a single string, None or simply omitted. In the case of an iterable or individual string, all the defaults will be None. Iterables and mapping keys must be strings, or a TypeError will be raised during construction. Defaults to an empty dict if not defined.
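
The delegate conversions described for required and optional can be sketched as two small functions. Both normalize_required and normalize_optional are hypothetical names. They follow the rules stated above but are not taken from the Imprint source.

```python
from collections.abc import Mapping


def normalize_required(value):
    """Convert a delegate's `required` into a tuple of strings."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    items = tuple(value)
    if not all(isinstance(item, str) for item in items):
        raise TypeError('required attribute names must be strings')
    return items


def normalize_optional(value):
    """Convert a delegate's `optional` into a dict of defaults."""
    if value is None:
        return {}
    if isinstance(value, str):
        return {value: None}
    if isinstance(value, Mapping):
        result = dict(value)
    else:
        # Plain iterable of names: every default is None.
        result = {name: None for name in value}
    if not all(isinstance(name, str) for name in result):
        raise TypeError('optional attribute names must be strings')
    return result
```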

data_config

The name of the attribute containing the data configuration name for the tag. This should only be provided for tags that require Data Configuration. If provided, the named attribute will automatically be added to the required sequence.

In a delegate, this object must be an instance of str or None. Defaults to None if not defined.

reference

A ReferenceDescriptor that is only present if this type of tag can be the target of a reference.

Examples of referenceable built-in tags are <figure>, <table> and sometimes <par>. Referenceable tags can have an optional role attribute that changes the type of reference they represent. See the Roles description for more information.

In a delegate, this object must be an instance of ReferenceDescriptor or None. Defaults to None if not defined.

__init__(delegate)

After completion, this instance has all of the required attributes defined in the delegate, wrapped in the required types.

A reference to the delegate object is not retained. This method can be invoked multiple times. It updates the current descriptor with the attributes of the delegate, leaving undefined attributes in the delegate untouched.

static __new__(cls, *args, **kwargs)

Create an empty instance, with all required attributes set to default values.

This method is provided to allow bypassing the default __init__ in child classes. All arguments are ignored.

end(state, name, attr, *args)

Each descriptor should provide a method with this signature to process closing tags.

If implemented, this method must accept the Engine State, a tag name and a dict of attributes. Normally, the tag name is ignored since a separate descriptor is registered for each tag. The attributes are the same as those passed to start, barring any modifications made in start.

Descriptors that have a non-None data_config attribute set will receive an additional argument containing the Data Configuration.

The default implementation just logs itself.

start(state, name, attr, *args)

Each descriptor should provide a method with this signature to process opening tags.

If implemented, this method must accept the Engine State, a tag name and a dict of attributes. Normally, the tag name is ignored since a separate descriptor is registered for each tag.

Descriptors that have a non-None data_config attribute set will receive an additional argument containing the Data Configuration.

The default implementation just logs itself.

classmethod wrap(desc)

Construct a proxy from the descriptor if it isn’t already one.

This method is provided so that when TagDescriptor objects are implemented properly up front, they do not need to be wrapped in an additional layer.

If the input is a delegate, the return value will always be of the type that this method was invoked on. However, the type check will always be done against the base TagDescriptor class.
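
The behavior of wrap can be illustrated with stand-in classes. Descriptor, Child and Delegate below are hypothetical and copy only a single attribute. The sketch shows the type check against the base class and the return type for delegates, not Imprint's actual constructor.

```python
class Descriptor:
    """Stand-in for TagDescriptor; copies `tags` from a delegate."""

    tags = True

    def __init__(self, delegate):
        if hasattr(delegate, 'tags'):
            self.tags = bool(delegate.tags)

    @classmethod
    def wrap(cls, desc):
        # The type check is against the base class, not against cls.
        if isinstance(desc, Descriptor):
            return desc
        return cls(desc)


class Child(Descriptor):
    pass


class Delegate:
    tags = 0


wrapped = Child.wrap(Delegate())   # delegate: wrapped in a Child
base = Descriptor(Delegate())
passed = Child.wrap(base)          # already a descriptor: returned as-is
```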

class imprint.core.tags.BuiltinTag(delegate=None, **kwargs)

Bases: imprint.core.tags.TagDescriptor

The base class of all the built-in TagDescriptor implementations.

Custom tag implementations are welcome to use this class as a base instead of a raw TagDescriptor.

__init__(delegate=None, **kwargs)

Updates the required fields with the keywords that are passed in.

If no delegate object (or None) is supplied, bypass the default constructor (see TagDescriptor.__new__). kwargs will override any defaults and attributes set by a delegate.

Built-in Tag Descriptors

The existing tag descriptors implement the XML Template Specification:

class imprint.core.tags.BreakTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <break> tag.

end(state, name, attr)

Insert a page break into the document.

class imprint.core.tags.ExprTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <expr> tag.

Warning

This descriptor uses eval to execute arbitrary code and assign it to a new keyword. Use with extreme caution!

end(state, name, attr)

Evaluate the expression found inside the tag, and add a new entry to the state’s keywords.

The content_stack will be popped.

All errors in importing and evaluation will be propagated up and will terminate the parser.

start(state, name, attr)

Begin a new expression.

This just pushes a new content_stack entry in the state. All content until the closing tag will be evaluated as a set of Python statements.

class imprint.core.tags.FigureTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <figure> tag.

end(state, name, attr, config)

Generate and insert a figure based on the selected handler.

Figures can appear in a run, a paragraph, or on their own.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.KwdTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <kwd> tag.

start(state, name, attr)

Find the value of the keyword in the state’s keywords and place it into the current content.

If the keyword is not found, a KeyError will be raised. If the tag has a format attribute, it is interpreted as a format_spec, and used to convert the value. If the attribute is not present, the value is converted with a simple call to str.
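
The conversion rule corresponds to the built-in format and str functions. In the sketch below, render_keyword and the sample keywords are hypothetical. Only the use of a format_spec when one is supplied, and str otherwise, is taken from the description above.

```python
def render_keyword(keywords, name, spec=None):
    """Look up a keyword and convert it to text."""
    value = keywords[name]  # raises KeyError for unknown keywords
    return str(value) if spec is None else format(value, spec)


keywords = {'Ratio': 0.12345, 'Title': 'Report', 'Count': 42}

plain = render_keyword(keywords, 'Title')
fixed = render_keyword(keywords, 'Ratio', '.2f')
padded = render_keyword(keywords, 'Count', '05d')
```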

class imprint.core.tags.LatexTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <latex> tag.

end(state, name, attr)

Convert the equation in the text of the current tag into an image using haggis.latex_util.render_latex, and insert the image into the parent tag.

The parent can be a run or a paragraph. If the requested run style does not match the current run, the current run will be interrupted by a run containing a new picture with the requested style, and resumed afterwards. If there is no run to begin with, a new run will be created, but not stored in the run attribute of the state.

Formulas are rendered at 96dpi in JPEG format by default.

start(state, name, attr)

Begin a new LaTeX formula.

Just push a new content_stack entry into state. All content until the closing tag is evaluated as a LaTeX document.

class imprint.core.tags.NTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <n> tag.

start(state, name, attr)

Add a line break to the current run.

If not inside a run, append the break to the last run. Make a new run only at the start of a paragraph. Ignore with a warning outside of a paragraph.

class imprint.core.tags.ParTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <par> tag.

check_list(state, attr)

Validate the list attribute that is found.

Log an error if the attribute is invalid, but do not terminate processing. The attribute is simply ignored if the list is neither numbered, bulleted nor continued.

Return the type normalized to a ListType, or None if not a list item. If the type is valid, and list-level is set, it is converted to an integer.

compute_paragraph_style(state, attr, list_type)

Compute the paragraph style based on whether an explicit style is set in the attributes, and whether or not the paragraph is a list.

  1. If an explicit style is requested, return it. Otherwise:
  2. If the paragraph is not a list, return the default paragraph style. Otherwise:
  3. If the previous paragraph is a list item in the same list (i.e., the current list-level attribute is non-zero), return the style of the previous paragraph. Otherwise:
  4. Return the default list item style.

Parameters:
  • state (EngineState) – The state is used to check for the previous item’s style in case #3.
  • attr (dict) – The tag attributes, used to check for an explicitly set style as well as for a style reset with list-level = 0.
  • list_type (ListType or None) – The type of the list, if a list at all, as returned by check_list.

end(state, name, attr)

Terminate the current paragraph.

See end_paragraph in EngineState.

start(state, name, attr)

Terminate any existing paragraph, flush all text and start a new paragraph.

If the new paragraph is a list item, add the necessary metadata to it.

Issue a warning if an existing paragraph is found.

class imprint.core.tags.RunTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <run> tag.

end(state, name, attr)

Place any remaining text into the current run, and remove run attribute of state.

start(state, name, attr)

Create a new run, ensuring that there is a paragraph to go with it.

Creating a run outside a paragraph raises a warning and creates a paragraph with a default style. See imprint.core.state.EngineState.new_run.

class imprint.core.tags.SectionTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <section> tag.

start(state, name, attr)

Begin a new section in the document, optionally altering the page orientation.

class imprint.core.tags.SkipTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <skip> tag.

class imprint.core.tags.StringTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <string> tag.

end(state, name, attr, config)

Generate a string based on the appropriate handler.

If the log_images key is set to a truthy value in state.keywords, the content will also be dumped to a file.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.TableTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <table> tag.

end(state, name, attr, config)

Generate and insert a table based on the selected handler.

The handler creates the table directly in the document (unlike for figures, where only the final product is inserted). Any error that occurs mid-processing leaves a stub table in the document in addition to the automatically-inserted alt-text.

Tables appear on their own, outside any paragraph or run, so if a table is nested in a run or paragraph, a warning will be issued. Any interrupted run or paragraph resumes after the table with their prior styles.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.TocTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <toc> tag.

end(state, name, attr)

Terminate and insert the TOC.

Gather any text that has been acquired into the heading, which will be a separate paragraph preceding the TOC.

If the TOC interrupted an existing paragraph, a new paragraph will be resumed with the same style as the original. If a run style is present as well, a run will be recreated too.

start(state, name, attr)

Create a new TOC.

Log a warning if the tag appears within a paragraph. Truncate the paragraph, and resume with the prior style. The same happens to the current run, if there is one.

class imprint.core.tags.ReferenceProcessor(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <figure-ref> and <table-ref> tags.

This processor is not registered explicitly. It gets added by all of the target tags that use it as part of their registration process. Registering this processor under a name that does not end in '-ref' will lead to a runtime error in resolve.

end(state, name, attr)

Insert a string with the specified reference into the current content.

classmethod get_instance()

Returns a quasi-singleton instance of the current class.

This instance is not exposed directly, but it is registered by the built-in referenceable tags.

resolve(state, name, attr)

Overridable operation for fetching and logging the reference that is to be inserted.

The default is to look up the reference by 'id' in the imprint.core.state.EngineState’s references.

Used by the default implementation of end.

class imprint.core.tags.SegmentRefProcessor(delegate=None, **kwargs)

Bases: imprint.core.tags.ReferenceProcessor

Implements the <segment-ref> tag.

This is a special case of ReferenceProcessor that allows access by both title and id. Its references always resolve to a <par> tag, or a tag playing that role.

resolve(state, name, attr)

Resolve a segment reference by either text or ID.

Either the id or title tag attribute must be present. If both are present, they must resolve to the same heading in the document or an error is raised.
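
The consistency rule can be sketched with a plain dictionary keyed by (role, attribute, key) tuples in place of the real reference map. The function resolve_segment, the use of 'par' as the role, the sample reference strings and the choice of ValueError are all assumptions made for this illustration.

```python
def resolve_segment(references, attr):
    """Look up a heading by id, by title, or by both."""
    found = []
    for attribute in ('id', 'title'):
        if attribute in attr:
            found.append(references['par', attribute, attr[attribute]])
    if not found:
        raise ValueError('either id or title is required')
    if len(found) == 2 and found[0] != found[1]:
        raise ValueError('id and title refer to different headings')
    return found[0]


# Hypothetical reference text for two headings.
intro = 'Section 1: Intro'
references = {
    ('par', 'id', 'intro'): intro,
    ('par', 'title', 'Intro'): intro,
    ('par', 'id', 'other'): 'Section 2: Other',
}

by_id = resolve_segment(references, {'id': 'intro'})
by_both = resolve_segment(references, {'id': 'intro', 'title': 'Intro'})
```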

Reference Descriptors

class imprint.core.tags.ReferenceDescriptor(prefix, identifiers='id')

Defines the process for creating References and using them through the appropriate tag.

References are made by processing the XML Template and mapping out any referenceable tags using the start and end methods. In the default implementation, the reference text is created by the make_reference method, invoked from end.

start and end return a boolean value to allow custom tags to be processed selectively. A return value of False from either method means that the specific instance of the tag being processed is not a valid reference target. Normally both methods always return True, but for the built-in <par> tag, for example, an exception must be made.

References are placed into the document by a special TagDescriptor, which is generally registered along with the parent tag that contains a ReferenceDescriptor using the register method.

Current references are purely textual, rather than having a dynamic field assigned to them. This is still a work in progress.

prefix

The prefix that normally gets prepended to the reference text. Used by make_reference to construct the output string. Extensions are welcome to ignore this attribute.

identifiers

A string or iterable of strings that lists the attributes that are used to identify the target for this reference type. The attribute may be either required or optional for the target tag, but it must be recognized either way. This attribute is used to check for attributes on tags with a non-default role. Defaults to 'id'.

end(state, name, role, attr)

Process the closing tag for a referenceable tag.

The default is to add the reference to the appropriate map in references by ID, based on the role, and log the operation. The attribute id is required.

The actual reference is created by make_reference.

Returns True if the tag is definitely a reference target, False if not.

identifiers

Ensure that identifiers is read-only.

make_reference(state, role, attr)

Returns a string referring to the specified tag in the specified role.

Keep in mind that the ReferenceDescriptor is selected based on the role, not necessarily the tag name. Therefore, the role argument should always be the “computed” role: the name of the tag should be overridden by the value of the attribute, if it was specified.

register(registry, name, descriptor)

A registration hook that is invoked when the parent TagDescriptor is registered.

The default implementation registers an additional TagDescriptor under the name name + '-ref', which replaces the <name-ref/> tag with the formatted reference. See ReferenceProcessor.

Parameters:
  • registry – The tag registry that the parent TagDescriptor is being inserted into. See tag_registry for details on the interface.
  • name (str) – The name under which the parent tag is being registered.
  • descriptor – The parent object being registered, not necessarily a TagDescriptor. The TagDescriptor.wrap method can be used to retrieve the corresponding TagDescriptor if necessary.

set_reference(state, role, attribute, key, reference, duplicates=False)

Check that the reference identified by key does not already exist and set it.

Duplicate reference targets cause an error, unless duplicates is True, in which case a warning is logged and the new value is discarded.
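
The duplicate policy can be sketched against a plain dictionary. The function store_reference below is a simplified, hypothetical stand-in that omits the state, role and attribute arguments of the real method. The KeyError is an assumed error type, as the exact exception is not named here.

```python
import logging

logger = logging.getLogger('reference-sketch')


def store_reference(references, key, reference, duplicates=False):
    """Set a reference unless the key already exists."""
    if key in references:
        if not duplicates:
            raise KeyError(f'duplicate reference target {key!r}')
        # Tolerated duplicate: warn and keep the original value.
        logger.warning('ignoring duplicate reference target %r', key)
        return
    references[key] = reference


references = {}
key = ('figure', 'id', 'overview')
store_reference(references, key, 'Figure 1')
store_reference(references, key, 'Figure 2', duplicates=True)
```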

start(state, name, role, attr)

Process the opening tag for a referenceable tag.

The default is to log the tag and its role.

Returns True if the tag is a potential reference target, False if it is definitely not.

class imprint.core.tags.SegmentReferenceDescriptor(prefix, identifiers='id')

Bases: imprint.core.tags.ReferenceDescriptor

Extension of ReferenceDescriptor to accumulate heading text and allow references through the title attribute.

Used by <par> tags to create heading references.

heading_style_name_pattern

A class-level regular expression for identifying the <par> tags that represent referenceable headings.

end(state, name, role, attr)

Create a dual reference based on the title and optional ID in addition to the default logging.

identifiers

Ensure that identifiers is read-only.

make_reference(state, role, attr, title)

Add the section heading to the usual reference text.

register(registry, name, descriptor)

Register a SegmentRefProcessor for the <segment-ref> tag.

This registration hook uses a fixed name, so can only be called once.

set_reference(state, role, attribute, key, reference, duplicates=False)

Check that the reference identified by key does not already exist and set it.

Duplicate reference targets cause an error, unless duplicates is True, in which case a warning is logged and the new value is discarded.

start(state, name, role, attr)

Start accumulating content in addition to the default logging.

If an actual <par> tag is encountered (as opposed to a tag playing that role), and the heading matches Heading \d+, the current heading is incremented in the state.

If any heading tag, or any tag with role="par" is encountered, a new reference will be created. Non-heading paragraphs with no explicit role are non-referenceable. A non-heading paragraph can be made referenceable by explicitly setting the role.

Keep in mind that the title for a segment reference is accumulated from all the text in the paragraph. Use carefully with non-default tags.

Utility Functions

imprint.core.tags.get_key(key, attr, data, sentinel=None, default=None)

Resolve the value of key with respect to attr, but with the option to override by the data configuration dictionary.

If the final value is sentinel, return default instead. Return default if key is missing entirely as well. Both attr and data must be mapping types that support a get method.
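
One way to read these rules is shown below. The function resolve_key is a hypothetical re-statement written for illustration and may differ from the real get_key in edge cases. In particular, it assumes that a key present in data always wins, even when its value is the sentinel.

```python
_MISSING = object()


def resolve_key(key, attr, data, sentinel=None, default=None):
    """Resolve key from attr, letting data override it."""
    value = data.get(key, _MISSING)
    if value is _MISSING:
        value = attr.get(key, _MISSING)
    # A missing key and a sentinel value both fall back to default.
    if value is _MISSING or value is sentinel:
        return default
    return value


attr = {'width': '3in', 'style': None}
data = {'width': '5cm'}

overridden = resolve_key('width', attr, data)
from_sentinel = resolve_key('style', attr, data, default='Normal')
from_missing = resolve_key('height', attr, data, default='1in')
```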

imprint.core.tags.get_size(attr, data, key='size')

Convert a string, number or pre-constructed size to a docx.shared.Length object, using get_key for value resolution.

Common options for key are 'width' and 'height'.

Valid unit suffixes are ", in, cm, mm, pt, emu, twip. When no units are specified, the default is inches (").
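
The unit handling can be sketched without python-docx by converting directly to English Metric Units (EMU), the integer unit on which docx.shared.Length is based. The function size_to_emu and its regular expression are hypothetical. The conversion factors are the standard EMU values (914400 per inch, 360000 per centimeter, 36000 per millimeter, 12700 per point, 635 per twip).

```python
import re

EMU_PER_UNIT = {
    '"': 914400, 'in': 914400, 'cm': 360000, 'mm': 36000,
    'pt': 12700, 'emu': 1, 'twip': 635,
}

# A number followed by an optional unit suffix.
_SIZE = re.compile(r'\s*([0-9]*\.?[0-9]+)\s*("|in|cm|mm|pt|emu|twip)?\s*$')


def size_to_emu(value):
    """Convert a number or a string with optional units to EMU."""
    match = _SIZE.match(str(value))
    if match is None:
        raise ValueError(f'unrecognized size: {value!r}')
    number, unit = match.groups()
    # Inches are assumed when no unit suffix is given.
    return round(float(number) * EMU_PER_UNIT[unit or 'in'])


one_inch = size_to_emu('1in')
bare_number = size_to_emu(2)
metric = size_to_emu('2.5cm')
```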

imprint.core.tags.get_handler(tag, id, attr, data, logger, key='handler')

Retrieve and load the handler for the specified attribute mapping and data configuration.

If the handler can not be found, a detailed exception is logged and a KnownError is raised.

imprint.core.tags.get_and_run_handler(tag, id, attr, data, logger, args, kwargs=None, key='handler')

Load and run the handler for the specified attribute mapping and data configuration.

If the handler can not be found, a detailed exception is logged, as with get_handler.

All exceptions that occur during execution are converted into KnownError.

imprint.core.tags.compute_styles(attr, data, defaults)

Compute the required styles based on attr and data configurations.

Style keys are taken from the keys of defaults, while values provide the fallback names used if the keys do not appear in either attr or data. Similarly named keys in data will override ones in attr.
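
The precedence described above amounts to a three-level lookup per style key. The function pick_styles and the sample style names are hypothetical. How the real compute_styles treats keys that are present but set to None is not covered by this sketch.

```python
def pick_styles(attr, data, defaults):
    """Choose each style from data, then attr, then the fallback."""
    styles = {}
    for key, fallback in defaults.items():
        if key in data:
            styles[key] = data[key]
        elif key in attr:
            styles[key] = attr[key]
        else:
            styles[key] = fallback
    return styles


attr = {'style': 'Figure', 'pstyle': 'Centered', 'id': 'fig1'}
data = {'style': 'Wide Figure'}
defaults = {'style': 'Default Figure', 'pstyle': 'Normal', 'caption': 'Caption'}

styles = pick_styles(attr, data, defaults)
```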

imprint.core.tags.compute_size(tag, attr, data, logger, default_width=None, width_key='width', height_key='height')

Create a dictionary with keys width and height and values that are instances of docx.shared.Length.

Values are resolved according to the rules of get_key, with width_key and height_key as the inputs. String values may contain units, and will be parsed according to get_size.

If neither key is present in either configuration (or present but set to None), set the width to default_width. If that is None as well, return an empty dictionary.

Parser State Objects

The imprint.core.state module supplies the state objects that enable communication within the Engine Layer between the engine itself and the tags. The state is therefore crucial to the XML Tag API without being completely a part of it.

class imprint.core.state.EngineState(doc, keywords, references, log)

A simple container type used by the main parser to communicate document state to the tag descriptors.

Most of the state is dedicated to monitoring the status of the text acquisition from the XML. The engine and built-in tags rely on a set of attributes to function. A description of acceptable use of these attributes is provided here. Any other use may lead to unexpected behavior. Custom tags may define and use any attributes that are not explicitly documented as they choose.

This class allows for a containment check using the in operator in preference to hasattr.

doc

docx.document.Document

The document that is being built. Set once by the engine.

Implemented as a read-only property.

keywords

dict

The keywords configured for this document by the IPC File. Normally, this dictionary should be treated as read-only, but ExprTag can add new entries.

As a rule, keywords with lowercase names are system configuration options, while keywords that start with upper case letters affect document content.

Implemented as a read-only property.

references

ReferenceMap

A multi-level mapping type that allows references to be fetched by role and attribute. Access to this map is performed by providing a tuple (role, attribute, key). For example:

state.references['figure', 'id', 'my_figure']

The map’s values may be of any type, as long as they can be converted to the desired content using str.

The mapping is made immutable as soon as it becomes part of the state. The read-only lock is irreversible.

Implemented as a read-only property.

paragraph

docx.text.paragraph.Paragraph

A paragraph represents a collection of runs and other objects that make up a logical segment in a document. This attribute exists only when parsing a <par> tag. Usually set and unset by ParTag, but can be temporarily switched off and reinstated in response to other tags as well. end_paragraph deletes this attribute.

run

docx.text.run.Run

A run is a collection of characters with similar formatting within a paragraph. This attribute exists only when parsing a <run> tag. Usually set and unset by RunTag. end_paragraph deletes this attribute.

content

io.StringIO

A mutable buffer used by the engine to accumulate text from the XML Template.

Since whitespace needs to be trimmed rather aggressively from an XML file, this object gets an extra (non-standard) attribute:

content.leading_space

Indicates whether or not to prepend a space when concatenating this buffer with others. In general, the text of the first run in a paragraph is the only one that does not have this attribute set to True. This flag is set on the buffer rather than the state object itself so that buffers can be pushed and popped into the content_stack to handle nested tags.

This attribute should be manipulated mostly through the new_content, get_content and flush_run methods.

This attribute must always be present, regardless of the position within the document.

Implemented as a read-write property that cannot be deleted or set to None.
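Attaching a flag to the buffer itself works because io.StringIO instances accept arbitrary attributes. The sketch below shows the idea outside the engine; the helper new_buffer is hypothetical and only mirrors the tri-state leading_space argument documented for new_content.

```python
import io


def new_buffer(old, leading_space=None):
    """Hypothetical helper: make an empty buffer, copying the flag when None."""
    fresh = io.StringIO()
    fresh.leading_space = old.leading_space if leading_space is None else leading_space
    return fresh


first = io.StringIO()
first.leading_space = False      # first run in a paragraph: no space is prepended
first.write('Hello,')

second = new_buffer(first, leading_space=True)   # later text gets a leading space
third = new_buffer(second)                       # None: copy the existing flag
```

Because the flag travels with the buffer, pushing a buffer onto a stack and popping it later preserves its spacing behavior without any extra bookkeeping on the state object.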

content_stack

collections.deque[io.StringIO]

A stack for nested content buffers. Each buffer represents a tag containing independent content. Some tags append to the parent’s buffer, some close the current buffer to start a new one and others, such as <figure>, use a temporary buffer for their content.

The stack allows for a theoretically indefinite level of nesting of text elements. In reality, it will only contain one or two elements: the current run text and the contents of interspersed tags like <figure>.

This attribute should be manipulated through the push_content_stack and pop_content_stack methods.

This attribute may be empty, but never missing. Implemented as a read-only property.

last_list_item

docx.text.paragraph.Paragraph

List items in Word are just paragraphs with a particular style and numbering scheme. All of this information can be gathered from the previous paragraph that was assigned a concrete list numbering instance.

This attribute should never be missing. It should only be None to indicate that no prior numbered paragraph has occurred in the document yet. To this end, it is implemented as a read-only property.

latex_count

int

A counter for the number of <latex> tags encountered so far. Used to generate the file name for the equations if Image Logging is enabled. Missing otherwise.

__contains__(name)

Checks if the specified name represents an attribute.

check_content_tail()

Include any remaining text in content into the last run of the last paragraph.

This ensures that paragraphs get truncated properly, and that spurious text between paragraphs is cleaned up.

A warning is issued if any non-whitespace text is found.

end_paragraph(tag=None)

Terminate the current paragraph.

Any existing run is immediately terminated. Spurious text is appended to the last available run. Both paragraph and run attributes are deleted by this method.

If there is no paragraph to terminate, this method is equivalent to calling check_content_tail.

Parameters: tag (str or None) – The name of a tag that interrupts the paragraph. If present, a warning will be issued. If omitted, no warning will be issued.

flush_run(renew=True, default='')

Flush the text buffer accumulating the current run into the document.

Text flushing aggressively removes whitespace from around individual lines. A single space character is prepended before the text if content.leading_space is True.

If not inside a run, this is a no-op.

Parameters:
  • renew (bool) – Whether or not to create a new text buffer when finished. This is generally a good idea, since the content will already be in the document, so the default is True. The new buffer has leading_space set to True.
  • default (str) – The text to insert if the current content buffer is empty. Defaults to nothing ('').

get_content(default='')

Retrieve the text in the current content buffer.

Whitespace is stripped from each line in the text, which is then recombined with spaces instead of newlines.

If the buffer is empty (or contains only whitespace), return default instead.

If the text is non-empty, and content has leading_space set to True, prepend a space.
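The whitespace rules above can be sketched as a standalone function. The name collapse is hypothetical, the text and flag are passed in explicitly rather than read from the state, and dropping lines that become empty after stripping is an assumption.

```python
def collapse(text, leading_space=False, default=''):
    """Hypothetical stand-in for the documented whitespace handling."""
    # Strip every line, drop lines that end up empty (assumption), join with spaces.
    stripped = (line.strip() for line in text.splitlines())
    joined = ' '.join(line for line in stripped if line)
    if not joined:
        return default
    return ' ' + joined if leading_space else joined


text = collapse('  first line \n     second line\n', leading_space=True)
```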

image_log_name(id, ext='')

Create an output name to log an image (or data), for a Data Configuration with the given ID, and an optional extension.

This is the standard name-generator for any component (tag descriptor or plugin handler) that enables image logging in response to log_images.

The base name is the result of concatenating an extension-less log_file (or output_docx if not set), with id, separated by an underscore. ext is appended as-is, if provided.
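As an illustration of the naming rule, the sketch below builds the name from explicit arguments. The function log_name is hypothetical: the real method reads log_file and output_docx from the engine configuration, whereas here they are ordinary parameters.

```python
import os.path


def log_name(id, ext='', log_file=None, output_docx=None):
    """Hypothetical stand-in: build '<base>_<id><ext>' from a configured file name."""
    # Prefer log_file; fall back to output_docx when it is not set.
    source = log_file if log_file else output_docx
    base, _ = os.path.splitext(source)
    return f'{base}_{id}{ext}'


name = log_name('fig1', '.png', log_file='out/report.log')
```

Since ext is appended as-is, the caller is responsible for including the leading dot.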

inject_par(style='Default Paragraph Font', pstyle='Normal', text='')

Insert a new paragraph into the document with the specified styles and text, and return it.

The contents of the paragraph will be a single run with the specified text. Any previously existing paragraph and run will be terminated (see end_paragraph) and reinstated with their prior styles once the new content is inserted.

Parameters:
  • style (str) – The name of the character style to use for the inserted run.
  • pstyle (str) – The name of the paragraph style to apply to the new paragraph.
  • text (str) – The optional text to insert into the new run.
Returns:

  • par (docx.text.paragraph.Paragraph) – The newly created paragraph. This will be a temporary object that is never set as paragraph.
  • run (docx.text.run.Run) – The newly created run. This will be a temporary object that is never set as run.

insert_picture(img, flush_existing=True, style='Default Paragraph Font', pstyle='Quote', **kwargs)

Insert an image into the current document.

Images must be inserted into a run, so the following cases are recognized:

Outside <par>
Create a new temporary Paragraph and a new Run. Neither object is retained (i.e. in paragraph and run).
Inside <par> but outside <run>
Create a new temporary Run, which will not be retained.
Inside <run>
If the requested style matches the style of the current run, it will be flushed and extended. Otherwise, the current run will be interrupted by a temporary run with the new style, and then reinstated.

It is an error to have a run outside a paragraph.

Parameters:
  • img (str or file-like) – The image can be the name of a file on disk, or an open file (including in memory files like io.BytesIO). In the latter case, the file pointer must be at the beginning of the image data.
  • style (str) – The name of the Character Style to apply to a new run.
  • pstyle (str) – The name of the Paragraph Style to apply if a new paragraph needs to be created.

Two additional keyword-only arguments can be supplied to add_picture: width and height.

interrupt_paragraph(warn=None)

A context manager for interrupting the current run/paragraph and resuming it when complete.

The current paragraph and run are ended before the body of the with block executes. They are reinstated afterwards, if they existed to begin with, with the same styles as before.

Parameters: warn (str, bool or None) – If a boolean, determines whether or not to issue a generic warning if a paragraph is actually interrupted. If a string, it is interpreted as the name of the tag that is interrupting the paragraph, and mentioned in the warning. No warning will be issued if falsy. Defaults to None.

log(lvl, msg, *args, **kwargs)

Provide access to the engine’s logging facility.

Usage is analogous to logging.log. XML location meta-data will be inserted into any log messages.

new_content(leading_space=None)

Update the content text buffer to a new, empty StringIO.

Calling this method is faster than doing a seek-truncate according to http://stackoverflow.com/a/4330829/2988730.

Parameters: leading_space (tri-state bool) – If None, copy leading_space from the current content. Otherwise, set to the provided value. The default is to copy the existing value.

new_run(tag, style='Default Paragraph Font', pstyle='Normal', check_in_par=True, keep_par=True)

Create a new run.

This method handles cases when a run is requested outside a paragraph, or inside an existing run:

  • Nested runs are forbidden, but run injection is not.
    • Existing content is flushed for injected runs.
  • Runs outside a paragraph will generate a temporary paragraph with a default style.
    • Missing paragraphs can optionally raise a warning.
    • The temporary paragraph can optionally be retained as the current paragraph.

Parameters:
  • tag (str) – The name of the tag requesting the run. If there is already a run attribute present, setting tag='run' will raise an error because of nesting.
  • style (str) – The name of the style to use for the new run.
  • pstyle (str) – The name of the style to use for a new paragraph, if one has to be created. Moot if there is already a paragraph attribute.
  • check_in_par (bool) – Whether or not to warn if not in a paragraph. Defaults to True.
  • keep_par (bool) – Whether or not to retain a newly created paragraph object in the paragraph attribute. Moot if there is already a paragraph attribute.
Returns:

  • par (docx.text.paragraph.Paragraph) – The paragraph that the run was added to. If keep_par is True or there was already a paragraph attribute set, this will be the paragraph attribute.
  • run (docx.text.run.Run) – The newly created run. This will be set to the run attribute unless there is no existing paragraph attribute, and keep_par is set to False.

Notes

Setting keep_par to False for a <run> tag outside a paragraph will cause a situation where run is set but paragraph is not. This may cause a problem for the engine, but should never arise with the builtin parsers.

number_paragraph(list_type, level)

Turn the current paragraph into a list item, and store it into last_list_item.

The exact numbering scheme depends on last_list_item, which will be updated to refer to the current paragraph when this method completes.

The following behaviors occur in response to list_type:

list_type    Behavior
None         Not a list paragraph. Do not set numbering or change last_list_item.
CONTINUED    Same type and numbering as last_list_item. Set last_list_item.
NUMBERED     Start a new numbered list. Set last_list_item.
BULLETED     Start a new bulleted list. Set last_list_item.

Parameters:
  • list_type (ListType or None) – The type of list to number with, if at all.
  • level (int or None) – The depth of the list indentation. None means to follow the level of the previous list item, if any, or use zero depth.

pop_content_stack()

Reinstate the previous level of the content_stack to the current content.

Calling this method on an empty stack will cause an error. The current content is completely discarded.

push_content_stack(flush=False, leading_space=False)

Temporarily create a new text buffer for the content.

If flush is True, the old buffer is flushed to the document and cleared before being pushed to the content_stack, so the buffer that is reinstated when the new one is popped will have leading_space set to True. If flush is False, the existing buffer is pushed unchanged.

The new buffer can have its leading_space attribute configured by the leading_space parameter, which defaults to False.
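The push and pop behavior can be pictured with a small stand-in built on collections.deque. The class Buffers below is hypothetical and only models the buffer bookkeeping: the flush option is omitted because there is no document to write to.

```python
import io
from collections import deque


class Buffers:
    """Hypothetical stand-in for the content/content_stack pair."""

    def __init__(self):
        self.content = self._new(False)
        self.content_stack = deque()

    @staticmethod
    def _new(leading_space):
        buf = io.StringIO()
        buf.leading_space = leading_space
        return buf

    def push(self, leading_space=False):
        # Park the current buffer and start a fresh one for nested content.
        self.content_stack.append(self.content)
        self.content = self._new(leading_space)

    def pop(self):
        # The current buffer is discarded; an empty stack raises IndexError.
        self.content = self.content_stack.pop()


state = Buffers()
state.content.write('run text')
state.push()                       # e.g. on entering a tag with its own content
state.content.write('caption')
nested = state.content.getvalue()
state.pop()                        # back to the text of the surrounding run
```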

temp_run(style='Default Paragraph Font', pstyle='Normal', keep_same=False)

Create a temporary run in the current context.

The run and paragraph styles will be preserved after the context manager exits. If the run is injected outside a paragraph, a temporary paragraph will be created and forgotten.

Within the context manager, both paragraph and run are guaranteed to be set. run will have the style named by style, but paragraph will only have the style named by pstyle if it is a temporary paragraph.

All content is flushed into the temporary run when this manager exits.

Parameters:
  • style (str) – The style of the new run.
  • pstyle (str) – The style of a new paragraph to contain the run. Used only if paragraph is unset.
  • keep_same (bool) – If True, and a run already exists, and has the same style as this one, retain it instead of making a new one. If False (the default), always create a new run.

class imprint.core.state.ReferenceState(registry, log, heading_depth=None)

A simple container type used by the reference parser to communicate state to the reference descriptors and accumulate the reference map.

Most of the state is dedicated to monitoring referenceable tags and creating references to them. The engine and built-in tags rely on a set of attributes to function properly. A description of acceptable use of these attributes is provided here. Any other use may lead to unexpected behavior. Custom tags may define and use any attributes that are not explicitly documented as they choose.

This class allows for a containment check using the in operator in preference to hasattr.

registry

Mapping

A subtype of dict that follows the same rules as tag_registry. Normally a reference to that attribute.

Implemented as a read-only property.

references

ReferenceMap

A multi-level mapping type that allows references to be fetched and set by role and attribute. Access to this map is performed by providing a tuple (role, attribute, key). For example:

state.references['figure', 'id', 'my_figure']

The map’s values may be of any type, as long as they can be converted to the desired content using str.

The map is mutable at this stage in the processing. It accumulates all the referenceable tags found in the document. Setting a value for a key any of whose levels do not exist is completely acceptable: the missing levels will be filled in.

Implemented as a read-only property.

heading_depth

int

The configured depth after which heading_counter stops having an effect when a subheading is entered. If omitted entirely (None), all available heading levels will be used.

Implemented as a writable property.

heading_counter

list[int]

A list containing counters for each heading level encountered. The list is popped back one element whenever a higher level heading is encountered. len(heading_counter) is the depth of the outline the parser is currently in. E.g., if the parser is parsing text under Section 3.4.5, heading_counter contains [3, 4, 5]. When Section 4 is encountered next, the counter will be reset to [4]. The heading may be referenced later by title or by ID.

A deque is not used because it does not support slice deletion, which makes jumping back a few heading levels much easier.
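The difference is easy to demonstrate with the standard library alone. The snippet below replays the Section 3.4.5 to Section 4 example on a plain list and shows that the same slice deletion is rejected by collections.deque.

```python
from collections import deque

counter = [3, 4, 5]      # currently under Section 3.4.5
del counter[1:]          # jump back to the top level in one step
counter[0] += 1          # Section 3 becomes Section 4

d = deque([3, 4, 5])
try:
    del d[1:]            # deque only accepts integer indices
except TypeError:
    deque_supports_slices = False
else:
    deque_supports_slices = True
```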

Implemented as a read-only property.

item_counters

dict[str -> int]

A mapping of the referenceable roles to the counters of items in the current heading. All the counters are reset to zero when a new heading below heading_depth is encountered.

Implemented as a read-only property. The keys of the mapping should not be modified, but the values may be.

content

io.StringIO

A mutable buffer used by the engine to accumulate text from the XML Template only when necessary.

This attribute should be manipulated mostly through the start_content and end_content methods. It should only be present for tags that care about accumulating content for a reference, like <par>. When present, all content, regardless of nested tags, will be accumulated.

__contains__(name)

Checks if the specified name represents an attribute.

end_content()

Terminate the current content buffer, if any, and return the content after aggressive stripping of whitespace.

If there is no content buffer to begin with, an empty string is returned.

format_heading(prefix=None, prefix_sep=' ', sep='.', suffix_sep='-', suffix=None)

Format heading_counter for display.

If suffix is set to a truthy value, only heading_depth items are shown. Otherwise, the entire list is shown.

get_content(default='')

Retrieve the text in the current content buffer.

Whitespace is stripped from each line in the text, which is then recombined with spaces instead of newlines.

If the buffer is non-existent, empty or contains only whitespace, return default instead.

heading_counter

Ensure that heading_counter is read-only.

heading_depth

Ensure that heading_depth is set to a legitimate value.

increment_heading(level)

Increment heading_counter at the requested level.

Any missing levels are set to 1 with a warning. Any further levels are truncated. item_counters is reset if heading_depth is unset or a greater value than level.
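The counter update can be sketched on a plain list. The function increment is a hypothetical stand-in: it assumes that level is 1-based, takes the warning callback as an argument, and does not model the reset of item_counters.

```python
def increment(counter, level, warn=print):
    """Hypothetical stand-in: bump counter (a list of ints) in place at a 1-based level."""
    if level <= len(counter):
        del counter[level:]              # truncate any deeper levels
        counter[level - 1] += 1
        return
    missing = level - 1 - len(counter)   # intermediate levels that were skipped
    if missing > 0:
        warn(f'{missing} heading level(s) skipped before level {level}')
        counter.extend([1] * missing)
    counter.append(1)                    # first heading at the requested level


section = [3, 4, 5]
increment(section, 1)                    # Section 3.4.5 is followed by Section 4
```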

item_counters

Ensure that item_counters is read-only.

log(lvl, msg, *args, **kwargs)

Provide access to the engine’s logging facility.

Usage is analogous to logging.log. XML location meta-data will be inserted into any log messages.

registry

Ensure that registry is read-only.

reset_counters()

Set all the values of item_counters to zero.

start_content()

Create a new content buffer.

If a buffer already exists, a warning is issued (even if it is empty), and its contents are discarded.

class imprint.core.state.ReferenceMap

A multi-level mapping that stores references in the values.

Values are accessed through a three-level key (role, attribute, key): For a given role, the type of key is determined by the attribute that names the target. Most tags only support attribute='id', but <segment-ref> also supports attribute='title'. key is the actual value of the attribute that is used to identify the reference.

Reference values can be any object whose __str__ method returns the correct replacement text for the reference.

__contains__(key)

Checks if this mapping has the specified partial key.

Key may be a single string or a tuple with a length between 1 and 3. Checks will be made for the appropriate depth.

__getitem__(key)

Retrieve the value for the specified three-level key.

static __new__(cls, *args, **kwargs)

Ensure that the map is unlocked when it is first created.

This way calling __init__ is not a trick for unlocking the map.

__setitem__(key, value)

If this mapping is not locked, set the attribute for the specified three-level key.

If any of the levels are new, they are created along the way.

__str__(indent=2)

Creates a pretty representation of this map, with indented heading levels.

lock()

Lock this mapping to prevent unintentional modification.

This is a one-time operation. There is no way to unlock. After locking, __setitem__ will raise an error.
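The combination of three-level keys, partial containment checks and one-way locking can be modeled with nested dictionaries. The class LockableMap below is a hypothetical stand-in rather than the ReferenceMap implementation, and the choice of TypeError for writes after locking is an assumption.

```python
class LockableMap:
    """Hypothetical stand-in for a three-level, lockable reference mapping."""

    def __init__(self):
        self._data = {}
        self._locked = False

    def __setitem__(self, key, value):
        if self._locked:
            raise TypeError('mapping is locked')  # assumed error type
        role, attribute, name = key
        # Missing levels are created along the way.
        self._data.setdefault(role, {}).setdefault(attribute, {})[name] = value

    def __getitem__(self, key):
        role, attribute, name = key
        return self._data[role][attribute][name]

    def __contains__(self, key):
        # Accept a bare string or a tuple of one to three levels.
        levels = (key,) if isinstance(key, str) else tuple(key)
        node = self._data
        for level in levels:
            if not isinstance(node, dict) or level not in node:
                return False
            node = node[level]
        return True

    def lock(self):
        self._locked = True


refs = LockableMap()
refs['figure', 'id', 'my_figure'] = 'Figure 1'
refs.lock()
label = str(refs['figure', 'id', 'my_figure'])
```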

class imprint.core.state.ListType

The type of list numbering to use for <par> tags that require it.

BULLETED = 'bulleted'

Start a new bulleted list.

CONTINUED = 'continued'

Continue with the numbering/bullets of an existing list.

NUMBERED = 'numbered'

Start a new numbered list.