Welcome to Imprint’s documentation!

Introduction to Imprint

Welcome to Imprint!

What is Imprint?

Imprint is a framework for automating the generation of similarly-structured documents in MS Office Open XML format (docx). Its goal is to provide a robust, repeatable, and reliable system for generating complex content. It eliminates the issues associated with manually generating repeated content.

An example usage is hardware testing reports. The report for each component has the same structure as all the other reports. However, the numbers, charts and tables have to be obtained from data sets specific to that hardware component.

How Does It Work?

Imprint is structured as a set of components arranged in layers between the user and the final document that it creates.

_images/Imprint Components.png

The components and layers of Imprint

Configuration Layer

The files in the configuration layer are provided for each report. They contain a distillation of all the differences between reports of a given type. IPC files configure the Engine Layer (the essence of Imprint itself). IDC files direct the inputs and behavior of the Plugins Layer.

Templates Layer

Templates are static configuration files. The structure of the document, along with all static text and placeholders for generated content, is laid out in the XML Template file. The styles referenced in the XML are defined in an empty Word document, the DOCX Stub.

The IIF Files serve as a bridge between the Configuration Layer and the Templates Layer. They follow a similar keyword definition format to the configuration, but provide static content that is intended to be shared between reports. Include files are used to clean up redundancy in the Configuration Layer by aggregating static information[1].

Plugins Layer

Content-generation plugins are written to handle the data-specific content of complex tags dynamically. A well-written plugin can be used across multiple document types in an organization. Plugins can generate images, tables and text with values that depend on a dynamic configuration. The type of content a plugin generates, and the interface it follows, are determined by the tag that it supports.

Concretely, plugins are Python classes (or functions) that implement the exact interface laid out by their parent tag (see Plugin API). An introduction to writing plugins is provided in the Writing Plugins tutorial. Live examples can be found throughout any Imprint deployment.

The input data and plugin behavior are defined by an IDC File in the Configuration Layer, so the same plugin can be used to generate all sorts of content based on different configurations. For example, hardware reports will generally contain tables of statistics and some sort of chart or histogram to accompany them. Having both of those plugins share data loading and preprocessing code (and usually their data configuration dictionary as well) guarantees consistent results.
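
The benefit of shared loading code can be sketched in plain Python. The function names and configuration keys below are hypothetical and do not follow the actual Plugin API. The only point is that the table and the histogram are both derived from one loader, so they cannot disagree about the data.

```python
import statistics

def load_samples(config):
    """Hypothetical shared loader: scale the raw samples once, for every plugin."""
    scale = config.get("scale", 1.0)
    return [value * scale for value in config["samples"]]

def stats_table(config):
    """Rows of (label, value) for a statistics table."""
    data = load_samples(config)
    return [("mean", statistics.mean(data)),
            ("min", min(data)),
            ("max", max(data))]

def histogram_counts(config, bin_width):
    """Bin counts for a histogram of the same data, keyed by bin start."""
    counts = {}
    for value in load_samples(config):
        bin_start = (value // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))

# One data configuration dictionary drives both outputs.
data_config = {"samples": [1, 2, 2, 3, 5], "scale": 2.0}
table_rows = stats_table(data_config)
histogram = histogram_counts(data_config, bin_width=4)
```

If the scaling in load_samples were changed, both outputs would change together, which is the consistency described above.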

Engine Layer

The engine is the core of Imprint that runs the entire system. It is responsible for setting up the runtime environments, ingesting all the configuration and directing the operation of all the plugins. The engine is executed through entry points in the Programs.

Output Layer

The final layer is the output. In addition to the main document, Imprint provides an enormous amount of traceability with its Logging output. The log file itself can be set up through the IPC File. Both the name and the logging level are configurable. In addition to the log, all images that are generated for insertion into the document can be stored in separate files as well. This option is also configurable through the IPC File.

Historical Note

How did Imprint come into being?

Around the years 2016-2018, the analysts at the Detector Characterization Lab (DCL) at NASA Goddard Space Flight Center (GSFC) working on the Euclid project were creating reports of all the flight-grade SCAs[2] and SCSs[4]. These reports were on the order of 50 pages each, contained figures and tables describing the analysis of every aspect of the testing being done on each component, and were written individually by hand. Usually, the analysts would start with an existing report as a template, and modify the pictures, numbers and tables based on their results.

This presented a number of issues, all of which could be solved with automation. The size of the report, and the amount of data each one contained, made replacing items both time-consuming and error-prone. This was exacerbated by the fact that the same data was used to generate multiple sets of figures, tables and text elements within a given document. And of course the number of reports being generated made it difficult to keep track of versions and templates. For one thing, it was easy to update one of the figures or tables but forget the other. For another, any typos that were found and corrected in the static text of the document would not always find their way back to all the existing versions, and therefore possibly not into future ones either.

The reports were being used for two purposes. The long-term purpose was to archive the detector data, so that all the test data would be available for in-flight debugging teams. In the short term, the reports were used to communicate test results to the customer, the European Space Agency (ESA). With this set of goals, having minor but persistent errors in the documents was deemed unacceptable, as was the amount of time being spent by qualified analysts in editing Microsoft Word documents.

A program called RepGen was created to solve most of the issues encountered with the generation of such reports. Its primary requirements were to be robust, accurate, reliable, repeatable and traceable. It placed all of the static text into an XML template, making it trivial to fix typos across all reports and revisions at once. The configuration files for a particular report were structured to eliminate redundancy of information, improving traceability. Shared include files, along with a sensible structure of data, were used to turn the creation of new configurations into a two-step copy-and-paste job. The content generation code was technically left up to the group using RepGen. However, code reuse was encouraged here as well, and certainly built into the basic handlers, so that consistent results could be expected from a single dataset used in multiple types of content. Plugins allowed similar types of information about different data sets to be rendered in a consistent format in multiple places in a report. All operations were logged to any level desired, including the generation of all content, so errors and inconsistencies could be found quickly and easily.

RepGen went on to become Electronic New Technology Report (eNTR) #1518805444 at NASA. Imprint is a philosophical child of RepGen. It does not share any of the old code, but it does provide a significantly improved version of the same sort of flexibility as its inspiration.

Where Do I Go From Here?

If you are a new user of Imprint, the recommended place to start is the Tutorials section. The Getting Started page especially will help you get a sense of how to set up an Imprint project for the first time.

The other main area of the documentation, Reference, is for more advanced users. It contains the formal definitions and specifications of the interfaces used by the system.

If you are unsure where to go next, the Main Page is always a good place to start browsing through all of the available topics.

Footnotes

[1]The <expr> tag provides a more limited way to do this as well.
[2]The Sensor Chip Array (SCA) is basically the detector chip.
[3]Sensor Chip Electronics (SCE) is the ASIC used to operate the detector.
[4]The Sensor Chip System (SCS) is the SCA[2] combined with the SCE[3].

Installation Guide

This document explains how to install Imprint.

Installing the Package

PyPI

Imprint is available via PyPI, so the recommended way to install it is:

pip install imprint[all]

The extra [all] installs most of the Dependencies necessary to generate simple images and tables. It can be omitted for a bare-bones install.

Source

Imprint uses setuptools, so you can install it from source as well. If you have a copy of the source distribution, run

python setup.py install

from the project root directory, with the appropriate privileges. A source distribution can be found on PyPI as well as directly on GitHub.

You can do the same thing with pip if you prefer. Any of the following should work, depending on how you obtained your distribution:

pip install git+<URL>/imprint.git@master[all]  # For a remote git repository
pip install imprint.zip[all]                   # For an archived file
pip install imprint[all]                       # For an unpacked folder or repo

See the page about Dependencies for a complete description of additional software that may need to be installed. Using setup.py or pip should take care of all the Python dependencies.

Demos

Imprint is packaged with a set of demo projects intended primarily for the Tutorials. The demos are not normally installed as part of Imprint. Instead, they are to be accessed through the source repository or the Documentation, once that is built. See Demos for a complete list.

Tests

Imprint does not currently have any formal unit tests available. However, running through all of the demos serves as a non-automated set of tests, since they exercise nearly every part of Imprint. Eventually, pytest-compatible tests will be added in the tests package.

Documentation

If you intend to build the documentation, you must have Sphinx installed, and optionally the ReadTheDocs Theme extension for optimal viewing. See the dependencies spec for more details.

The documentation can be built from the complete source distribution by using the specially defined command:

python setup.py build_sphinx

Alternatively (perhaps preferably), it can be built using the provided Makefile:

cd doc
make html

Both options work on Windows and Unix-like systems that have make installed. The Windows version does not require make. On Linux you can also do

make -C doc html

Building the documentation will also make a copy of the Demos.

The documentation is not included in the PyPI source distributions. It is available only directly from GitHub.

Tutorials

The pages in the tutorials section show step-by-step instructions on how to get Imprint up and running. They cover virtually every aspect of the program from the point of view of various types of users. For a quick reference, consult the Reference documents.

For the basic user, there is the Getting Started page. It is the recommended next step for all first-time users. Once you have mastered that, Basic Tutorial will show you a more complete picture.

For the more advanced customization techniques, start with Additional Topics, Part 1, Additional Topics, Part 2, and Styles and Formatting. More advanced subjects, with coding involved, are explored in the Writing Plugins and Writing Custom Tags tutorials.

Developers can probably jump right into the Reference section with the Plugin API and the Tag API.

If you are unsure where to go next, the Main Page is always a good place to start browsing through all of the available topics.

Getting Started

If you are a first-time user, you have come to the right place. This tutorial is the “Hello World!” example for Imprint. It demonstrates the most basic setup, and hopefully explains some of the possible uses for Imprint in doing so. Most of the material shown here is reiterated with more detail in the Basic Tutorial.

Creating a New Project

The easiest way to set up a new project is usually to copy an existing one. If that is not an option, create a new folder for your new project. All of the Paths in a project will be resolved relative to that folder, so it will be self-contained.

If you would like to simulate copying an existing project, download and extract the HelloWorld example. If you would like to start a new project, create a folder named HelloWorld somewhere, and follow along with the rest of this tutorial. Unless otherwise stated, all the files described below exist under the root HelloWorld folder.

Making a Template

First let’s begin by laying out the structure and content of our document in an XML Template. Our basic template for this example will look like this:

HelloWorld.xml: The document content and structure template.
<imprint-template>
    <par style="Normal">
        <run style="Default Paragraph Font">
            Hello <kwd name="What"/>!
        </run>
    </par>
</imprint-template>

Let us inspect the contents of this file tag-by-tag to understand what is going on.

The outermost <imprint-template> tag is necessary to make the XML into an Imprint template.

Document text is arranged into paragraphs, which are surrounded by <par> tags. Our example has only one such tag, and therefore only one paragraph. The paragraph has a Normal style. Paragraphs can contain different <run>s of character-level formatting, but it is fairly standard to have a single run with the Default Paragraph Font style. This style means that all the paragraph-level styling information is left untouched.

Finally, the innermost portion is the text of the paragraph. Our example contains two elements: the literal word Hello, and a <kwd> tag. This tag tells the Engine Layer to perform a keyword replacement. The name of the keyword is What. We will see how to define the value of What next. This value will be placed literally into the document, replacing the <kwd> tag. You can begin to imagine how this could be useful for generating multiple documents from the same template.
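
To make the replacement concrete, here is a minimal sketch of keyword substitution using only the Python standard library. It is not Imprint's implementation: the helper run_text is hypothetical, and collapsing the indentation whitespace of the XML source is an assumption made for readability.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """
<imprint-template>
    <par style="Normal">
        <run style="Default Paragraph Font">
            Hello <kwd name="What"/>!
        </run>
    </par>
</imprint-template>
"""

def run_text(run, keywords):
    """Return the text of a <run>, with each <kwd> replaced by its value."""
    parts = [run.text or ""]
    for child in run:
        if child.tag == "kwd":
            parts.append(str(keywords[child.get("name")]))
        # Text that follows a child element is stored in its tail.
        parts.append(child.tail or "")
    # Assumption: collapse the indentation whitespace of the XML source.
    return " ".join("".join(parts).split())

root = ET.fromstring(TEMPLATE)
paragraphs = [run_text(run, {"What": "World"}) for run in root.iter("run")]
print(paragraphs)
```

Passing a different dictionary, such as {"What": "Imprint"}, yields a different paragraph from the same template.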

Note

Keep in mind that this template is very simple and easy to write. A normal Imprint template is usually quite large, and should be created only once for a large number of documents. Normally, the template will be stored outside the setup directory, where it can be accessed by many configurations.

Creating the Configuration

The second file we will create for this example is the program configuration. This file tells the Engine Layer what to do, in addition to setting up the User-Defined Keywords, like What, required by the template. Here is our configuration file:

HelloWorld.ipc: The document configuration script.
input_xml = 'HelloWorld.xml'
output_docx = 'HelloWorld.docx'
overwrite_output = 'silent'

What = 'World'

This is a simple Python file that defines some Keywords.

Keywords starting with lowercase letters are System Keywords. input_xml and output_docx are both mandatory: Imprint will raise an error and abort immediately without them. The former references the template we just created, while the latter gives the name of the output document.

overwrite_output is an optional system keyword. It tells Imprint what to do if the output already exists. Setting it to 'silent' as we did here tells the engine to overwrite an existing output file without further ado. You can omit this keyword entirely, but the default is to raise an error if output_docx already exists.

Keywords starting with uppercase letters are User-Defined Keywords. Our example only has one user-defined keyword: What. The value of this keyword is used to replace the <kwd> tag in our XML template.

The order of keywords does not matter. You can shuffle them however you want, mix system and user-defined keywords, and generally do whatever seems best. The one exception is that, since this is Python code, keywords can reference each other. In that case, any keywords on the right-hand side of an assignment must be defined before they are referenced for the first time.
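
The ordering rule is ordinary Python name resolution, which the following sketch demonstrates. The keyword Greeting and the helper load_keywords are hypothetical, and using exec is only a stand-in for however Imprint actually loads the file.

```python
GOOD = """
input_xml = 'HelloWorld.xml'
What = 'World'
Greeting = 'Hello ' + What   # What is already defined here
"""

BAD = """
Greeting = 'Hello ' + What   # What is not defined yet
What = 'World'
"""

def load_keywords(source):
    """Execute configuration text and return the names it defines."""
    namespace = {}
    exec(source, {}, namespace)
    return namespace

keywords = load_keywords(GOOD)
# Lowercase initial: system keyword. Uppercase initial: user-defined keyword.
system = {k: v for k, v in keywords.items() if k[0].islower()}
user = {k: v for k, v in keywords.items() if k[0].isupper()}

try:
    load_keywords(BAD)
    error = None
except NameError as exc:
    error = exc   # the reference came before the definition
```

Swapping the two lines in BAD is enough to make it load, while the position of input_xml makes no difference at all.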

All of the paths in configuration files are resolved relative to the folder containing the IPC File. This means that you can copy the entire folder, make some modifications to the configuration, and run it to get an entirely different and independent document.
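
As a rough illustration of this rule, a relative path can be joined to the folder of the configuration file rather than to the current working directory. The helper name and the /projects location are assumptions for this sketch, and PurePosixPath is used only to keep the result identical on every platform.

```python
from pathlib import PurePosixPath

def resolve(config_file, relative_path):
    """Resolve a configured path against the folder holding the configuration file."""
    return PurePosixPath(config_file).parent / relative_path

original = resolve("/projects/HelloWorld/HelloWorld.ipc", "HelloWorld.docx")
copied = resolve("/projects/HelloWorldCopy/HelloWorld.ipc", "HelloWorld.docx")
print(original)
print(copied)
```

The same configuration text produces an independent output location once the folder is copied, which is what makes a project self-contained.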

Running Imprint

You now have a working setup. imprint is a command-line tool. You can run it by passing in a single argument: the name of the configuration file. Assuming that your current working directory is set to HelloWorld, you can generate your first document by doing

imprint HelloWorld.ipc

That’s it. You should now have a file called HelloWorld.docx. If you open it in MS Word, you will see

_images/HelloWorld Output.png

The output of our first Hello World document.

The output will be the same (and in the same place) regardless of what directory you run imprint from.

In this simple example, we did not show the use of plugins, logging, or any of the other advanced features of Imprint. Look into the other Tutorials, starting with the Basic Tutorial for additional information.

Basic Tutorial

This tutorial offers a more realistic example of how to set up a simple project from scratch than the Getting Started page. For this tutorial, we will create a somewhat contrived, but fairly polished document describing a made-up series of candle flame height measurements.

Project Setup

The files for this tutorial are available in the CandleFlame example. You may choose to download and extract the provided archive, or start with an empty folder named CandleFlame and populate it as the tutorial progresses.

For this tutorial, we will emphasize the differences between the Configuration Layer and the Templates Layer. Our templates (both XML Template and DOCX Stub), are placed in a separate folder named CandleFlame/templates. Normally, this folder would be outside the document configuration entirely, so that it can be shared by multiple documents. The IIF Files will be placed here as well, to emphasize their shared role.

Additional Topics, Part 1

This tutorial covers some of the topics not covered in the Basic Tutorial. The focus here is on the flexibility offered by the configuration files, especially the XML Template and IPC File. A basic understanding of the topics covered in the Basic Tutorial is assumed.

For a tutorial covering topics more targeted towards formatting through the DOCX Stub and plugin usage, see Additional Topics, Part 2.

Project

The project for this tutorial is Games. The discussion will only focus on the relevant portions of the relevant files, so readers are encouraged to download and extract the entire project before delving into the tutorial.

The document created in the project will contain a couple of trivial lists of board games just for illustration. The baseline version can be created using Games.ipc:

imprint Games.ipc

The output Games.docx will look something like this:

_images/Games Output.png

The output of the Games example.

Overriding Keywords

The includes system keyword is not only for IIF Files. It can also be used to modify portions of the IPC File in a traceable and repeatable manner.

Using the fact that included files cannot override keywords that are already defined, we can define an IPC File snippet that just overrides the keywords that we want, and includes everything else from the original file:

caption_counter_depth = 0

includes = ['Games.ipc']

The example shown here is used to modify the caption_counter_depth setting. The same technique can be used equally well to modify other System Keywords as well as User-Defined Keywords. Such modification is useful for testing and to create documents that are closely related to each other in terms of most of their configuration.
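
The override rule can be modeled with plain dictionaries. This is only a sketch of the described behavior, not Imprint's loader: merge_keywords is a hypothetical helper, and the keyword values standing in for Games.ipc are illustrative.

```python
def merge_keywords(own, included_files):
    """Combine a file's own keywords with those of its includes.

    Keywords defined in the including file win; includes only fill gaps.
    """
    merged = dict(own)
    for included in included_files:
        for name, value in included.items():
            merged.setdefault(name, value)
    return merged

# Illustrative stand-ins for Games.ipc and the override snippet.
games = {"caption_counter_depth": 1, "output_docx": "Games.docx"}
override = {"caption_counter_depth": 0}

result = merge_keywords(override, [games])
print(result)
```

Only caption_counter_depth differs from the baseline. Every other keyword is inherited unchanged from the included file.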

Games0.ipc and its siblings Games2.ipc, Games3.ipc, and GamesNone.ipc are revisited in the section on Setting the Caption Counter Depth.

Making Lists

Lists are created by setting the list attribute of the <par> tags. List items are just regular paragraphs with some extra styling added on for bullets or numbering. A sample XML Template with list items looks like this:

Games.xml: The content and structure template, with list items emphasized.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
<imprint-template>
<par style="Title"><run>Sample of Board Games</run></par>
<toc>Contents</toc>
<par style="Heading 1">
    <run>
        Gridded Games
    </run>
</par>
<par>
    <run>
        The following images show the boards used by different types of gridded
        games. <figure-ref id="checkers"/> and <figure-ref id="chess"/> show
        checkers and chess, respectively. These are some of the most common
        games played on a static pre-made board. <figure-ref id="tic_tac_toe"/>
        and <figure-ref id="battleship"/> show tic-tac-toe and battleship
        boards. These are no less ubiquitous, but are generally played on
        hand-drawn boards.
    </run>
</par>
<par style="Heading 2">
    <run>
        Board Games
    </run>
</par>
<par><run>This section lists games played on pre-made boards.</run></par>
<par list="numbered"><run>Checkers</run></par>
<figure id="checkers" handler="imprint.handlers.figure.ImageFile" />
<par list="continued"><run>Chess</run></par>
<figure id="chess" handler="imprint.handlers.figure.ImageFile" />
<par style="Heading 2">
    <run>
        Paper Games
    </run>
</par>
<par>
    <run>
        This section lists games played on hand-drawn paper "boards".
        While pre-made board versions of these games exist, they are
        traditionally played on paper. For strictly board-type games, see
        <segment-ref title="Board Games"/>.
    </run>
</par>
<par list="num"><run style="Default Paragraph Font">Tic Tac Toe</run></par>
<table id="tic_tac_toe" handler="imprint.handlers.table.CSVFile" role="figure" />
<par list="cont" style="List Number"><run>Battleship</run></par>
<figure id="battleship" handler="imprint.handlers.figure.ImageFile" />
</imprint-template>

To start a new list, set the list attribute to either numbered or bulleted. Lines 26 and 43 in the example show how this is done. The full word (which is case-insensitive by the way) can be spelled out, or any prefix of it can be used, as in line 43.

To append elements to a list, set list to continued, as in lines 28 and 45. The list will be continued regardless of how many additional paragraphs or other elements are placed between the list items. For example, the figures on lines 27, 29 and 46 and the table on line 44 do not break up the numbering scheme of the two lists we created:

_images/Games Output List1.png

The first list of games, with figures between list items.

_images/Games Output List2.png

The second list of games, emphasizing the restarted numbering.

Additional information is available in the List Styling tutorial.
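
The numbering behavior described in this section can be sketched as follows. The helpers list_type and number_items are hypothetical and only model the rules stated above: prefixes are matched case-insensitively, a new list restarts the count, and elements between items are ignored.

```python
LIST_TYPES = ("numbered", "bulleted", "continued")

def list_type(value):
    """Expand a case-insensitive prefix such as 'num' or 'Cont' to the full word."""
    matches = [name for name in LIST_TYPES if name.startswith(value.lower())]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown list type: {value!r}")
    return matches[0]

def number_items(elements):
    """Yield (number, text) per list item; the number is None in bulleted lists."""
    counter = 0
    numbered = False
    for list_attr, text in elements:
        if list_attr is None:
            continue                      # figures, tables, plain paragraphs
        kind = list_type(list_attr)
        if kind != "continued":
            numbered = kind == "numbered"
            counter = 0                   # a new list restarts the count
        counter += 1
        yield (counter if numbered else None), text

# The list-related elements of Games.xml, in document order.
elements = [("numbered", "Checkers"), (None, "figure"),
            ("continued", "Chess"), (None, "heading"),
            ("num", "Tic Tac Toe"), (None, "table"),
            ("cont", "Battleship")]
numbering = list(number_items(elements))
print(numbering)
```

The figure and table entries do not interrupt either list, and the second list starts again from one.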

Adding References

The text Figure12 Reference Inline, Figure34 Reference Inline and Segment1 Reference Inline are dynamically generated reference names, which are automatically derived from the position of the figure or heading in the document outline.

The figure references are created by the <figure-ref> tags on lines 12, 14 and 15:

Games.xml, lines 10-18, emphasizing the <figure-ref>s.
10
11
12
13
14
15
16
17
18
    <run>
        The following images show the boards used by different types of gridded
        games. <figure-ref id="checkers"/> and <figure-ref id="chess"/> show
        checkers and chess, respectively. These are some of the most common
        games played on a static pre-made board. <figure-ref id="tic_tac_toe"/>
        and <figure-ref id="battleship"/> show tic-tac-toe and battleship
        boards. These are no less ubiquitous, but are generally played on
        hand-drawn boards.
    </run>

Each <figure-ref> identifies the figure it refers to by its id attribute. This is how most References are identified.

The text reference on line 40 is created by a <segment-ref> tag, which points to paragraphs. Since paragraphs do not normally have an id attribute, they can be referenced by title instead:

Games.xml, lines 39-41, emphasizing the <segment-ref>.
39
40
41
        traditionally played on paper. For strictly board-type games, see
        <segment-ref title="Board Games"/>.
    </run>

The title of a <segment-ref> is the full text of the heading or other paragraph that is being pointed to, with all the extra spaces and line-breaks removed:

Games.xml, lines 20-24, emphasizing the heading title that the <segment-ref> refers to.
20
21
22
23
24
<par style="Heading 2">
    <run>
        Board Games
    </run>
</par>
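
The whitespace handling described for titles can be approximated in one line of Python. The function name is hypothetical, and whether Imprint normalizes titles in exactly this way is an assumption.

```python
def normalize_title(text):
    """Collapse runs of spaces and line breaks into single spaces and trim the ends."""
    return " ".join(text.split())

# The raw text of the heading's <run>, including its indentation.
raw = "\n        Board Games\n    "
title = normalize_title(raw)
print(title)
```

With this normalization, the indented heading text matches title="Board Games" exactly.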

Among builtin tags, <figure> and <par> can be referenced by <figure-ref> and <segment-ref>, respectively. We have not seen <table-ref> tags in the tutorial so far, which reference <table>s. <table-ref> works just like <figure-ref>, but with “Table” in the reference name instead of “Figure”.

Setting the Caption Counter Depth

The formatting of the figure number for <figure-ref> (and table number for <table-ref>) is set by the caption_counter_depth system keyword.

The default caption_counter_depth is 1, meaning that only the top-level heading is considered when counting and naming figures. If we were to change caption_counter_depth to, say, 2:

Games2.ipc: Overriding the caption_counter_depth, setting the depth to 2.
caption_counter_depth = 2

includes = ['Games.ipc']

We would see two elements in the heading level, and the figure counter would restart with every second-level heading instead of just the top level heading. The references that previously looked like Figure12 Reference Inline and Figure34 Reference Inline now look like Figure212 Reference Inline and Figure234 Reference Inline.

This snippet provides additional illustration for Adding Include Files. We can use a similar technique to remove the heading information from the references entirely, by setting caption_counter_depth to zero:

Games0.ipc: Overriding the caption_counter_depth, setting the depth to 0.
caption_counter_depth = 0

includes = ['Games.ipc']

These references show the figure counter for the whole document: Figure012 Reference Inline and Figure034 Reference Inline.

The cases shown here are well-behaved. In the case where caption_counter_depth is 2, all the references live in a heading at least two deep. When it is zero, there can’t be any problems at all. But if caption_counter_depth is set to a number that is greater than the outline depth of the heading containing the reference, the missing levels are ignored:

Games3.ipc: Overriding the caption_counter_depth, setting the depth to 3.
caption_counter_depth = 3

includes = ['Games.ipc']

In this case, the references will look identical to the ones with caption_counter_depth set to 2.

To turn off the truncation of captions entirely, and just count references within each nested level of subheading independently, set caption_counter_depth to None:

GamesNone.ipc: Overriding the caption_counter_depth, unsetting the depth entirely.
caption_counter_depth = None

includes = ['Games.ipc']

The result will be identical to the case where caption_counter_depth is 2 for this particular example as well, but in general, the heading portion of the reference will not be constrained (similar to section headings). The counter will restart for any heading that is encountered in the document.
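
The effect of the different depth settings on the heading portion of a reference can be sketched as below. The function reference_name is hypothetical, the counter is passed in rather than computed, and the "1.3-1" layout is assumed from the reference format quoted elsewhere in this tutorial.

```python
def reference_name(kind, heading_numbers, counter, depth):
    """Build a reference such as 'Figure 1.3-1'.

    heading_numbers: outline numbers of the enclosing headings, e.g. [1, 3]
    counter:         position of the item within the counted scope
    depth:           caption_counter_depth; None means unconstrained
    """
    if depth is None:
        prefix = heading_numbers
    else:
        # Slicing past the end is harmless, so missing levels are ignored.
        prefix = heading_numbers[:depth]
    if not prefix:
        return f"{kind} {counter}"
    return f"{kind} {'.'.join(map(str, prefix))}-{counter}"

for depth in (0, 1, 2, 3, None):
    print(depth, reference_name("Figure", [1, 3], 1, depth))
```

For an item under a second-level heading, depths 2, 3 and None all give the same name, which mirrors the behavior described for Games2.ipc, Games3.ipc and GamesNone.ipc.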

There are plenty of other pathological cases out there in terms of missing heading levels. The reader is assured that Imprint handles all of them consistently, and is left with the exercise of verifying that assertion for themselves. For a starting point, see the obscure PathologicalCases project.

Using Roles

Roles allow tags to impersonate each other as reference targets. The most common usage is to turn tables or equations into figures that can be referenced as “Figure 1.3-1”, rather than being treated as a table or equation.

Our sample template creates such a <table> to describe Tic-Tac-Toe:

Games.xml, lines 43-44, emphasizing the <table> tag.
43
44
<par list="num"><run style="Default Paragraph Font">Tic Tac Toe</run></par>
<table id="tic_tac_toe" handler="imprint.handlers.table.CSVFile" role="figure" />

The reference for this table can only be performed through a <figure-ref> tag, rather than the usual <table-ref>:

Games.xml, lines 13-15, emphasizing the unusual <figure-ref> tag.
13
14
15
        checkers and chess, respectively. These are some of the most common
        games played on a static pre-made board. <figure-ref id="tic_tac_toe"/>
        and <figure-ref id="battleship"/> show tic-tac-toe and battleship

Any tag, whether it is normally referenceable or not, can impersonate a role. For example, all it takes for a <latex> equation to become a figure is the addition of an attribute: role="figure". That being said, not all roles are suitable for every tag. For example, the PathologicalCases project has an example of a <table> that plays the role of a heading with role="par". This introduces the problem that <table> should not contain text, and so normally cannot be referenced by <segment-ref>’s title attribute.
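
One way to picture roles is as a lookup that decides which reference tag can target an element. The mapping below covers only the built-in tags named in this tutorial, and the function reference_tag is a hypothetical illustration rather than part of the Tag API.

```python
# Which reference tag targets which role, per the tags discussed above.
REF_TAGS = {"figure": "figure-ref", "table": "table-ref", "par": "segment-ref"}

def reference_tag(tag, role=None):
    """Return the reference tag that can point at an element, or None."""
    effective_role = role if role is not None else tag
    return REF_TAGS.get(effective_role)

print(reference_tag("table"))                  # a plain table
print(reference_tag("table", role="figure"))   # a table playing a figure
print(reference_tag("latex"))                  # not referenceable by default
print(reference_tag("latex", role="figure"))   # an equation playing a figure
```

An explicit role replaces the default one, which is why the Tic-Tac-Toe table is reached through <figure-ref> and not <table-ref>.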

Additional Topics, Part 2

This tutorial covers some of the topics not covered in the Basic Tutorial. The focus here is on how to set up proper styling through the DOCX Stub and how to utilise plugins to their fullest potential. A basic understanding of the topics covered in the Basic Tutorial is assumed. A passing understanding of the concepts in Writing Plugins may be required for an in-depth understanding.

For a tutorial covering topics more targeted towards content and configuration through XML Template and IPC File, see Additional Topics, Part 1.

Project

The project for this tutorial is Invoice. The discussion will only focus on the relevant portions of the relevant files, so readers are encouraged to download and extract the entire project before delving into the tutorial.

There will also be sections that demonstrate how to work through the MS Word user interface, as well as some XML formatting in a text editor.

The document created in the project will contain a made up customer invoice, along with a letter to the customer. It will look something like this:

_images/Invoice Output.png

The output of the Invoice example.

The project uses two custom plugins and one built-in one to process the data. The plugins are implemented in invoice.py and registered in Company.iif. If you have not done so already, read through the Using Your Plugin portion of the Writing Plugins tutorial.

Image Logging

Images that are generated for the document can be “logged” by copying them into the log directory, or if conventional logging is disabled, into the document output directory. Image logging also applies to strings, LaTeX equations, and sometimes tables (all the common handlers implement it). For common handlers that just insert images or table data as-is into a document, this is not much of an advantage. However, when a figure handler generates a complex image or chart from scratch, it is often useful to have it written to disk as well as used from memory.

Image logging is controlled by the log_images system keyword in the IPC File:

Invoice.ipc, lines 18-19, showing the log_images setting.
18
19
log_file = True
log_images = True

Image logging is not enabled by default. With logging turned on, you will see the following additional files in your output directory:

  • Invoice_authorized_signature.png

    This is the only actual image that is logged. It is a copy of the authorization signature that is inserted by the <figure> tag in the XML Template:

    Invoice.xml, lines 57-65, emphasizing where the signature is generated.
    57
    58
    59
    60
    61
    62
    63
    64
    65
        <par style="Normal">
            <run>Kindest Regards,</run>
        </par>
        <par style="Figure Container">
            <figure id="authorized_signature" handler="imprint.handlers.figure.ImageFile"/>
        </par>
        <par style="Normal">
            <run><kwd name="AuthorizedSigner"/></run>
        </par>
    
  • Invoice_damage_assessment.txt

    This is the output of the <string> tag in the XML Template. Strings are dumped into a text file for inspection, since they are generated content, like images.

    Invoice.xml, lines 25-29, emphasizing where the custom string is inserted.
        <par style="Normal">
            <run>
                <string id="damage_assessment" handler="invoice.damage_assessment"/>
            </run>
        </par>
    
  • Invoice_financial_data.csv

    This is a copy of the financial data that is used to do the damage assessment and to generate the actual invoice. It is generated in response to the <table> tag in the XML Template:

    Invoice.xml, lines 84-92, emphasizing where the invoice table is generated.
        <par style="Normal">
            <run>Transaction Date: </run>
            <run style="Strong"><kwd name="InvoiceDate" format="%Y-%b-%d"/></run>
        </par>
        <table handler="invoice.invoice_table" id="financial_data" style="Plain Table 1" />
        <par style="Post Table">
            <run>Payment in full is due on </run>
            <run style="Strong"><kwd name="DueDate" format="%Y-%b-%d"/></run>
        </par>
    

    Tables are not required to dump their data unless it really makes sense to do so. Due to the relatively flexible structure of tables in Word documents, the plugin itself is responsible for how the data is to be written. Other plugins rely on the tag to do their logging for them.

    Todo

    Some of the last paragraph above probably belongs in the plugin tutorial, not here.

Line and Page Breaks

The built-in tags support two types of breaks: line and page breaks. Both are to be found in the Invoice sample project.

Line Breaks

Line breaks are placed directly in a run of text using the <n> tag:

Invoice.xml, lines 39-44, showing how line breaks are inserted.
    <par style="List Paragraph">
        <run><kwd name="AddressAttn"/><n/>
             <kwd name="Address1"/><n/>
             <kwd name="Address2"/><n/>
             <kwd name="Address3"/></run>
    </par>

The result is a single run of text, but broken over multiple lines in a controlled manner:

_images/Invoice Line Breaks.png

A made up address formatted with explicit line breaks.

Line breaks can only appear in a run of text. If they appear anywhere within a <par> tag, an attempt will be made to find or even create a suitable run for the line break. However, outside a paragraph, <n> gets ignored completely, with a warning.

Page Breaks

Unlike line breaks, page breaks can appear just about anywhere. This includes <run> and <par> tags, as well as the document root.

Page breaks are inserted with a <break> tag:

Invoice.xml, lines 63-72, showing how a page break can be used.
    <par style="Normal">
        <run><kwd name="AuthorizedSigner"/></run>
    </par>

    <break/>

    <!-- Second (Invoice) Page -->
    <par style="Title">
        <run>Customer Invoice</run>
    </par>

The page break in this example separates the signature in the preface letter from the page containing the actual customer invoice. Usually, page breaks appear between paragraphs, as in this example, but that is not a requirement.

When a page break cuts a run or paragraph in two, a new paragraph and/or run with the same style is created on the next page.

Styles and Formatting

This section demonstrates how to apply styles and formatting to the document at every level.

Applying Styles

Paragraphs
Headings
Lists

The Making Lists tutorial explains how to create lists. For simple lists, a default paragraph style is automatically selected, based on whether the list is numbered or bulleted. Anything more complicated will require explicitly setting a style.

A good example of when to use explicit list styles is when a list item contains multiple paragraphs. Consider the following snippet:

Snippet showing an extended list.
<par list="num">
    <run>
        This is an example of a list item containing multiple
        paragraphs.
    </run>
</par>
<par style="List Continue">
    <run>The second paragraph is part of the first list item.</run>
</par>
<par list="cont">
    <run>
        The third paragraph continues the numbering of the list where
        we left off.
    </run>
</par>

The result is a multi-paragraph list item for item #1. If we had not explicitly set a list style on the middle paragraph, its indentation would not have been correct for a list item:

_images/ListStyles Output.png

Detail of an extended list continued over multiple paragraphs.

Todo

Add a big blurb about the fact that this only works because the default list styles are sensibly set in the global defaults file. If not, the most default default list style is actually not very useful (indents by 4 tabs).

Writing Plugins

Imprint is all about customization, and the Plugins Layer is the crux of that customization. But what exactly is a plugin, and how do you write one? This tutorial aims to provide a step-by-step, hands-on introduction to the types of plugins that are supported, and how to write them.

A plugin is a callable that creates the dynamic content that gives Imprint its power. Each plugin fulfills a particular interface, defined by the XML tag that it is bound to. There are three main types of content that can be generated by default: Figures, Tables and Strings. Custom tags that support plugins can be created as well. This advanced topic is covered in the content_tutorial.

While different types of plugins are different from each other, there are a few common features they all share. The first two arguments to each of the built-in Handlers are the dictionary of Keywords and the Data Configuration. The remaining arguments depend on the specific tag. Custom tags are not strictly required, but highly encouraged, to follow this convention.
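As a minimal sketch of this convention, a string plugin might look like the following. The function name, the keyword name and the configuration key are hypothetical, and any tag-specific arguments are left out; only the order of the first two arguments is taken from the description above.

```python
def damage_summary(kwds, config):
    """Hypothetical string plugin: build a sentence from keywords and data config."""
    # kwds: the dictionary of keywords loaded from the IPC/IIF files
    # config: the data configuration mapping for this tag, taken from the IDC file
    customer = kwds.get('CustomerName', 'customer')    # hypothetical keyword
    total = sum(config.get('amounts', ()))             # hypothetical config key
    return 'Total damages owed by {}: ${:,.2f}'.format(customer, total)


text = damage_summary({'CustomerName': 'ACME'}, {'amounts': [100.0, 250.5]})
print(text)
```

Because the plugin only reads the keys it needs, the same data configuration could be passed to other plugins that use different keys.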

Tables

Tables are generally the most complex type of plugin for the builtin tags, since they have to modify the document in-place as they generate their content. This leads to interesting artifacts, like partially generated tables in case of an error. Broken Figures and Strings are entirely replaced by alt-text, but tables will generally be generated up to the point where the error occurred.

It also means that the <table> tag does not handle data logging, instead leaving the task up to the discretion of individual plugins. This is very different from the simpler plugin tags like <figure> and <string>, which handle the data logging in a uniform manner, without delegation to a plugin.

Using Your Plugin

You made a plugin. Now what? How do you use it in the template you just created?

This is a two-step process. First you have to let Imprint and Python know where your plugin lives. Second, you have to refer to the plugin in your template somehow. Both steps are covered in detail in the next sections:

Registering Your Plugin

To register a plugin, you must place it in the Python Path. This is normally done with something like

import sys
sys.path.insert(0, 'path/to/plugin/module')

It is often convenient to put such a registration into a dedicated IIF File.
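A sketch of what such an include file could contain is shown here. The directory is the same placeholder used above, and the variable name is an assumption; it starts with a double underscore so that it stays private to the file. The path is resolved against the current directory because this sketch does not assume that `__file__` is available inside a configuration file.

```python
# Hypothetical contents of an include file such as Company.iif.
import os
import sys

# Placeholder location of the plugin modules; adjust to your project layout.
__plugin_dir = os.path.abspath(os.path.join('path', 'to', 'plugin', 'module'))

# Avoid inserting the same directory twice if the file is loaded again.
if __plugin_dir not in sys.path:
    sys.path.insert(0, __plugin_dir)
```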

Todo

This has been totally changed by the ??? keyword.

Todo

Add an example

Note

Keep the import as import sys rather than from sys import path, since the latter will add a keyword to your namespace, while a module will be ignored after loading.

Referencing Your Plugin

Once a plugin is in your Python Path, you can reference it as you would any other module in your tag’s handler attribute.
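To illustrate what a dotted name in a handler attribute means, the sketch below resolves such a name to a Python object with the standard importlib module. The function resolve_handler is hypothetical, and Imprint's own lookup may differ; the point is only that a handler such as invoice.damage_assessment names the attribute damage_assessment in the importable module invoice.

```python
from importlib import import_module


def resolve_handler(name):
    """Import the object named by a dotted path such as 'package.module.attr'."""
    module_name, _, attr = name.rpartition('.')
    if not module_name:
        raise ValueError('handler must be a fully qualified name: %r' % name)
    return getattr(import_module(module_name), attr)


# 'os.path.join' stands in for a real plugin such as 'invoice.damage_assessment'.
handler = resolve_handler('os.path.join')
```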

Todo

Add an example

Writing Custom Tags

Writing a TagDescriptor

Todo

Add the following:

> This tutorial covers the creation of a basic XML tag. It does not delve into the subject of tags with plugins. This advanced topic is covered in the content_tutorial.

Making Your Tag Built-In

If you end up writing a tag that you believe is generic and useful enough to be built-in, feel free to submit a pull request or patch to the author. Be sure to include all, or at least most, of the following items:

  • A properly documented implementation of your tag in the imprint.core.tags module.
  • A proper entry in XML Template Specification.
  • At least a brief mention of your tag in at least one of the tutorials.
  • Proper tests, once that becomes a thing.

Demos

The tutorials in this documentation rely on a number of small demo projects to illustrate the features of Imprint. The projects are available for download as zip files so that the reader can follow along in the tutorial and experiment on their own. The following is a list of the available demo projects:

Reference

This part of the documentation is the specification of the various components and interfaces of Imprint. For examples and clearer usage instructions, consult one of the pages in the Tutorials section.

Imprint Configuration Files

This page contains a summary of the different files that users must provide to have imprint operate properly. Most of the files have their own reference pages and tutorial sections.

The different types of files are normally referred to by their extension. However, since internally files are always referred to by their full name, none of the extensions listed here are actually mandatory. They are a default choice made for clarity and aesthetics, not functionality.

IPC File

The Imprint Program Configuration (IPC) file is the main script for a given output of imprint. It contains a set of Keywords mapped to values. Some of the keywords reference the other configuration files and configure the Engine Layer; others provide the user-defined data for content generation. The former are referred to as System Keywords, while the latter are User-Defined Keywords.

The file is written using Python syntax. Keywords are normal Python names. All the restrictions that apply to Python variable names apply to keyword names. Traditionally, System Keywords which direct the operation of the Engine Layer start with lowercase letters, while User-Defined Keywords containing per-document data start with uppercase letters. Any keyword starting with a dunder (double underscore / __) is for internal use by the configuration file, and will not be exposed to the core at all. Modules imported into the configuration will not be exposed either.
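The filtering rules above can be sketched as follows. This is an illustration under assumptions, not Imprint's loader: load_keywords is a hypothetical helper that executes configuration text and then drops dunder names and imported modules.

```python
import types


def load_keywords(source):
    """Execute Python-syntax configuration text and return the visible keywords."""
    namespace = {}
    exec(source, namespace)
    return {
        name: value for name, value in namespace.items()
        if not name.startswith('__')                    # dunder names stay private
        and not isinstance(value, types.ModuleType)     # imported modules are dropped
    }


ipc_text = '''
import datetime
__year = 2022
output_docx = 'Invoice.docx'                 # system keyword (lowercase)
InvoiceDate = datetime.date(__year, 1, 20)   # user-defined keyword (CamelCase)
'''
keywords = load_keywords(ipc_text)
print(sorted(keywords))
```

Only output_docx and InvoiceDate survive: the datetime module and the private __year helper are available while the file runs, but neither becomes a keyword.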

Paths

Relative paths are resolved from the directory containing the IPC File, not the current directory. This makes it easy to copy entire configurations to different locations, and have them work out of the box. It also allows a user to generate multiple documents correctly without changing directories, and generally removes any dependence on the current directory.

In particular, this applies to the following system keywords, which are expected to contain a path or paths:

IDC File

An Imprint Data Configuration (IDC) file complements the core configuration of the IPC File by supplying the data configuration mappings for the Plugins Layer. The data configuration is referenced by the data_config keyword.

Like the IPC File, the IDC File uses Python syntax. It follows a similar loading convention of removing any names starting with a dunder (double underscore / __) from the loaded namespace. Unlike the IPC File, recursive includes are not allowed.

Each name in the global namespace of the IDC File corresponds to a plugin configuration. Normally, all the visible names in the file are Python dictionaries, but other mapping types are allowed.

The builtin <figure>, <table> and <string> tags support plugins. The plugins are structured so that unnecessary keys are silently ignored, making it possible to share data configuration across multiple tags. For example, a figure and a table generated from the same data set can share a data configuration, and therefore avoid the redundancy of repeated data source specs.
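A shared configuration of this kind might be written as in the sketch below. Apart from width, which figures accept as an override, every key name here is made up for illustration; the real keys depend on the plugins that read them. The two visible names are assumed to match the id attributes of a table tag and a figure tag.

```python
# Hypothetical contents of an IDC file.
__shared = {
    'source': 'data/financial.csv',   # hypothetical key read by both plugins
    'columns': ['Date', 'Amount'],    # hypothetical key read by the table plugin only
    'width': '6in',                   # used for the figure, ignored for the table
}

# One visible name per plugin tag; both names refer to the same mapping.
financial_data = __shared
financial_chart = __shared
```

The private __shared name is removed when the file is loaded, so only the two configuration names remain visible.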

Configuration Names

Plugin tags in the XML Template are mapped to their configuration objects by a special attribute, usually id. The name of the attribute is set for each plugin’s descriptor.

A missing configuration aborts the generation of its particular content, but does not necessarily constitute a fatal error.

IIF Files

Imprint Include Files (IIF) have the exact same format as the main IPC File. Their purpose is to share content between multiple document configurations, using the includes keyword.

Include files are intended to supplement the main configuration file. The main configuration automatically overrides any duplicate keys that are found in the includes.

Includes may be done recursively. Since the engine does not check for infinite loops, use this feature carefully.

XML Template

The XML template defines the structure and content of the document. A full specification of the XML structure is given in XML Template Specification. Additional features can be added through the XML Tag API. The template is referenced by the input_xml keyword.

DOCX Stub

The DOCX stub is an empty template document that defines all of the styles and formatting. All the styles referenced explicitly in the XML Template, as well as the implicit default styles must exist in the stub. The stub is also responsible for setting up the page numbering, headers and footers. The stub is referenced by the input_docx keyword.

Headers and Footers

Headers and footers in the template may contain any number of keyword replacement directives in their text. These directives are the name of a keyword surrounded by curly braces ({...}), and optionally containing additional formatting instructions in the Python Format String Syntax.

Header and footer keyword replacement occurs as a post-processing step after the output document has been created and written to disk. If a date key is not defined in the IPC File, it is implicitly set to a datetime representation of the current date and time for the duration of this particular post-processing step.

Any directives that can not be replaced for any reason are left exactly as-is in the output document. The following example footer text:

Current Date: {date:%B %d, %Y} ID: {0001-0234-4455-0708}

would result in the following replacement:

Current Date: January 20, 2022 ID: {0001-0234-4455-0708}
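The replacement rule can be approximated with Python's own str.format, as in this sketch. The function replace_directives and its regular expression are assumptions made for illustration; they are not taken from Imprint. Each brace pair is formatted on its own, and any directive that fails to format is returned unchanged.

```python
import re
from datetime import datetime

# A directive is a brace pair with no nested braces inside it.
DIRECTIVE = re.compile(r'\{[^{}]*\}')


def replace_directives(text, keywords):
    """Replace each {name[:format]} directive that can be resolved."""
    def substitute(match):
        try:
            return match.group(0).format(**keywords)
        except (KeyError, IndexError, ValueError, AttributeError, TypeError):
            return match.group(0)   # leave unreplaceable directives untouched
    return DIRECTIVE.sub(substitute, text)


footer = 'Current Date: {date:%B %d, %Y} ID: {0001-0234-4455-0708}'
result = replace_directives(footer, {'date': datetime(2022, 1, 20)})
print(result)
```

The date directive is expanded through the datetime format specification, while the ID in braces is not a known keyword and therefore stays as written.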

Keywords

The engine is configured through the IPC File and the IIF Files it includes. These files supply a set of keywords associated with values. There are two types of keywords: System Keywords and User-Defined Keywords.

System Keywords

System keywords configure the behavior of the engine and plugins. System keywords are conventionally identified by the lowercase_with_underscore naming scheme.

Most system keywords are optional, with sensible defaults used in case they are omitted. There are a few mandatory keywords that will result in an error if they are not supplied:

The following is a complete listing of known system keywords. Custom tags and plugins may define additional keywords (or use existing ones for their own purposes).

caption_counter_depth

The number of elements to include in the caption of a generated figure, table or heading reference before the object number. For example, say we have a figure that is the second figure under heading 1.2.3. Let’s also say that it is the fifth figure under heading 1.2 and the 10th in the first section. In that case, the following table shows the resulting reference captions for different values of this keyword:

caption_counter_depth    caption
None                     1.2.3-2
0                        10
1                        1-10
2                        1.2-5
3+                       1.2.3-2

This keyword is optional. It defaults to 1.
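The table can be reproduced with a short function. This is a sketch under assumptions: reference_caption and its arguments are hypothetical, and the '.' and '-' separators are simply the ones used in the example.

```python
def reference_caption(heading, counters, depth=1):
    """Build a reference caption such as '1.2-5'.

    heading  -- heading numbers of the enclosing section, e.g. (1, 2, 3)
    counters -- counters[d] is the object's number within the first d heading levels
    depth    -- value of caption_counter_depth; None means the full heading depth
    """
    if depth is None or depth > len(heading):
        depth = len(heading)
    prefix = '.'.join(str(n) for n in heading[:depth])
    number = str(counters[depth])
    return prefix + '-' + number if prefix else number


# The figure from the example: second under 1.2.3, fifth under 1.2,
# tenth in section 1 (and tenth in the document in this sketch).
heading = (1, 2, 3)
counters = {0: 10, 1: 10, 2: 5, 3: 2}
for depth in (None, 0, 1, 2, 3):
    print(depth, reference_caption(heading, counters, depth))
```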

data_config

The name of the IDC File that configures the entire Plugins Layer.

This keyword is mandatory if the XML Template specifies content generated by plugins, and completely ignored otherwise.

date

This keyword has no special meaning. However, it is implicitly set to the result of datetime.datetime.now when headers and footers are processed, if not set explicitly to something else. This makes it simpler to include information about the time of generation into the Headers and Footers. The implicitly-defined value is not available at any point besides the final keyword replacement step for Headers and Footers.

This keyword is optional.

file_level

The minimum cutoff level to dump to the file. To dump everything to the file use logging.NOTSET, 0, or 1. The value can be a (case insensitive) string level name, a number or one of the constants in the logging module.

If log_file is missing, this level will be ignored and nothing will be written to a file.

This keyword is optional. It defaults to logging.NOTSET.
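One way to accept all three forms of level is sketched below using only the standard logging module. The helper normalize_level is hypothetical; Imprint may validate levels differently.

```python
import logging


def normalize_level(level):
    """Convert a level name, number or logging constant to an integer."""
    if isinstance(level, str):
        # getLevelName maps a known level name to its numeric value
        value = logging.getLevelName(level.strip().upper())
        if not isinstance(value, int):
            raise ValueError('unknown logging level: %r' % level)
        return value
    return int(level)


print(normalize_level('warning'), normalize_level(0), normalize_level(logging.ERROR))
```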

includes

A sequence of include file names. Include files can only add new keywords to the existing configuration. They do not overwrite any keywords that are already set. It is therefore important that include files are loaded in breadth-first order in the order that they appear in the sequence.

This keyword is optional.
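The loading order described for includes can be sketched as a breadth-first merge. Everything in this block is hypothetical, including the file names and the loader callable. The guard against repeated files is an addition of this sketch and is not something the engine is documented to do.

```python
from collections import deque


def merge_includes(main, load):
    """Merge include files into `main` without overwriting existing keywords.

    main -- keyword dictionary of the main configuration
    load -- callable mapping a file name to that file's keyword dictionary
    """
    merged = dict(main)
    queue = deque(main.get('includes', ()))
    seen = set()
    while queue:
        name = queue.popleft()
        if name in seen:          # loop guard, added by this sketch
            continue
        seen.add(name)
        included = load(name)
        queue.extend(included.get('includes', ()))
        for key, value in included.items():
            merged.setdefault(key, value)   # earlier definitions win
    return merged


files = {
    'a.iif': {'includes': ['c.iif'], 'Company': 'ACME', 'Currency': 'USD'},
    'b.iif': {'Currency': 'EUR', 'Country': 'FR'},
    'c.iif': {'Country': 'US', 'includes': ['a.iif']},
}
main = {'includes': ['a.iif', 'b.iif'], 'Company': 'Initech'}
config = merge_includes(main, files.__getitem__)
```

Here Company keeps the value from the main configuration, Currency comes from a.iif because it is listed before b.iif, and Country comes from b.iif because it is loaded before the more deeply nested c.iif.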

input_docx

The name of the DOCX Stub to use as a style and formatting template in the Templates Layer. All the styles referenced explicitly by the XML Template and implicitly in the User Defaults File must be present in this file. This file must also contain all the required formatting for Headers and Footers.

This keyword is optional. The default is the empty document provided by python-docx.

input_xml

The name of the XML Template file to use as a content and layout template in the Templates Layer. This template must follow the specification laid out in XML Template Specification. It may contain additional tags, loaded through the tags mapping.

This keyword is mandatory.

log_file

The name of the output file to write to. All messages with level greater than or equal to file_level will be written to the named file.

If boolean True, a file with the same name as output_docx, but with a .log extension will be created.

This keyword is optional. If omitted, a log file will not be written, and file_level is ignored.

log_format

A string that determines the contents of each line of the log file. The format of this string is the same as for the fmt attribute of a logging.Formatter. It uses % interpolation syntax, with all the logging.LogRecord attributes as valid keyword replacements.

This keyword is optional. If omitted, the log message will be formatted according to '%(asctime)s - %(name)s - %(levelname)s - %(message)s'.

log_images

Whether or not to log images in separate files, in addition to inserting them into the document. Evaluated as a boolean, regardless of the actual type of the value. It is up to individual tag handlers to respect this setting. This setting is independent of the other logger settings.

This keyword is optional. If omitted, it normally defaults to falsy, but custom tags may choose to interpret it differently.

See also

Image Logging

log_stderr

Whether or not to print log output to the standard error stream. Evaluated as a boolean, regardless of the actual type of the value. If truthy, all messages with level greater than or equal to stderr_level are written to standard error.

This keyword is optional. If omitted, it defaults to falsy, and stderr_level is ignored.

log_stdout

Whether or not to print log output to the standard output stream. Evaluated as a boolean, regardless of the actual type of the value. If truthy, all messages with level greater than or equal to stdout_level are written to standard output.

If falsy, stdout_level is ignored.

If log_stderr is set to truthy along with this keyword, then messages with a logging level greater than or equal to stderr_level will not be sent to standard output.

This keyword is optional. It defaults to truthy.

output_docx

The name of the generated document. If a file with the same name already exists, the program’s behavior is determined by the overwrite_output keyword.

This keyword is mandatory.

overwrite_output

Determines how to handle the case where the file named by output_docx already exists. The following options are recognized:

'raise'
Raise an error and abort.
'rename'
Keep prompting the user for a new file name until they select one that does not already exist. A default suggestion is generated, which can be selected automatically.
'silent'
Overwrite the existing file without further comment.
'warn'
Overwrite the existing file, but with a warning.

Any other value will trigger a fatal error.

This keyword is optional. It defaults to 'raise'.
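The four options could be handled along the lines of the sketch below. The function check_output, the form of the suggested name and the choice to validate the option before checking the file are all assumptions; the engine's actual prompts and error types are not specified here.

```python
import os
import warnings


def check_output(path, overwrite_output='raise', prompt=input):
    """Return the file name to write to, following the overwrite_output rules."""
    if overwrite_output not in ('raise', 'rename', 'silent', 'warn'):
        raise ValueError('invalid overwrite_output: %r' % (overwrite_output,))
    if not os.path.exists(path):
        return path
    if overwrite_output == 'raise':
        raise FileExistsError(path)
    if overwrite_output == 'rename':
        while os.path.exists(path):
            root, ext = os.path.splitext(path)
            suggestion = root + '_new' + ext          # hypothetical default suggestion
            answer = prompt('%s exists. New name [%s]: ' % (path, suggestion))
            path = answer or suggestion               # empty answer accepts the default
        return path
    if overwrite_output == 'warn':
        warnings.warn('overwriting existing file %s' % path)
    return path
```

Passing the prompt as a callable keeps the sketch testable without a terminal, for example check_output(name, 'rename', prompt=lambda message: '') to always accept the suggestion.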

stderr_level

The minimum threshold for messages that go to the standard error stream. This acts as an (exclusive) upper threshold for messages sent to the standard output stream as well. This level does not affect the level being logged to the file. The value can be a (case insensitive) string level name, a number or one of the constants in the logging module.

If log_stderr is missing or falsy, this level will be ignored.

This keyword is optional. It defaults to logging.ERROR.

stdout_level

The minimum threshold for messages that go to the standard output stream. This level does not affect the level being logged to the file. The value can be a (case insensitive) string level name, a number or one of the constants in the logging module.

If log_stderr is truthy, stderr_level provides the exclusive upper threshold for messages sent to standard output.

If log_stdout is falsy, this level will be ignored.

This keyword is optional. It defaults to logging.WARNING.
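The interaction between the two stream levels can be modelled with the standard logging module, as sketched here. The helper names are hypothetical and the defaults mirror the documented ones; this is not the engine's own logging setup.

```python
import logging
import sys


class BelowLevel(logging.Filter):
    """Pass only records strictly below an upper level."""

    def __init__(self, upper):
        super().__init__()
        self.upper = upper

    def filter(self, record):
        return record.levelno < self.upper


def stream_handlers(stdout_level=logging.WARNING, stderr_level=logging.ERROR,
                    log_stdout=True, log_stderr=False,
                    stdout=None, stderr=None):
    """Create handlers that mimic the documented stdout/stderr split."""
    handlers = []
    if log_stdout:
        out = logging.StreamHandler(sys.stdout if stdout is None else stdout)
        out.setLevel(stdout_level)
        if log_stderr:
            out.addFilter(BelowLevel(stderr_level))   # exclusive upper threshold
        handlers.append(out)
    if log_stderr:
        err = logging.StreamHandler(sys.stderr if stderr is None else stderr)
        err.setLevel(stderr_level)
        handlers.append(err)
    return handlers
```

With both streams enabled and the default levels, a warning goes only to standard output and an error goes only to standard error. With log_stderr falsy, standard output receives everything at or above stdout_level.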

tags

Sets up user-defined tags for the XML Template. This is a mapping of tag names to user-defined Tag Descriptors. Values may be strings containing the fully-qualified names of the object to import, or the objects themselves. Both of the values in the following example are valid:

import my.custom.module

tags = {
    'tag1': my.custom.module.descriptor,
    'tag2': 'my.custom.module.descriptor',
}

This keyword is optional.

User-Defined Keywords

User-defined keywords provide the data used to perform keyword replacements for the <kwd> tags in the XML Template. They provide the per-report configuration of the basic content. User-defined keywords are conventionally identified by a CamelCase naming scheme. While the naming is not strictly a requirement, any lowercase_and_underscore name is automatically reserved for use as a system keyword.

Computed Keywords

In addition to direct definition in the IPC File / IIF Files and insertion via the <kwd> tag, keywords can be computed through the <expr> tag in the XML Template. The namespace in which an <expr> tag is evaluated is the existing mapping of keywords defined up to that point. The result is a new user-defined keyword. System keywords placed in an <expr> tag are not guaranteed to work correctly. This form of computation is provided to de-clutter the IPC File, and avoid information redundancy in the frequently edited file.

XML Template Specification

The XML template used by Imprint contains the static portions of the text of the final document, along with all the placeholders for dynamically generated content.

There is no DTD or XMLNS for the template, for two reasons. All validation is done internally by the Imprint core, in a manner that is as lenient as possible. Any errors that can be forgiven, will be, with a warning and a logged message. Additionally, it is possible to use the XML Tag API to extend the capabilities of the core processor without requiring modification of a hard-coded standard.

The XML format used by Imprint does not allow namespaces. Namespace tags will be ignored with a warning, even if they are registered through the XML Tag API.

Warning

All tag and attribute names are case-sensitive. All builtin tags and attributes are lowercase. Names must appear in the XML exactly as shown in the spec.

Root

The file root is always the <imprint-template> tag. That being said, there is a proposal to make it configurable: Configurable XML Root Tag.

Attributes

Normally, each tag has a set of required and optional attributes. Omitting a required attribute immediately triggers a fatal error. Omitting an optional attribute just sets the default value when processing. Any extra attributes that are neither required nor optional are logged but otherwise completely ignored. In the tag descriptions below, all attributes are mandatory, unless suffixed by opt for “optional”.

In addition to the normal attributes that any tag may have, there are attributes that are processed by the engine itself. Currently, there is one such attribute:

role

Define the role of a tag and immediately make it referenceable. The role is the name of another tag that is referenceable by design. Among the builtin tags, <figure>, <table>, and sometimes <par> are referenceable by design. For more details on references, see the relevant section in the Tag API.

Normally, referenceable tags identify the target with an id attribute. Defining a role on a custom tag therefore implies that it must also have an id attribute in that case. Among the builtin tags, <segment-ref> is an exception, in that it requires either an id or a title. A tag with role="par" therefore does not require an id attribute. The rules for custom tags are defined similarly: the check for target identification attributes depends on what the role supports.

Tags

<break>

Insert a page-break. If placed in the middle of a run, this will be a true page break. Otherwise, this will be a section break that starts a new page.

Attributes

None

Content

No Content

<expr>

Evaluate a Python expression and create a new keyword. This tag can appear anywhere in the document. It temporarily suspends normal processing. Any text inside this tag will be evaluated as a Python expression, and the result will be assigned to the named keyword. All existing keywords, including those from prior <expr> tags, are available in the evaluation namespace.

Keywords computed in this manner are treated the same as User-Defined Keywords and take effect as soon as the closing tag is reached, but not before. It is therefore common practice to put all of the expressions at the beginning of the XML Template.

The purpose of this tag is to abstract away common boiler-plate keywords that depend entirely on other keywords into the XML Template to avoid as much redundancy as possible.

System Keywords should never be set with this tag. System values may be used before the XML file is read, and may therefore not work as intended for this and other reasons.

Warning

This tag runs arbitrary Python code, with direct access to the keyword definitions. Avoid making assignments within the tag itself (even implicit ones) unless you really know what you are doing!

Warning

Any coding errors in the content of this tag will cause a fatal error.

Attributes
name : Python Identifier
The name of the new keyword to create.
importsopt : List of module names
A space-separated list of modules to import before evaluating the expression in the tag. Failed imports will be logged as an error.
Content

Text Only
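The behavior of this tag can be pictured with the following sketch, which evaluates an expression against a copy of the keyword dictionary and stores the result under the given name. The function evaluate_expr is hypothetical, and the way imported modules are bound is an assumption. Unlike the tag, the sketch lets a failed import raise instead of logging it.

```python
from importlib import import_module


def evaluate_expr(name, expression, keywords, imports=''):
    """Evaluate `expression` against `keywords` and store the result under `name`."""
    namespace = dict(keywords)           # the expression cannot rebind keywords directly
    for module_name in imports.split():
        # Bind only the top-level package name, as `import a.b` would.
        import_module(module_name)
        top = module_name.partition('.')[0]
        namespace[top] = import_module(top)
    keywords[name] = eval(expression.strip(), namespace)
    return keywords[name]


keywords = {'Subtotal': 200.0, 'TaxRate': 0.07}
evaluate_expr('Total', 'round(Subtotal * (1 + TaxRate), 2)', keywords)
evaluate_expr('TotalCeil', 'math.ceil(Total)', keywords, imports='math')
```

The second call can use Total because the first call has already stored it, which matches the rule that a computed keyword is available only after its tag has been processed.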

<figure>

Generate a figure using the selected handler, and insert it into the document. If Image Logging is enabled, a separate file with the image will be generated as well.

Figures are referenceable through the <figure-ref> tag.

Attributes
id : Python Identifier
The name of the Data Configuration dictionary for the figure. The name must appear in the IDC File. This is also the ID used by the <figure-ref> tag to link back to this tag.
handler : str
The full name of the figure handler class that will generate the content.
styleopt : Character Style
The name of the style of the run containing the figure. The run style can be used to position the image relative to the normal flow of text. Must be defined in the DOCX Stub and be a character style.
pstyleopt : Paragraph Style
The name of the style of the paragraph containing the figure. Must be defined in the DOCX Stub and be a paragraph style.
widthopt : int + {'in', 'px', 'cm', 'mm', 'pt', 'emu'}
The width of the figure. Units are optional, and default to inches ('in'). Suffixes can be separated from the number by optional whitespace.
heightopt : int + {'in', 'px', 'cm', 'mm', 'pt', 'emu'}
The height of the figure. Units are optional, and default to inches ('in'). Suffixes can be separated from the number by optional whitespace.

The attributes handler, style, pstyle, width and height can be overridden by keys with the same name in the Data Configuration for the figure. If neither width nor height is specified, the figure will be inserted as-is. If only one of them is specified, the figure will be scaled proportionally.
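The size syntax and the scaling rule can be sketched as follows. Both helpers are hypothetical, and scaled_size assumes that the natural size and the requested dimension are expressed in the same unit.

```python
import re

# An integer, optional whitespace, and an optional unit suffix.
SIZE = re.compile(r'^\s*(\d+)\s*(in|px|cm|mm|pt|emu)?\s*$')


def parse_size(text):
    """Split a size such as '6 in' into (value, unit); the unit defaults to 'in'."""
    match = SIZE.match(text)
    if match is None:
        raise ValueError('invalid size: %r' % text)
    return int(match.group(1)), match.group(2) or 'in'


def scaled_size(natural, width=None, height=None):
    """Fill in a missing dimension so that the aspect ratio is preserved."""
    natural_width, natural_height = natural
    if width is None and height is None:
        return natural                       # inserted as-is
    if height is None:
        height = natural_height * width / natural_width
    elif width is None:
        width = natural_width * height / natural_height
    return width, height


print(parse_size('15 cm'), scaled_size((400, 200), width=100))
```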

Content

No Content

<figure-ref>

Insert a reference to a <figure>, or another tag playing the role of a <figure>.

The reference will look something like Figure 1.2-1, depending on the configured heading depth and separators.

Attributes
id : Python Identifier
The id of the corresponding <figure>.
Content

No Content

<kwd>

Perform a keyword replacement. Keywords are defined as in the IPC File. The entire tag is replaced with the value of the keyword.

Attributes
name : Python Identifier
The name of the keyword to replace.
formatopt : format_spec
A format specification that can be used to convert the value into a string.
Content

No Content

<latex>

Insert a LaTeX formula into the document as an image. This tag is only available if the appropriate dependencies are installed.

Equations interrupt the current run if their run style does not match the style of the current run.

Attributes
styleopt : Character Style
The name of the style of the run containing the equation. The run style can be used to position the image relative to the normal flow of text. Must be defined in the DOCX Stub and be a character style.
pstyleopt : Paragraph Style
The name of the style to use for the equation’s paragraph, if it appears outside of an existing paragraph. Ignored if this tag appears inside a <par> tag. If used, must be defined in the DOCX Stub and be a paragraph style.
dpiopt : int
The DPI of the output image. Defaults to 96.
formatopt : Image Format
The output format, defaults to 'jpg'.
sizeopt : int or None
The text size, in points, used to render the equation. The default is to let LaTeX decide.
Content

Text Only. The text within the tag is parsed as a LaTeX equation.

<n>

Insert a line-break into the document. Line breaks only make sense within a paragraph, so this tag is ignored with a warning outside <par> tags.

Normally, this tag should appear inside a <run>. If not, the line break will be appended to the previous <run> in the current paragraph, or a new run will be created for it if it appears as the first tag.

Attributes

None

Content

No Content

<par>

Contains a paragraph of text. A paragraph is a collection of runs of differently formatted text, as well as some other elements. A paragraph can be styled with a paragraph-level style. Runs within a paragraph can have additional character-level styling that combines with or overrides the paragraph style.

Paragraphs should appear immediately under the document root to avoid warnings. Paragraphs that do not follow this rule (e.g., by being nested within each other) will be broken up unpredictably with a slew of warnings.

Paragraphs are automatically referenceable if they have a heading style. Non-heading paragraphs must explicitly declare their role to be par just like any non-par tag posing as a heading. References can be made using the <segment-ref> tag.

Attributes
styleopt : Paragraph Style
The name of the style to use for this paragraph. Must be defined in the DOCX Stub and be a paragraph style.
idopt : Reference ID
The ID of this paragraph, if it is being used as the target of a <segment-ref>. If an ID is not supplied, the segment can be referenced only through the title attribute of the <segment-ref>. IDs will be ignored for any non-heading paragraph without an explicit role.
listopt : { continued, bulleted, numbered }

If this paragraph is a list item, set this attribute to one of the allowed values. Options are case insensitive, and can be truncated: bullet and NUM are both examples of valid options as well.

This attribute is required to make a list item. If it is missing, the paragraph will not be bulleted/numbered, even if a list style is applied to it. continued will continue the style/numbering of the previous list item, no matter how many other items were inserted in between. The other options always start a new list with the default style determined by the list type.

list-level (optional) : int
An integer between zero and infinity specifying the depth of the current list item. Numbers are generated automatically. If the paragraph immediately preceding this one is a list item, the depth is preserved by default (as is the style). Otherwise, the default depth for a new list is 1. Missing depth levels are filled in automatically if the depth jumps by an increment of more than 1. Ignored if list is not set.
Content

Tags only. Any spurious text that is found will be placed into a run with the default style, along with a warning.
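
The case-insensitive, truncatable matching described for the list attribute can be pictured as a prefix match. The following is only a sketch: the function name match_list_option is hypothetical, and the treatment of empty or ambiguous values is an assumption, not the engine's documented behavior.

```python
LIST_OPTIONS = ("continued", "bulleted", "numbered")


def match_list_option(value, options=LIST_OPTIONS):
    """Return the full option that `value` abbreviates, ignoring case."""
    key = value.strip().lower()
    # An empty key would match everything, so it is rejected explicitly.
    matches = [opt for opt in options if key and opt.startswith(key)]
    if len(matches) != 1:
        raise ValueError(f"{value!r} does not identify exactly one of {options}")
    return matches[0]
```

With this sketch, both bullet and NUM resolve to their full option names, matching the examples given for the attribute.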

<run>

Contains a run of text, which is normally just characters, with optional keyword replacements. Runs are aggregated into <par> tags. A run can have a character-level style independent from all the other runs in the paragraph.

Attributes
style (optional) : Character Style
The name of the style to use for this run of characters. Must be defined in the DOCX Stub and be a character style.
Content

Text and tags. Runs should always appear directly inside a <par> tag. Nested <run> will cause a fatal error. Runs outside a <par> tag will cause a warning and an implicit paragraph to be placed around them. Most other tags are allowed in a run, but may interrupt the run, to be resumed after with the same character style.

<section>

Introduces a new section into the document. Sections define the page parameters in the document. This tag begins a new section (rather than enclosing a section), which will continue until the next <section> tag or the end of the document.

Must appear outside any <par>, or a warning will be issued, and any surrounding run and paragraph will be broken, to be resumed on the following page with the same styles.

Attributes
orientation (optional) : { 'Portrait', 'Landscape' }
The page orientation of this section. Values are case-insensitive.

The supported attributes for this tag may be expanded in the future.

Content

No Content

<segment-ref>

Insert a reference to a <par> with a heading style, or another tag playing the role of a heading <par>.

The reference will look something like Section 1.2-1: Title, depending on the configured prefix, heading depth and separators.

Attributes
id (optional) : Python Identifier
The id of the corresponding <par>.
title (optional) : String
The actual text of the corresponding <par>.

One of id and title must be present. If both are present, they must refer to the same target, or a fatal error will occur.

Content

No Content

<string>

Generates a dynamic string based on the selected handler. Strings are expected to appear within a <run>. Any other location will generate a warning.

This tag is similar to <kwd>, except that it creates content based on a dynamic runtime configuration rather than just the static mapping of keywords.

Attributes
id : Python Identifier
The name of the Data Configuration dictionary for the string. The name must appear in the IDC File.
handler : str
The full name of the string handler class that will generate the content.
Content

No Content

<table>

Generates a table using the selected handler. Tables are constructed directly in the document, so any errors generated by the handler will result in a table stub along with the alt-text being placed in the document.

Tables are stand-alone entities. If this tag appears inside a <run> or <par> tag, a warning will be logged, and the paragraph and character styles will be resumed as necessary after the table.

Tables are referenceable through the <table-ref> tag.

Attributes
id : Python Identifier
The name of the Data Configuration dictionary for the table. The name must appear in the IDC File. This is also the ID used by the <table-ref> tag to link back to this tag.
handler : str
The full name of the table handler class that will generate the content.
style (optional) : Table Style
The name of the style to use for this table. Must be defined in the DOCX Stub and be a table style.
Content

No Content

<table-ref>

Insert a reference to a <table>, or another tag playing the role of a <table>.

The reference will look something like Table 1.2-1, depending on the configured heading depth and separators.

Attributes
id : Python Identifier
The id of the corresponding <table>.
Content

No Content

<toc>

Insert a Table of Contents (TOC) into the document. Must appear outside any <par>, or a warning will be issued, and any surrounding run and paragraph will be broken, to be resumed after the TOC with the same styles.

Attributes
min (optional) : int
The minimum heading level that the TOC supports. Defaults to 1.
max (optional) : int
The maximum heading level that the TOC supports. Defaults to 3.
style (optional) : Paragraph Style
The name of the style to use for the heading paragraph within the TOC. Must be defined in the DOCX Stub and be a paragraph style.

Content

Text Only. The text will be aggregated without line breaks and used as the heading of the TOC. If omitted, defaults to nothing.

Extensions

Additional tags may be registered through the XML Tag API. New tags may not conflict with existing names, but otherwise have no real restrictions.

Glossary

The following terms are used frequently throughout this document:

error
A logged message that means that the current operation was aborted. The remainder of the document will still be processed.
fatal error
An error that is unrecoverable. In addition to being logged and aborting the current operation, the remainder of the document will not be processed.
Image Format
A short string indicating an image format for conversion tools. Common formats include 'jpg', 'png', 'bmp', etc. Most Imprint features will default to either JPG or PNG format.
No Content
Nesting a tag or placing text in a tag that has this content description will cause a fatal error. The tag must effectively be of the form <tag/> or <tag></tag>. Whitespace is not considered to be content, so it may be present between an opening and closing tag.
referenceable
A tag is referenceable if it has a role attribute, or if it has reference functionality built into it. For more information on references, see the corresponding section in the tag API description: References.
Text Only
Nesting a tag in a tag that has this content description will cause a fatal error.

Plugin API

All complex custom content in Imprint is generated by the plugins. Plugins are implemented by special configurable callable objects called handlers that follow a specific interface, which allows them to be referenced by the appropriate tags in the XML Template.

Three types of content are supported out of the box: Figures, Tables and Strings. Each type of plugin accepts a mapping of keywords from the IPC File (and the <expr> tags in the XML Template), and a dictionary of data configuration values from the IDC File that defines the behavior of the plugin. Beyond that, each type of handler has a different interface.

In fact, any custom TagDescriptor may define its own plugin interface. What makes a tag pluggable is its reliance on a function that accepts a data configuration. This technically makes the plugin API an implementation of a very distinct part of the XML Tag API.

Data Configuration

The data configuration is the second argument to every handler. The data configuration is a mapping set for every plugin in the IDC File. The name of the configuration dictionary is in the id attribute of the corresponding <figure>, <table> or <string> placeholder tag in the XML Template. Custom tags may be registered as configurable plugins by setting the data_config attribute of their TagDescriptor.

Data configuration values can contain any type of values, as long as they are meaningful to the plugin. Plugins may require some keys to be present in the configuration, and should raise a KnownError, optionally caused by a KeyError in response to missing keys. Most plugins will require some sort of data source, such as a file name, but again, this is not required.
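
A minimal sketch of the missing-key convention described above. The KnownError class defined here is a local stand-in so that the example runs without Imprint installed (the real class is documented as imprint.core.KnownError), and the helper name require is hypothetical.

```python
class KnownError(Exception):
    """Local stand-in for imprint.core.KnownError, so the sketch runs on its own."""


def require(config, *keys):
    """Return the values of `keys` from `config`, or raise KnownError."""
    try:
        return tuple(config[key] for key in keys)
    except KeyError as err:
        # Chain the KeyError so that it is recorded as the cause.
        raise KnownError(f"Missing required configuration key {err}") from err
```

A handler could call require(config, 'file') at the top of its body, so that a missing key is reported as a known condition rather than as an unexpected crash.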

Some values are special, in that they can override XML attributes used by the TagDescriptor. In particular, the handler attribute can be overridden by a key with the same name in the IDC File. Overridable values are noted for each builtin tag in the XML Template Specification.

Handlers

Handlers are named by the handler attribute of the corresponding <figure>, <table> or <string> placeholder tag in the XML Template. The exact class name (including package) is searched for the handler. If not found, a prefix of imprint.handlers is prepended to the nominal package name.

The handler can be overridden in the Data Configuration dictionary. Normally, all configuration keys are interpreted directly by the handler. However, a special handler key will be processed before a handler is found, and can override the setting in the XML. This mechanism is provided by each of the Tag Descriptors for Figures, Tables and Strings. It allows for more flexible debugging, and modification of existing templates. New Tag Descriptors can use the get_handler function to implement the same functionality, although it is not strictly required.
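
The lookup and override rules can be approximated as follows. This is a sketch of the documented behavior, not the code behind get_handler: the names resolve_handler and select_handler are hypothetical, and the exact set of exceptions that trigger the fallback is an assumption.

```python
from importlib import import_module


def resolve_handler(name, prefix="imprint.handlers"):
    """Import the object called `name`, retrying under `prefix` on failure."""
    for full_name in (name, f"{prefix}.{name}"):
        module_name, _, attr = full_name.rpartition(".")
        try:
            return getattr(import_module(module_name), attr)
        except (ImportError, AttributeError, ValueError):
            continue  # not found under this name: try the next candidate
    raise LookupError(f"No handler named {name!r}")


def select_handler(attr, config, prefix="imprint.handlers"):
    """Let a 'handler' key in the data configuration override the XML attribute."""
    return resolve_handler(config.get("handler", attr["handler"]), prefix)
```

The prefix parameter exists only so that the sketch can be exercised against any importable package.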

Figures

Some built-in figure handler examples can be found in imprint.handlers.figure.

Handler Signature
handler(config, kwds[, output])

Generate an image based on the Data Configuration. If an output is specified, it will be a string or file-like. A string indicates an output file name, which the handler may modify and return. A file-like can be assumed to be open for binary writing, with random access enabled. It should be rewound before being returned.

Parameters:
  • config (dict) – The Data Configuration for the figure.
  • kwds (dict) – The keyword dictionary for the figure.
  • output (str or file-like or None) – The name of the output file, or the output file to save the figure to. If omitted, the output must go to an in-memory file-like object like io.BytesIO. The handler may determine the output format based on the file extension, but this is not required. Each handler should have a default format for extensionless files and omitted output.
Returns:

Either the actual output file name, or an in-memory file-like object, rewound to the beginning, containing the image. A string output will not necessarily be the input file name. It may, for example, have an extension appended to it. A None return value indicates an internal non-fatal error.

Return type:

str or file-like
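
As an illustration of the figure interface, here is a small, self-contained handler sketch that produces a one-pixel grayscale PNG using only the standard library. The handler name gray_pixel and the level configuration key are invented for this example, and accepting path-like objects in addition to strings is an assumption.

```python
import io
import os
import struct
import zlib


def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC-32 of type + data."""
    return (struct.pack(">I", len(data)) + kind + data
            + struct.pack(">I", zlib.crc32(kind + data)))


def gray_pixel(config, kwds, output=None):
    """Hypothetical figure handler: a 1x1 grayscale PNG of the configured level."""
    level = int(config.get("level", 255))  # 'level' is an invented key (0-255)
    png = (b"\x89PNG\r\n\x1a\n"
           + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
           + _chunk(b"IDAT", zlib.compress(bytes([0, level])))
           + _chunk(b"IEND", b""))
    if output is None:
        output = io.BytesIO()  # omitted output goes to an in-memory file
    if isinstance(output, (str, os.PathLike)):
        name = os.fspath(output)
        if not os.path.splitext(name)[1]:
            name += ".png"  # default format for extensionless names
        with open(name, "wb") as file:
            file.write(png)
        return name  # may differ from the name that was passed in
    output.write(png)
    output.seek(0)  # rewind before returning
    return output
```

Calling gray_pixel({}, {}) returns a rewound io.BytesIO, while passing a file name without an extension returns that name with '.png' appended.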

Tables

Some built-in table handler examples can be found in imprint.handlers.table.

Handler Signature
handler(config, kwds, doc, style, *, image_log_name=None)

Generate a table based on the Data Configuration. The handler is responsible for generating a table of the correct size and styling it properly based on the style parameter.

This type of plugin is expected to have no return value.

Parameters:
  • config (dict) – The Data Configuration for the table.
  • kwds (dict) – The keyword dictionary for the table.
  • doc (docx.document.Document) – The document to insert the table into. The handler is responsible for invoking the add_table method.
  • style (str) – The name of the style to apply to the generated object.
  • image_log_name (str, Path-like or None) – The name of the image log to use if table data is to be logged. If log_images is off, this will be None. May be completely ignored by the handler if impractical or inappropriate to implement. The file name, if supplied, is provided without any extension.
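
A table handler following this signature might look like the sketch below. The handler name grid_table and the rows configuration key are invented for illustration. Only the add_table, cell, text and style members of python-docx are used, so any object with the same members can stand in for the document.

```python
def grid_table(config, kwds, doc, style, *, image_log_name=None):
    """Hypothetical table handler: one table row per entry of config['rows']."""
    rows = config["rows"]  # invented key: a list of equal-length lists
    table = doc.add_table(rows=len(rows), cols=len(rows[0]))
    table.style = style
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            # String cells are treated as format strings over the keywords.
            text = value.format(**kwds) if isinstance(value, str) else str(value)
            table.cell(r, c).text = text
    # image_log_name is ignored here, which the interface permits.
```

The function has no return statement, in keeping with the expectation that table plugins return nothing.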

Strings

Some built-in string handler examples can be found in imprint.handlers.string.

Handler Signature
handler(config, kwds)

Generate a string based on Data Configuration.

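Parameters:
  • config (dict) – The Data Configuration for the string.
  • kwds (dict) – The keyword dictionary for the string.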
Returns:

The newly created string. A None return value indicates an internal non-fatal error.

Return type:

str
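
A string handler can be very small. In the sketch below, the handler name summary and the template configuration key are invented for illustration. Returning None follows the documented convention for an internal non-fatal error.

```python
def summary(config, kwds):
    """Hypothetical string handler: fill a configured template with keywords."""
    template = config.get("template")  # 'template' is an invented key
    if template is None:
        return None  # documented signal for an internal non-fatal error
    return template.format(**kwds)
```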

Errors

Since plugins implement a subset of the tag-processing functionality, the same rules apply to plugin errors as to generic tag errors. See Errors in the XML Tag API section.

Builtin Plugins

Imprint is packaged with a small number of pre-defined builtin plugin Handlers for general purpose use. In addition to being useful on their own, these plugins provide a starting point for advanced users wishing to write their own. Handlers are grouped into sub-packages according to the tag they support.

Figures

imprint.handlers.figure is the root package for built-in Handlers for inserting figures into a document.

All the handlers in this module are compatible with the plugin interface used by the <figure> tag. This package exposes all the handlers defined in its submodules.

imprint.handlers.figure.ImageFile(config, kwds, output=None)

Generate python-docx compatible images from image files.

Copy image files as-is, or load them into memory. Output must be to a file of the same type as the input (except for PDFs): no conversion is done, only direct copy. PDFs (identified by the '.pdf' extension) get special handling to convert them into usable images.

The following Data Configuration keys are used:

file
The (mandatory) file name containing the image.
formatted
Whether or not file is a format string that has keyword replacements in it. Defaults to truthy. Set to falsy if the name contains random opening braces.

Notes

Using this plugin with PDF files requires the poppler library mentioned in the External Programs section.

Submodules

imprint.handlers.figure.images contains basic built-in Handlers for inserting images into a document. All the handlers in this module are compatible with the plugin interface used by the <figure> tag.

Tables
Submodules
Strings

imprint.handlers.string is the root package for built-in Handlers for inserting strings into a document.

All the handlers in this module are compatible with the plugin interface used by the <string> tag. This package exposes all the handlers defined in its submodules.

imprint.handlers.string.TextFile(config, kwds)

Generate a string directly from the contents of a text file.

Text files are inserted literally, with no styling information beyond that of the <string> tag that triggered the plugin. Newlines are not preserved.

The following Data Configuration keys are used:

file
The (mandatory) file name.
formatted
Whether or not file is a format string that has keyword replacements in it. Defaults to truthy. Set to falsy if the name contains random opening braces.
Submodules

imprint.handlers.string.strings contains the basic built-in Handlers for inserting strings into a document. All the handlers in this module are compatible with the plugin interface used by the <string> tag.

Utilities

imprint.handlers.utilities contains common utilities for handlers. Users wishing to write their own handlers may want to use these functions to facilitate a uniform interface. Existing handlers in this package use these functions as well.

imprint.handlers.utilities.get_key(config, kwds, key, default=None, formatted='formatted', missing_ok=True)

Retrieve the value of key from the mapping config.

If key does not exist in config, return default instead.

If formatted is a string, it determines the key name that determines whether key is a format string or not (default is yes). Otherwise, it is interpreted as a boolean directly.

Parameters:
  • config (dict) – The Data Configuration dictionary to search.
  • kwds (dict) – The Keywords dictionary to use for replacements if formatted turns out to be truthy.
  • key (str) – The name of the key in config containing the required value.
  • default – The value to return if key is missing from config.
  • formatted (str or bool) – Either the name of the key to get the formatted flag from (if a string), or the flag itself. In either case, ignored if the value is not a string.
  • missing_ok (bool) – If truthy, missing values are replaced by default. Otherwise a KeyError is raised.
Returns:

The value in config associated with key, optionally formatted with kwds.
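
The documented behavior of get_key can be approximated with the stand-alone sketch below. It is a re-implementation written from the description above, not the library source. The use of str.format with the keywords, and the formatting of a string default, are both assumptions.

```python
def get_key_sketch(config, kwds, key, default=None, formatted="formatted",
                   missing_ok=True):
    """Approximate the documented behavior of get_key."""
    if key in config:
        value = config[key]
    elif missing_ok:
        value = default
    else:
        raise KeyError(key)
    if isinstance(formatted, str):
        # A string names the config key holding the flag; formatting defaults to on.
        formatted = config.get(formatted, True)
    if formatted and isinstance(value, str):
        value = value.format(**kwds)
    return value
```

For example, a configuration of {'file': '{sn}.csv'} with keywords {'sn': 'A7'} yields 'A7.csv', while adding 'formatted': False to the configuration returns the raw string.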

imprint.handlers.utilities.get_file(config, kwds, key='file', default=None, formatted='formatted', missing_ok=False)

A wrapper around get_key that sets the default key to 'file' and forbids missing keys.

imprint.handlers.utilities.normalize_descriptor(descriptor, key, copy=False)

If descriptor is a mapping, return it as-is; otherwise, turn it into a value in a mapping keyed by key.

If the descriptor is returned as-is, it can optionally be copied by setting copy to True.
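
The normalization rule is simple enough to sketch directly. The version below is written from the description above rather than copied from the library, and the choice of a shallow copy is an assumption.

```python
from collections.abc import Mapping


def normalize_descriptor_sketch(descriptor, key, copy=False):
    """Return a mapping: `descriptor` itself, a copy of it, or {key: descriptor}."""
    if isinstance(descriptor, Mapping):
        # Shallow copy on request (assumption); otherwise hand back the same object.
        return dict(descriptor) if copy else descriptor
    return {key: descriptor}
```

This lets a configuration entry be written either as a bare value, such as a file name, or as a full dictionary of options.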

XML Tag API

The Imprint engine comes with a complete set of processors for the tags specified in the XML Template Specification. However, additional tags may be necessary for highly customized applications, so an API exists for defining and registering new tags. The API is defined in the imprint.core.tags module. Example usage can be found in the Writing Custom Tags tutorial.

Tag Descriptors

The tag API revolves around the TagDescriptor class. The class can be extended directly, or instantiated through a delegate object that fulfills the necessary duck-type API. Objects contain a set of attributes and two callbacks that define how to handle XML tags of a given type. All the elements are optional and have sensible default values.

Any registered object will be viewed through TagDescriptor.wrap, so it is not necessary to extend or instantiate TagDescriptor to create a working tag descriptor.
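
A delegate can therefore be an ordinary object. The sketch below defines a hypothetical NoteTag delegate and exercises its callbacks by hand. The SimpleNamespace only stands in for the engine state, and the notes and closed attributes are invented for the example. Nothing here is registered with Imprint.

```python
from types import SimpleNamespace


class NoteTag:
    """Hypothetical delegate for a <note> tag; not a TagDescriptor subclass."""
    content = True            # textual content is expected
    tags = False              # nested tags are not allowed
    required = ("author",)
    optional = {"priority": "normal"}

    def start(self, state, name, attr):
        state.notes.append(f"{attr['author']} ({attr['priority']})")

    def end(self, state, name, attr):
        state.closed += 1


# Stand-in for the engine state, used only to exercise the callbacks.
state = SimpleNamespace(notes=[], closed=0)
tag = NoteTag()
attr = {"author": "jdoe", "priority": "high"}
tag.start(state, "note", attr)
tag.end(state, "note", attr)
```

Any attribute the delegate leaves out would fall back to the defaults described for TagDescriptor.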

Errors

Tag descriptors may raise any type of error they deem necessary in their start and end methods. Most classes of errors will be logged and cause the application to abort. However, two special classes of errors will not cause a fatal crash:

  1. KnownError is used to flag known conditions that can be handled gracefully by the tag.
  2. OSError. Specifically, the FileNotFoundError and PermissionError subclasses are deemed to be “known errors”. If they represent a fatal condition, they should be wrapped in another exception type.

Any plugins with a dynamic Data Configuration will generally receive an alt-text placeholder where the content would normally go instead of completely aborting.

exception imprint.core.KnownError

A custom exception class that is used by the engine to indicate that a tag or plugin handler exited for a known reason.

In cases where this exception is logged, the message is printed without a stack trace.

Configuration

Tags have two types of configuration available to them. Static configuration for a given XML Template is provided through the tag attributes in the XML file. Dynamic configuration through the IDC File can be enabled to provide per-document fine-tuning.

XML Attributes

XML attributes are supplied to the start and end methods of a TagDescriptor as the second argument. The inputs are presented to both methods as a vanilla dict. The dictionary is meant to be treated as read-only, but this is not a requirement, meaning that technically start can modify what end sees. The dictionary is filtered to exclude any attributes that are not listed in the required and optional elements of the TagDescriptor.

Data Configuration

For some types of content, static configuration is not enough. To allow per-document configurations, a TagDescriptor must define a non-None data_config attribute. This attribute gives the name of the dictionary to extract from the IDC File.

start and end methods of a TagDescriptor with the data_config attribute set will receive an additional input argument containing the Data Configuration loaded from the IDC File.

The data configuration can override some of the static XML Attributes of a tag. For built-in tags, the XML Template Specification notes which attributes can be overridden. Built-in tags that support dynamic configuration are <figure>, <table> and <string>.

All built-in tags that support dynamic configuration also support a type of plugin, but this is not a requirement for custom tags.

References

A TagDescriptor is referenceable if it has a non-None reference. A reference made to a tag will be substituted by the appropriate reference text. By default reference tags have the target tag name with “-ref” appended: <figure-ref> references <figure>, <table-ref> references <table>. A notable exception is <segment-ref>, which references paragraphs (<par> tags), but only ones that have a heading style.

References are usually identified by a required id attribute. Segments can also be identified by the title of the segment, which is the aggressively trimmed collection of all the text in the paragraph. For example, the title of the following XML snippet would be 'Example Heading':

<par style="Heading 3">
    <run style="Default Paragraph Font">
        Example
        Heading
    </run>
</par>

<segment-ref> tags can therefore identify their target with either an id or a title attribute. User-defined tags can implement their own customized rules for identifying targets.
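
The title of the heading snippet shown earlier in this section can be reproduced with the standard library. The exact trimming rule used by the engine is not spelled out here, so collapsing every run of whitespace to a single space is an assumption that happens to give the documented result. The name segment_title is hypothetical.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<par style="Heading 3">
    <run style="Default Paragraph Font">
        Example
        Heading
    </run>
</par>
"""


def segment_title(element):
    """Join all nested text and collapse every run of whitespace to one space."""
    return " ".join("".join(element.itertext()).split())


title = segment_title(ET.fromstring(SNIPPET))
print(title)  # Example Heading
```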

Roles

For the purpose of creating references, any tag may impersonate, or play the role of, any other tag using a special role attribute. This attribute is implicitly optional for every tag. It is interpreted directly by the parsers in the Engine Layer to determine the type of reference that a tag will represent.

For example, a <table> tag (or any other tag for that matter), which has role="figure" must be referenced by a <figure-ref> tag, not a <table-ref> tag, in the XML Template. That table will be a figure for the purposes of the document in question.

Any arbitrary tag can be referenced the same way with the appropriate role. Usually, such a referenceable tag will be styled appropriately, and will have the headings, captions, etc. appropriate for its role rather than its nominal tag.

A specific case is arbitrary tags that have a <par> role. Such tags are automatically referenceable by <segment-ref>. Their entire contents will be treated as the title of the heading, so the par role must be used carefully.

Registering New Tags

Once a TagDescriptor or a delegate object has been constructed, there are two main ways to get Imprint to use the descriptor for actual tag processing.

Via Configuration

In the normal course of things, Imprint will not automatically import unspecified user-defined modules. To let it know where to find tag extensions, add them, by name or by reference, to the mapping in the tags keyword of the IPC File. This will automatically import all the necessary modules, and register the custom descriptor under the requested tag name.

Programmatically

Under the hood, tags are registered with the Imprint core simply by adding them to tag_registry:

tag_registry[name] = descriptor

The registry is a special mapping that ensures that name is a string not representing an existing tag. While it is not possible to remove or overwrite existing tags, the same descriptor can be registered under multiple names.

This method is useful mostly to users wishing to write a custom driver program for the engine. Under normal circumstances, the configuration solution will be more suitable.

Engine State

Both callbacks of a TagDescriptor accept an EngineState object as their first argument, which supports stateful tag processing. The engine state provides a mutable container for arbitrary attributes. Each TagDescriptor can add, remove and modify attributes of the state object to communicate with itself, the engine, and other tags.

As a rule, objects should prefer to delete state attributes rather than setting them to None. This meshes well with the fact that EngineState provides a containment check. For example, to check if the parser is in the middle of a run of text, descriptors should check

if 'run' in state: ...

The built-in tags and the engine use a set of attributes and methods to operate properly. Modifying these predefined attributes in a way other than explicitly documented will almost inevitably lead to unexpected behavior. Properties are used instead of simple attributes in a few cases to provide sanity checks for the supported modifications. Custom tags can add, remove and modify any additional attributes they choose. The full list of built-in attributes is available in the EngineState documentation.
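
The delete-rather-than-None convention pairs naturally with a containment check. The class below is a hypothetical stand-in for EngineState, reduced to that one feature. The real class provides many more attributes and methods.

```python
class StateSketch:
    """Hypothetical stand-in for EngineState: attributes plus a containment check."""

    def __contains__(self, name):
        return hasattr(self, name)


state = StateSketch()
state.run = "current run object"   # entering a run
inside = "run" in state
del state.run                      # leaving: delete rather than assign None
outside = "run" in state
```

Had the attribute been set to None instead of deleted, the containment check would still have reported it as present.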

The API

The imprint.core package contains the Imprint Engine Layer. The tags and state modules implement most of the functionality useful to end-users through the public XML Tag API. The parsers and utilities contain the Internal API.

The imprint.core.tags module implements the base XML Tag API, as well as all the predefined Built-in Tag Descriptors and Reference Descriptors.

The following members are used to construct and register new tags:

imprint.core.tags.tag_registry = {}

A limited mapping type that contains all the currently registered tag descriptors.

Registering a new descriptor is as easy as doing:

tag_registry[name] = descriptor

The registry is a restricted mapping type that supports adding new elements only if they are not already registered. Existing elements can not be deleted. Deletion operations will raise a TypeError, while overwriting existing keys will raise a KeyError. Aside from that, all operations supported by dict are allowed (including things like update).
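
The restrictions can be modeled with a small add-only mapping. The class RegistrySketch below is illustrative and is not the actual registry type. Raising TypeError for non-string names is an assumption, while the KeyError and TypeError for overwriting and deleting follow the description above.

```python
from collections import UserDict


class RegistrySketch(UserDict):
    """Add-only mapping mimicking the documented tag_registry errors."""

    def __setitem__(self, name, descriptor):
        if not isinstance(name, str):
            raise TypeError("tag names must be strings")  # assumed error type
        if name in self.data:
            raise KeyError(f"tag {name!r} is already registered")
        self.data[name] = descriptor

    def __delitem__(self, name):
        raise TypeError("registered tags cannot be removed")
```

Because UserDict routes update, pop and clear through these two methods, the same restrictions apply to them as well.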

Any tag that is referenceable by design (has a valid reference attribute) will have the ReferenceDescriptor’s registration hook invoked after the tag-proper is registered.

The built-in tags are registered when the current module is imported.

imprint.core.tags.referable_tags

A convenience property to compute a list of all the keys that have a non-None reference attribute in their values.

class imprint.core.tags.TagDescriptor(delegate)

The basis of the tag API.

Instances of this class contain the information required to process a custom tag. They must contain all of the attributes listed below, with the expected types. The elements in tag_registry may be delegate objects that supply only part of the attribute set. In that case, they are wrapped in a proxy as needed at runtime, never up-front. The reason for this is twofold:

  1. There may be stateful objects registered for multiple tags, and wrapping in a proxy will not allow the tags to share state. This would not be a problem, except it would be unexpected behavior.
  2. Some of the attributes may be dynamic properties (or other descriptors). Fixing the value once would completely defeat such behavior.

Creating an occasional wrapper around a delegate is not expected to be particularly expensive, even if it had to be done for every tag encountered in the XML file. On the other hand, it allows for some very flexible behaviors. At the same time, very few instances of wrapping should occur, since most tags will be implemented by extending this class and implementing it properly. The wrap method ensures that all extensions are passed through as-is.

All the Built-in Tag Descriptors are instances of children of this class.

content

A tri-state bool flag indicating whether the tag is allowed/expected to have textual content or not. The values are interpreted as follows:

None
The tag may not have any content. It must be of the form <tag/> or <tag><otherTag>...</otherTag></tag>. Anything else will raise a fatal error. If tags is set to False, only the former form is allowed.
False
The tag should not have content, but content will not raise an error. A warning will be raised instead.
True
The tag is expected to have content, but the content may be empty.

Any value is allowed in a delegate. If defined, the value will be converted to bool if it is not None. Defaults to None if not defined.

tags

A bool indicating whether or not nested tags are allowed within this one.

Any value is allowed in a delegate. If defined, the value will be converted to bool. Defaults to True if not defined.

required

A tuple of strings containing the name of required tag attributes. A tag encountered without all of these attributes will raise an error.

In a delegate, this may be a single string, an iterable of strings, None or simply omitted. Every element of an iterable must be a string, or a TypeError is raised immediately during construction. Defaults to an empty tuple if not defined.

optional

A dictionary mapping the names of optional attributes to their default values. Optional attributes are ones that are expected to be present in processing, but have sensible defaults that can be used, meaning that they do not have to be specified explicitly in the XML Template.

In a delegate, this may be any mapping type, an iterable of strings, a single string, None or simply omitted. In the case of an iterable or individual string, all the defaults will be None. Iterables and mapping keys must be strings, or a TypeError will be raised during construction. Defaults to an empty dict if not defined.
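
The accepted forms of the optional attribute can be summarized in a short normalization routine. The function normalize_optional is a hypothetical sketch of the coercion described above, not the constructor's actual code.

```python
from collections.abc import Mapping


def normalize_optional(value):
    """Coerce a delegate's `optional` attribute into a dict of defaults."""
    if value is None:
        return {}
    if isinstance(value, str):
        return {value: None}  # a single name with a default of None
    if isinstance(value, Mapping):
        result = dict(value)
    else:
        result = {name: None for name in value}
    for name in result:
        if not isinstance(name, str):
            raise TypeError(f"attribute name {name!r} is not a string")
    return result
```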

data_config

The name of the attribute containing the data configuration name for the tag. This should only be provided for tags that require Data Configuration. If provided, the named attribute will automatically be added to the required sequence.

In a delegate, this object must be an instance of str or None. Defaults to None if not defined.

reference

A ReferenceDescriptor that is only present if this type of tag can be the target of a reference.

Examples of referenceable built-in tags are <figure>, <table> and sometimes <par>. Referenceable tags can have an optional role attribute that changes the type of reference they represent. See the Roles description for more information.

In a delegate, this object must be an instance of ReferenceDescriptor or None. Defaults to None if not defined.

__init__(delegate)

After completion, this instance has all of the required attributes defined in the delegate, wrapped in the required types.

A reference to the delegate object is not retained. This method can be invoked multiple times. It updates the current descriptor with the attributes of the delegate, leaving undefined attributes in the delegate untouched.

static __new__(cls, *args, **kwargs)

Create an empty instance, with all required attributes set to default values.

This method is provided to allow bypassing the default __init__ in child classes. All arguments are ignored.

end(state, name, attr, *args)

Each descriptor should provide a method with this signature to process closing tags.

If implemented, this method must accept the Engine State, a tag name and a dict of attributes. Normally, the tag name is ignored since a separate descriptor is registered for each tag. The attributes are the same as those passed to start, barring any modifications made in start.

Descriptors that have a non-None data_config attribute set will receive an additional argument containing the Data Configuration.

The default implementation just logs itself.

start(state, name, attr, *args)

Each descriptor should provide a method with this signature to process opening tags.

If implemented, this method must accept the Engine State, a tag name and a dict of attributes. Normally, the tag name is ignored since a separate descriptor is registered for each tag.

Descriptors that have a non-None data_config attribute set will receive an additional argument containing the Data Configuration.

The default implementation just logs itself.

classmethod wrap(desc)

Construct a proxy from the descriptor if it isn’t already one.

This method is provided so that when TagDescriptor objects are implemented properly up front, they do not need to be wrapped in an additional layer.

If the input is a delegate, the return value will always be of the type that this method was invoked on. However, the type check will always be done against the base TagDescriptor class.

class imprint.core.tags.BuiltinTag(delegate=None, **kwargs)

Bases: imprint.core.tags.TagDescriptor

The base class of all the built-in TagDescriptor implementations.

Custom tag implementations are welcome to use this class as a base instead of a raw TagDescriptor.

__init__(delegate=None, **kwargs)

Updates the required fields with the keywords that are passed in.

If no delegate object (or None) is supplied, bypass the default constructor (see TagDescriptor.__new__). kwargs will override any defaults and attributes set by a delegate.

Built-in Tag Descriptors

The existing tag descriptors implement the XML Template Specification:

class imprint.core.tags.BreakTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <break> tag.

end(state, name, attr)

Insert a page break into the document.

class imprint.core.tags.ExprTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <expr> tag.

Warning

This descriptor uses eval to execute arbitrary code and assign it to a new keyword. Use with extreme caution!

end(state, name, attr)

Evaluate the expression found inside the tag, and add a new entry to the state’s keywords.

The content_stack will be popped.

All errors in importing and evaluation will be propagated up and will terminate the parser.

start(state, name, attr)

Begin a new expression.

This just pushes a new content_stack entry in the state. All content until the closing tag will be evaluated as a set of Python statements.

class imprint.core.tags.FigureTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <figure> tag.

end(state, name, attr, config)

Generate and insert a figure based on the selected handler.

Figures can appear in a run, a paragraph, or on their own.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.KwdTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <kwd> tag.

start(state, name, attr)

Find the value of the keyword in the state’s keywords and place it into the current content.

If the keyword is not found, a KeyError will be raised. If the tag has a format attribute, it is interpreted as a format_spec, and used to convert the value. If the attribute is not present, the value is converted with a simple call to str.
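
The lookup and conversion rules above can be approximated with the built-in format function, which accepts a format_spec string. This is a minimal sketch, not the library's code: render_keyword and the sample keywords are hypothetical.

```python
def render_keyword(keywords, name, format_spec=None):
    """Return the text for a keyword, optionally applying a format_spec."""
    value = keywords[name]  # a missing keyword raises KeyError
    if format_spec is None:
        return str(value)   # no format attribute: plain str conversion
    return format(value, format_spec)


keywords = {'Voltage': 3.14159, 'Serial': 42}
plain = render_keyword(keywords, 'Serial')
formatted = render_keyword(keywords, 'Voltage', '.2f')
```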

class imprint.core.tags.LatexTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <latex> tag.

end(state, name, attr)

Convert the equation in the text of the current tag into an image using haggis.latex_util.render_latex, and insert the image into the parent tag.

The parent can be a run or a paragraph. If the requested run style does not match the current run, the current run will be interrupted by a run containing a new picture with the requested style, and resumed afterwards. If there is no run to begin with, a new run will be created, but not stored in the run attribute of the state.

Formulas are rendered at 96dpi in JPEG format by default.

start(state, name, attr)

Begin a new LaTeX formula.

Just push a new content_stack entry into state. All content until the closing tag is evaluated as a LaTeX document.

class imprint.core.tags.NTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <n> tag.

start(state, name, attr)

Add a line break to the current run.

If not inside a run, append the break to the last run. Make a new run only at the start of a paragraph. Ignore with a warning outside of a paragraph.

class imprint.core.tags.ParTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <par> tag.

check_list(state, attr)

Validate the list attribute that is found.

Log an error if the attribute is invalid, but do not terminate processing. The attribute is simply ignored if the list is neither numbered, bulleted nor continued.

Return the type normalized to a ListType, or None if not a list item. If the type is valid, and list-level is set, it is converted to an integer.

compute_paragraph_style(state, attr, list_type)

Compute the paragraph style based on whether an explicit style is set in the attributes, and whether or not the paragraph is a list.

  1. If an explicit style is requested, return it. Otherwise:
  2. If the paragraph is not a list, return the default paragraph style. Otherwise:
  3. If the previous paragraph is a list item in the same list (i.e., the current list-level attribute is non-zero), return the style of the previous paragraph. Otherwise:
  4. Return the default list item style.
Parameters:
  • state (EngineState) – The state is used to check for the previous item’s style in case #3.
  • attr (dict) – The tag attributes, used to check for an explicitly set style as well as for a style reset with list-level = 0.
  • list_type (ListType or None) – The type of the list, if a list at all, as returned by check_list.
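
The four-step decision above can be sketched as a plain function. This is an illustration under several assumptions: the function name, the 'style' attribute key, the two default style names, and the treatment of a missing list-level as "not a reset" are all hypothetical, and the previous paragraph's style is passed in directly instead of being read from the state.

```python
def choose_paragraph_style(attr, list_type, previous_style=None,
                           default_style='Normal',
                           default_list_style='List Paragraph'):
    """Pick a paragraph style following the documented order of checks."""
    # 1. An explicit style always wins (attribute name is an assumption).
    if 'style' in attr:
        return attr['style']
    # 2. Not a list: use the default paragraph style.
    if list_type is None:
        return default_style
    # 3. Same list as the previous item, unless list-level resets it to 0.
    if previous_style is not None and attr.get('list-level') != 0:
        return previous_style
    # 4. Otherwise fall back to the default list item style.
    return default_list_style
```
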
end(state, name, attr)

Terminate the current paragraph.

See end_paragraph in EngineState.

start(state, name, attr)

Terminate any existing paragraph, flush all text and start a new paragraph.

If the new paragraph is a list item, add the necessary metadata to it.

Issue a warning if an existing paragraph is found.

class imprint.core.tags.RunTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <run> tag.

end(state, name, attr)

Place any remaining text into the current run, and remove run attribute of state.

start(state, name, attr)

Create a new run, ensuring that there is a paragraph to go with it.

Creating a run outside a paragraph raises a warning and creates a paragraph with a default style. See imprint.core.state.EngineState.new_run.

class imprint.core.tags.SectionTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <section> tag.

start(state, name, attr)

Begin a new section in the document, optionally altering the page orientation.

class imprint.core.tags.SkipTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <skip> tag.

class imprint.core.tags.StringTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <string> tag.

end(state, name, attr, config)

Generate a string based on the appropriate handler.

If the log_images key is set to a truthy value in state.keywords, the content will also be dumped to a file.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.TableTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <table> tag.

end(state, name, attr, config)

Generate and insert a table based on the selected handler.

The handler creates the table directly in the document (unlike for figures, where only the final product is inserted). Any error that occurs mid-processing leaves a stub table in the document in addition to the automatically-inserted alt-text.

Tables appear on their own, outside any paragraph or run, so if a table is nested in a run or paragraph, a warning will be issued. Any interrupted run or paragraph resumes after the table with their prior styles.

start(state, name, attr, config)

Just log the tag.

class imprint.core.tags.TocTag(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <toc> tag.

end(state, name, attr)

Terminate and insert the TOC.

Gather any text that has been acquired into the heading, which will be a separate paragraph preceding the TOC.

If the TOC interrupted an existing paragraph, a new paragraph will be resumed with the same style as the original. If a run style is present as well, a run will be recreated too.

start(state, name, attr)

Create a new TOC.

Log a warning if the tag appears within a paragraph. Truncate the paragraph, and resume with the prior style. The same happens to the current run, if there is one.

class imprint.core.tags.ReferenceProcessor(delegate=None, **kwargs)

Bases: imprint.core.tags.BuiltinTag

Implements the <figure-ref> and <table-ref> tags.

This processor is not registered explicitly. It gets added by all of the target tags that use it as part of their registration process. Registering this processor under a name that does not end in '-ref' will lead to a runtime error in resolve.

end(state, name, attr)

Insert a string with the specified reference into the current content.

classmethod get_instance()

Returns a quasi-singleton instance of the current class.

This instance is not exposed directly, but it is registered by the built-in referenceable tags.

resolve(state, name, attr)

Overridable operation for fetching and logging the reference that is to be inserted.

The default is to look up the reference by 'id' in the references of the imprint.core.state.EngineState.

Used by the default implementation of end.

class imprint.core.tags.SegmentRefProcessor(delegate=None, **kwargs)

Bases: imprint.core.tags.ReferenceProcessor

Implements the <segment-ref> tag.

This is a special case of ReferenceProcessor that allows access by both title and id. Its references always resolve to a <par> tag, or a tag playing that role.

resolve(state, name, attr)

Resolve a segment reference by either text or ID.

Either the id or title tag attribute must be present. If both are present, they must resolve to the same heading in the document or an error is raised.

Reference Descriptors

class imprint.core.tags.ReferenceDescriptor(prefix, identifiers='id')

Defines the process for creating References and using them through the appropriate tag.

References are made by processing the XML Template and mapping out any referenceable tags using the start and end methods. In the default implementation, the reference text is created by the make_reference method, invoked from end.

start and end return a boolean value to allow custom tags to be processed selectively. A return value of False from either method means that the specific instance of the tag being processed is not a valid reference target. Normally both methods always return True, but for the builtin <par> tag, for example, an exception must be made.

References are placed into the document by a special TagDescriptor, which is generally registered along with the parent tag that contains a ReferenceDescriptor using the register method.

Current references are purely textual, rather than having a dynamic field assigned to them. This is still a work in progress.

prefix

The prefix that normally gets prepended to the reference text. Used by make_reference to construct the output string. Extensions are welcome to ignore this attribute.

identifiers

A string or iterable of strings that lists the attributes that are used to identify the target for this reference type. The attribute may be either required or optional for the target tag, but it must be recognized either way. This attribute is used to check for attributes on tags with a non-default role. Defaults to 'id'.

end(state, name, role, attr)

Process the closing tag for a referenceable tag.

The default is to add the reference to the appropriate map in references by ID, based on the role, and log the operation. The attribute id is required.

The actual reference is created by make_reference.

Returns True if the tag is definitely a reference target, False if not.

identifiers

Ensure that identifiers is read-only.

make_reference(state, role, attr)

Returns a string referring to the specified tag in the specified role.

Keep in mind that the ReferenceDescriptor is selected based on the role, not necessarily the tag name. Therefore, the role argument should always be the “computed” role: the name of the tag should be overridden by the value of the attribute, if it was specified.

register(registry, name, descriptor)

A registration hook that is invoked when the parent TagDescriptor is registered.

The default implementation registers an additional TagDescriptor under the name name + '-ref', which replaces the <name-ref/> tag with the formatted reference. See ReferenceProcessor.

Parameters:
  • registry – The tag registry that the parent TagDescriptor is being inserted into. See tag_registry for details on the interface.
  • name (str) – The name under which the parent tag is being registered.
  • descriptor – The parent object being registered, not necessarily a TagDescriptor. The TagDescriptor.wrap method can be used to retrieve the corresponding TagDescriptor if necessary.
set_reference(state, role, attribute, key, reference, duplicates=False)

Check that the reference identified by key does not already exist and set it.

Duplicate reference targets cause an error, unless duplicates is True, in which case a warning is logged and the new value is discarded.

start(state, name, role, attr)

Process the opening tag for a referenceable tag.

The default is to log the tag and its role.

Returns True if the tag is a potential reference target, False if it is definitely not.

class imprint.core.tags.SegmentReferenceDescriptor(prefix, identifiers='id')

Bases: imprint.core.tags.ReferenceDescriptor

Extension of ReferenceDescriptor to accumulate heading text and allow references through the title attribute.

Used by <par> tags to create heading references.

heading_style_name_pattern

A class-level regular expression for identifying the <par> tags that represent referenceable headings.

end(state, name, role, attr)

Create a dual reference based on the title and optional ID in addition to the default logging.

identifiers

Ensure that identifiers is read-only.

make_reference(state, role, attr, title)

Add the section heading to the usual reference text.

register(registry, name, descriptor)

Register a SegmentRefProcessor for the <segment-ref> tag.

This registration hook uses a fixed name, so can only be called once.

set_reference(state, role, attribute, key, reference, duplicates=False)

Check that the reference identified by key does not already exist and set it.

Duplicate reference targets cause an error, unless duplicates is True, in which case a warning is logged and the new value is discarded.

start(state, name, role, attr)

Start accumulating content in addition to the default logging.

If an actual <par> tag is encountered (as opposed to a tag playing that role), and the heading matches Heading \d+, the current heading is incremented in the state.

If any heading tag, or any tag with role="par" is encountered, a new reference will be created. Non-heading paragraphs with no explicit role are non-referenceable. A non-heading paragraph can be made referenceable by explicitly setting the role.

Keep in mind that the title for a segment reference is accumulated from all the text in the paragraph. Use carefully with non-default tags.

Utility Functions

imprint.core.tags.get_key(key, attr, data, sentinel=None, default=None)

Resolve the value of key with respect to attr, but with the option to override by the data configuration dictionary.

If the final value is sentinel, return default instead. Return default if key is missing entirely as well. Both attr and data must be mapping types that support a get method.
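
The resolution order described above can be sketched with plain dictionaries. This is not the library's implementation: get_key_sketch is a hypothetical name, and comparing against the sentinel by identity (rather than equality) is an assumption.

```python
def get_key_sketch(key, attr, data, sentinel=None, default=None):
    """Look up key in data first, then attr, falling back to default."""
    missing = object()                 # private marker for "not found"
    value = data.get(key, missing)     # the data configuration overrides
    if value is missing:
        value = attr.get(key, missing)
    if value is missing or value is sentinel:
        return default
    return value


from_attr = get_key_sketch('width', {'width': '2in'}, {})
overridden = get_key_sketch('width', {'width': '2in'}, {'width': '3in'})
absent = get_key_sketch('width', {}, {}, default='1in')
sentinel_hit = get_key_sketch('width', {'width': None}, {}, default='1in')
```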

imprint.core.tags.get_size(attr, data, key='size')

Convert a string, number or pre-constructed size to a docx.shared.Length object, using get_key for value resolution.

Common options for key are 'width' and 'height'.

Valid unit suffixes are ", in, cm, mm, pt, emu, twip. Default when no units are specified is inches (").
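
To make the unit handling concrete, here is a minimal stand-alone sketch that converts the same suffixes to English Metric Units (EMU), the integer unit that docx.shared.Length is based on. It returns a plain int instead of a Length, and the function name, regular expression and error handling are assumptions rather than the library's behavior.

```python
import re

# EMU per unit, matching the constants used by python-docx lengths.
EMU_PER_UNIT = {
    '"': 914400, 'in': 914400, 'cm': 360000, 'mm': 36000,
    'pt': 12700, 'emu': 1, 'twip': 635,
}

_SIZE = re.compile(r'^\s*([0-9]*\.?[0-9]+)\s*(["a-z]*)\s*$', re.IGNORECASE)


def size_to_emu(value):
    """Convert a number or a string such as '2in' to an integer EMU count."""
    if isinstance(value, (int, float)):
        number, unit = float(value), 'in'      # bare numbers mean inches
    else:
        match = _SIZE.match(value)
        if match is None:
            raise ValueError('unrecognized size: {!r}'.format(value))
        number = float(match.group(1))
        unit = match.group(2).lower() or 'in'  # no suffix means inches
    if unit not in EMU_PER_UNIT:
        raise ValueError('unknown unit: {!r}'.format(unit))
    return round(number * EMU_PER_UNIT[unit])
```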

imprint.core.tags.get_handler(tag, id, attr, data, logger, key='handler')

Retrieve and load the handler for the specified attribute mapping and data configuration.

If the handler can not be found, a detailed exception is logged and a KnownError is raised.

imprint.core.tags.get_and_run_handler(tag, id, attr, data, logger, args, kwargs=None, key='handler')

Load and run the handler for the specified attribute mapping and data configuration.

If the handler can not be found, a detailed exception is logged, as with get_handler.

All exceptions that occur during execution are converted into KnownError.

imprint.core.tags.compute_styles(attr, data, defaults)

Compute the required styles based on attr and data configurations.

Style keys are taken from the keys of defaults, while values provide the fallback names used if the keys do not appear in either attr or data. Similarly named keys in data will override ones in attr.
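
The precedence described above (data over attr over defaults) can be sketched in a few lines. The function name and the sample style keys are hypothetical; only the ordering comes from the description.

```python
def compute_styles_sketch(attr, data, defaults):
    """Resolve each style key: data first, then attr, then the fallback."""
    styles = {}
    for key, fallback in defaults.items():
        if key in data:
            styles[key] = data[key]
        elif key in attr:
            styles[key] = attr[key]
        else:
            styles[key] = fallback
    return styles


styles = compute_styles_sketch(
    attr={'style': 'Figure', 'pstyle': 'Caption'},
    data={'pstyle': 'Wide Caption', 'unrelated': 'ignored'},
    defaults={'style': 'Default Paragraph Font', 'pstyle': 'Normal',
              'tstyle': 'Table Grid'},
)
```

Keys that are not listed in defaults are ignored in this sketch, since the style keys are taken from defaults alone.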

imprint.core.tags.compute_size(tag, attr, data, logger, default_width=None, width_key='width', height_key='height')

Create a dictionary with keys width and height and values that are instances of docx.shared.Length.

Values are resolved according to the rules of get_key, with width_key and height_key as the inputs. String values may contain units, and will be parsed according to get_size.

If neither key is present in either configuration (or present but set to None), set the width to default_width. If that is None as well, return an empty dictionary.

Parser State Objects

The imprint.core.state module supplies the state objects that enable communication within the Engine Layer between the engine itself and the tags. The state is therefore crucial to the XML Tag API without being completely a part of it.

class imprint.core.state.EngineState(doc, keywords, references, log)

A simple container type used by the main parser to communicate document state to the tag descriptors.

Most of the state is dedicated to monitoring the status of the text acquisition from the XML. The engine and built-in tags rely on a set of attributes to function. A description of acceptable use of these attributes is provided here. Any other use may lead to unexpected behavior. Custom tags may define and use any attributes that are not explicitly documented as they choose.

This class allows for a containment check using the in operator in preference to hasattr.

doc

docx.document.Document

The document that is being built. Set once by the engine.

Implemented as a read-only property.

keywords

dict

The keywords configured for this document by the IPC File. Normally, this dictionary should be treated as read-only, but ExprTag can add new entries.

As a rule, keywords with lowercase names are system configuration options, while keywords that start with upper case letters affect document content.

Implemented as a read-only property.

references

ReferenceMap

A multi-level mapping type that allows references to be fetched by role and attribute. Access to this map is performed by providing a tuple (role, attribute, key). For example:

state.references['figure', 'id', 'my_figure']

The map’s values may be of any type, as long as they can be converted to the desired content using str.

The mapping is made immutable as soon as it becomes part of the state. The read-only lock is irreversible.

Implemented as a read-only property.

paragraph

docx.text.paragraph.Paragraph

A paragraph represents a collection of runs and other objects that make up a logical segment in a document. This attribute exists only when parsing a <par> tag. Usually set and unset by ParTag, but can be temporarily switched off and reinstated in response to other tags as well. end_paragraph deletes this attribute.
run

docx.text.run.Run

A run is a collection of characters with similar formatting within a paragraph. This attribute exists only when parsing a <run> tag. Usually set and unset by RunTag. end_paragraph deletes this attribute.
content

io.StringIO

A mutable buffer used by the engine to accumulate text from the XML Template.

Since whitespace needs to be trimmed rather aggressively from an XML file, this object gets an extra (non-standard) attribute:

content.leading_space

Indicates whether or not to prepend a space when concatenating this buffer with others. In general, the text of the first run in a paragraph is the only one that does not have this attribute set to True. This flag is set on the buffer rather than the state object itself so that buffers can be pushed and popped into the content_stack to handle nested tags.

This attribute should be manipulated mostly through the new_content, get_content and flush_run methods.

This attribute must always be present, regardless of the position within the document.

Implemented as a read-write property that can not be deleted or set to None.

content_stack

collections.deque[io.StringIO]

A stack for nested content buffers. Each buffer represents a tag containing independent content. Some tags append to the parent’s buffer, some close the current buffer to start a new one and others, such as <figure>, use a temporary buffer for their content.

The stack allows for a theoretically indefinite level of nesting of text elements. In reality, it will only contain one or two elements: the current run text and the contents of interspersed tags like <figure>.

This attribute should be manipulated through the push_content_stack and pop_content_stack methods.

This attribute may be empty, but never missing. Implemented as a read-only property.

last_list_item

docx.text.paragraph.Paragraph

List items in Word are just paragraphs with a particular style and numbering scheme. All of this information can be gathered from the previous paragraph that was assigned a concrete list numbering instance.

This attribute should never be missing. It should only be None to indicate that no prior numbered paragraph has occurred in the document yet. To this end, it is implemented as a read-only property.

latex_count

int

A counter for the number of <latex> tags encountered so far. Used to generate the file name for the equations if Image Logging is enabled. Missing otherwise.

__contains__(name)

Checks if the specified name represents an attribute.

check_content_tail()

Include any remaining text in content into the last run of the last paragraph.

This ensures that paragraphs get truncated properly, and that spurious text between paragraphs is cleaned up.

A warning is issued if any non-whitespace text is found.

end_paragraph(tag=None)

Terminate the current paragraph.

Any existing run is immediately terminated. Spurious text is appended to the last available run. Both paragraph and run attributes are deleted by this method.

If there is no paragraph to terminate, this method is equivalent to calling check_content_tail.

Parameters: tag (str or None) – The name of a tag that interrupts the paragraph. If present, a warning will be issued. If omitted, no warning will be issued.
flush_run(renew=True, default='')

Flush the text buffer accumulating the current run into the document.

Text flushing aggressively removes whitespace from around individual lines. A single space character is prepended before the text if content.leading_space is True.

If not inside a run, this is a no-op.

Parameters:
  • renew (bool) – Whether or not to create a new text buffer when finished. This is generally a good idea, since the content will already be in the document, so the default is True. The new buffer has leading_space set to True.
  • default (str) – The text to insert if the current content buffer is empty. Defaults to nothing ('').
get_content(default='')

Retrieve the text in the current content buffer.

Whitespace is stripped from each line in the text, which is then recombined with spaces instead of newlines.

If the buffer is empty (or contains only whitespace), return default instead.

If the text is non-empty, and content has leading_space set to True, prepend a space.
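
The whitespace rules above can be sketched against a plain io.StringIO carrying the extra leading_space attribute. This is an approximation: get_content_sketch is a hypothetical free function, and dropping blank lines entirely is an assumption about how lines are recombined.

```python
import io


def get_content_sketch(buffer, default=''):
    """Strip each line, join with single spaces, honor leading_space."""
    stripped = (line.strip() for line in buffer.getvalue().splitlines())
    text = ' '.join(line for line in stripped if line)
    if not text:
        return default                  # empty or whitespace-only buffer
    if getattr(buffer, 'leading_space', False):
        text = ' ' + text
    return text


buffer = io.StringIO('  first line \n\n   second line\n')
buffer.leading_space = True             # extra, non-standard attribute
with_space = get_content_sketch(buffer)

buffer.leading_space = False
without_space = get_content_sketch(buffer)

blank = get_content_sketch(io.StringIO('  \n \n'), default='n/a')
```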

image_log_name(id, ext='')

Create an output name to log an image (or data), for a Data Configuration with the given ID, and an optional extension.

This is the standard name-generator for any component (tag descriptor or plugin handler) that enables image logging in response to log_images.

The base name is the result of concatenating an extension-less log_file (or output_docx if not set), with id, separated by an underscore. ext is appended as-is, if provided.
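
The naming rule above can be illustrated with os.path.splitext. This sketch assumes that log_file and output_docx are read from a keywords mapping, which the description does not state outright; the function name and file names are made up.

```python
import os.path


def image_log_name_sketch(keywords, id, ext=''):
    """Build '<base>_<id><ext>' from log_file, or output_docx as a fallback."""
    source = keywords.get('log_file') or keywords['output_docx']
    base = os.path.splitext(source)[0]   # drop the extension, keep the path
    return '{}_{}{}'.format(base, id, ext)


from_docx = image_log_name_sketch({'output_docx': 'out/report.docx'},
                                  'fig1', '.png')
from_log = image_log_name_sketch({'log_file': 'logs/run.log',
                                  'output_docx': 'out/report.docx'}, 'tbl')
```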

inject_par(style='Default Paragraph Font', pstyle='Normal', text='')

Insert a new paragraph into the document with the specified styles and text, and return it.

The contents of the paragraph will be a single run with the specified text. Any previously existing paragraph and run will be terminated (see end_paragraph) and reinstated with their prior styles once the new content is inserted.

Parameters:
  • style (str) – The name of the character style to use for the inserted run.
  • pstyle (str) – The name of the paragraph style to apply to the new paragraph.
  • text (str) – The optional text to insert into the new run.
Returns:

  • par (docx.text.paragraph.Paragraph) – The newly created paragraph. This will be a temporary object that is never set as paragraph.
  • run (docx.text.run.Run) – The newly created run. This will be a temporary object that is never set as run.

insert_picture(img, flush_existing=True, style='Default Paragraph Font', pstyle='Quote', **kwargs)

Insert an image into the current document.

Images must be inserted into a run, so the following cases are recognized:

Outside <par>
Create a new temporary Paragraph and a new Run. Neither object is retained (i.e. in paragraph and run).
Inside <par> but outside <run>
Create a new temporary Run, which will not be retained.
Inside <run>
If the requested style matches the style of the current run, it will be flushed and extended. Otherwise, the current run will be interrupted by a temporary run with the new style, and then reinstated.

It is an error to have a run outside a paragraph.

Parameters:
  • img (str or file-like) – The image can be the name of a file on disk, or an open file (including in memory files like io.BytesIO). In the latter case, the file pointer must be at the beginning of the image data.
  • style (str) – The name of the Character Style to apply to a new run.
  • pstyle (str) – The name of the Paragraph Style to apply if a new paragraph needs to be created.

Two additional keyword-only arguments can be supplied to add_picture: width and height.

interrupt_paragraph(warn=None)

A context manager for interrupting the current run/paragraph and resuming it when complete.

The current paragraph and run are ended before the body of the with block executes. They are reinstated afterwards, if they existed to begin with, with the same styles as before.

Parameters: warn (str, bool or None) – If a boolean, determines whether or not to issue a generic warning if a paragraph is actually interrupted. If a string, it is interpreted as the name of the tag that is interrupting the paragraph, and mentioned in the warning. No warning will be issued if falsy. Defaults to None.
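
The interrupt-and-resume pattern described above can be sketched with contextlib. MiniState is a hypothetical stand-in that tracks only a paragraph style and a history list; the real state also handles runs, content flushing and warnings, none of which is modeled here.

```python
from contextlib import contextmanager


class MiniState:
    """Hypothetical state that tracks only the current paragraph style."""

    def __init__(self):
        self.paragraph_style = None      # None means "no current paragraph"
        self.history = []

    def start_paragraph(self, style):
        self.paragraph_style = style
        self.history.append(('start', style))

    def end_paragraph(self):
        if self.paragraph_style is not None:
            self.history.append(('end', self.paragraph_style))
            self.paragraph_style = None

    @contextmanager
    def interrupt_paragraph(self):
        saved = self.paragraph_style     # remember the style to resume with
        self.end_paragraph()
        try:
            yield
        finally:
            if saved is not None:        # resume only if one was interrupted
                self.start_paragraph(saved)


state = MiniState()
state.start_paragraph('Body Text')
with state.interrupt_paragraph():
    state.history.append(('table', None))
```
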
log(lvl, msg, *args, **kwargs)

Provide access to the engine’s logging facility.

Usage is analogous to logging.log. XML location meta-data will be inserted into any log messages.

new_content(leading_space=None)

Update the content text buffer to a new, empty StringIO.

Calling this method is faster than doing a seek-truncate according to http://stackoverflow.com/a/4330829/2988730.

Parameters: leading_space (tri-state bool) – If None, copy leading_space from the current content. Otherwise, set to the provided value. The default is to copy the existing value.
new_run(tag, style='Default Paragraph Font', pstyle='Normal', check_in_par=True, keep_par=True)

Create a new run.

This method handles cases when a run is requested outside a paragraph, or inside an existing run:

  • Nested runs are forbidden, but run injection is not.
    • Existing content is flushed for injected runs.
  • Runs outside a paragraph will generate a temporary paragraph with a default style.
    • Missing paragraphs can optionally raise a warning.
    • The temporary paragraph can optionally be retained as the current paragraph.
Parameters:
  • tag (str) – The name of the tag requesting the run. If there is already a run attribute present, setting tag='run' will raise an error because of nesting.
  • style (str) – The name of the style to use for the new run.
  • pstyle (str) – The name of the style to use for a new paragraph, if one has to be created. Moot if there is already a paragraph attribute.
  • check_in_par (bool) – Whether or not to warn if not in a paragraph. Defaults to True.
  • keep_par (bool) – Whether or not to retain a newly created paragraph object in the paragraph attribute. Moot if there is already a paragraph attribute.
Returns:

  • par (docx.text.paragraph.Paragraph) – The paragraph that the run was added to. If keep_par is True or there was already a paragraph attribute set, this will be the paragraph attribute.
  • run (docx.text.run.Run) – The newly created run. This will be set to the run attribute unless there is no existing paragraph attribute, and keep_par is set to False.

Notes

Setting keep_par to False for a <run> tag outside a paragraph will cause a situation where run is set but paragraph is not. This may cause a problem for the engine, but should never arise with the builtin parsers.

number_paragraph(list_type, level)

Turn the current paragraph into a list item, and store it into last_list_item.

The exact numbering scheme depends on last_list_item, which will be updated to refer to the current paragraph when this method completes.

The following behaviors occur in response to list_type:

list_type   Behavior
None        Not a list paragraph. Do not set numbering or change last_list_item.
CONTINUED   Same type and numbering as last_list_item. Set last_list_item.
NUMBERED    Start a new numbered list. Set last_list_item.
BULLETED    Start a new bulleted list. Set last_list_item.
Parameters:
  • list_type (ListType or None) – The type of list to number with, if at all.
  • level (int or None) – The depth of the list indentation. None means to follow the level of the previous list item, if any, or use zero depth.
pop_content_stack()

Reinstate the previous level of the content_stack to the current content.

Calling this method on an empty stack will cause an error. The current content is completely discarded.

push_content_stack(flush=False, leading_space=False)

Temporarily create a new text buffer for the content.

If flush is True, the old buffer is flushed to the document and cleared before being pushed to the content_stack. If flush is False, the existing buffer is pushed unchanged. If the content is flushed, its leading_space attribute is set to True.

If the existing buffer is flushed, the buffer that will be reinstated when the new one is popped will have leading_space set to True.

The new buffer can have its leading_space attribute configured by the leading_space parameter, which defaults to False.
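
The push/pop behavior can be sketched with a deque of io.StringIO buffers. BufferStack and its flushed list (standing in for text written to the document) are hypothetical; only the ordering of operations follows the description above.

```python
import io
from collections import deque


class BufferStack:
    """Hypothetical model of content and content_stack."""

    def __init__(self):
        self.content = self._new(False)
        self.content_stack = deque()
        self.flushed = []                # stands in for the document

    @staticmethod
    def _new(leading_space):
        buffer = io.StringIO()
        buffer.leading_space = leading_space
        return buffer

    def push_content_stack(self, flush=False, leading_space=False):
        if flush:
            # Flush and clear the old buffer before it is pushed.
            self.flushed.append(self.content.getvalue())
            self.content = self._new(True)
        self.content_stack.append(self.content)
        self.content = self._new(leading_space)

    def pop_content_stack(self):
        # The current buffer is discarded; an empty stack raises IndexError.
        self.content = self.content_stack.pop()


stack = BufferStack()
stack.content.write('run text')
stack.push_content_stack()
stack.content.write('x^2')
inner = stack.content.getvalue()
stack.pop_content_stack()
outer = stack.content.getvalue()
```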

temp_run(style='Default Paragraph Font', pstyle='Normal', keep_same=False)

Create a temporary run in the current context.

The run and paragraph styles will be preserved after the context manager exits. If the run is injected outside a paragraph, a temporary paragraph will be created and forgotten.

Within the context manager, both paragraph and run are guaranteed to be set. run will have the style named by style, but paragraph will only have the style named by pstyle if it is a temporary paragraph.

All content is flushed into the temporary run when this manager exits.

Parameters:
  • style (str) – The style of the new run.
  • pstyle (str) – The style of a new paragraph to contain the run. Used only if paragraph is unset.
  • keep_same (bool) – If True, and a run already exists, and has the same style as this one, retain it instead of making a new one. If False (the default), always create a new run.
class imprint.core.state.ReferenceState(registry, log, heading_depth=None)

A simple container type used by the reference parser to communicate state to the reference descriptors and accumulate the reference map.

Most of the state is dedicated to monitoring referenceable tags and creating references to them. The engine and built-in tags rely on a set of attributes to function properly. A description of acceptable use of these attributes is provided here. Any other use may lead to unexpected behavior. Custom tags may define and use any attributes that are not explicitly documented as they choose.

This class allows for a containment check using the in operator in preference to hasattr.

registry

Mapping

A subtype of dict that follows the same rules as tag_registry. Normally a reference to that attribute.

Implemented as a read-only property.

references

ReferenceMap

A multi-level mapping type that allows references to be fetched and set by role and attribute. Access to this map is performed by providing a tuple (role, attribute, key). For example:

state.references['figure', 'id', 'my_figure']

The map’s values may be of any type, as long as they can be converted to the desired content using str.

The map is mutable at this stage in the processing. It accumulates all the referenceable tags found in the document. Setting a value for a key whose levels do not all exist yet is acceptable: the missing levels will be created.

Implemented as a read-only property.

heading_depth

int

The configured depth after which heading_counter stops having an effect when a subheading is entered. If omitted entirely (None), all available heading levels will be used.

Implemented as a writable property.

heading_counter

list[int]

A list containing counters for each heading level encountered. The list is popped back one element whenever a higher level heading is encountered. len(heading_counter) is the depth of the outline the parser is currently in. E.g., if the parser is parsing text under Section 3.4.5, heading_counter contains [3, 4, 5]. When Section 4 is encountered next, the counter will be reset to [4]. The heading may be referenced later by title or by ID.

A deque is not used because it does not support slice deletion, which makes jumping back a few heading levels much easier.

Implemented as a read-only property.

item_counters

dict[str -> int]

A mapping of the referenceable roles to the counters of items in the current heading. All the counters are reset to zero when a new heading below heading_depth is encountered.

Implemented as a read-only property. The keys of the mapping should not be modified, but the values may be.

content

io.StringIO

A mutable buffer used by the engine to accumulate text from the XML Template only when necessary.

This attribute should be manipulated mostly through the start_content and end_content methods. It should only be present for tags that care about accumulating content for a reference, like <par>. When present, all content, regardless of nested tags, will be accumulated.

__contains__(name)

Checks if the specified name represents an attribute.

end_content()

Terminate the current content buffer, if any, and return the content after aggressive stripping of whitespace.

If there is no content buffer to begin with, an empty string is returned.

format_heading(prefix=None, prefix_sep=' ', sep='.', suffix_sep='-', suffix=None)

Format heading_counter for display.

If suffix is set to a truthy value, only heading_depth items are shown. Otherwise, the entire list is shown.
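
The truncation rule can be sketched as a standalone function. This is a minimal illustration rather than Imprint's implementation: the counter and heading_depth are passed in explicitly instead of being read from the state, and the exact layout of the result (prefix, joined counters, then suffix) is an assumption based only on the parameter names in the signature above.

```python
def format_heading(counter, heading_depth=None, prefix=None, prefix_sep=' ',
                   sep='.', suffix_sep='-', suffix=None):
    """Render a heading counter such as [3, 4, 5] for display (sketch only)."""
    levels = counter
    # With a suffix, show at most heading_depth levels; otherwise show them all.
    if suffix and heading_depth is not None:
        levels = counter[:heading_depth]
    text = sep.join(str(n) for n in levels)
    if prefix:
        text = f'{prefix}{prefix_sep}{text}'
    if suffix:
        text = f'{text}{suffix_sep}{suffix}'
    return text


full = format_heading([3, 4, 5])
label = format_heading([3, 4, 5], heading_depth=2, prefix='Figure', suffix=7)
```

Under these assumptions, full is '3.4.5' and label is 'Figure 3.4-7'.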

get_content(default='')

Retrieve the text in the current content buffer.

Whitespace is stripped from each line in the text, which is then recombined with spaces instead of newlines.

If the buffer is non-existent, empty or contains only whitespace, return default instead.

heading_counter

Ensure that heading_counter is read-only.

heading_depth

Ensure that heading_depth is set to a legitimate value.

increment_heading(level)

Increment heading_counter at the requested level.

Any missing levels are set to 1 with a warning. Any further levels are truncated. item_counters is reset if heading_depth is unset or a greater value than level.
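
A minimal sketch of the counter update, assuming 1-based levels and a plain list in place of the state attribute. The function below is illustrative only and leaves out the item_counters reset. The warning text is invented for the example.

```python
import warnings


def increment_heading(counter, level):
    """Advance a list-based heading counter, in place, at a 1-based level."""
    if level < 1:
        raise ValueError('heading levels start at 1')
    # Pad skipped parent levels with 1, warning about each one.
    while len(counter) < level - 1:
        warnings.warn(f'heading level {len(counter) + 1} was skipped')
        counter.append(1)
    if len(counter) < level:
        # First heading at this depth.
        counter.append(1)
    else:
        counter[level - 1] += 1
        # Slice deletion drops all deeper levels at once.
        del counter[level:]
    return counter


counter = [3, 4, 5]
increment_heading(counter, 1)   # counter is now [4]
increment_heading(counter, 2)   # counter is now [4, 1]
```

The slice deletion on the last branch is the operation that a deque would not support, which is why a list is the natural container here.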

item_counters

Ensure that item_counters is read-only.

log(lvl, msg, *args, **kwargs)

Provide access to the engine’s logging facility.

Usage is analogous to logging.log. XML location metadata will be inserted into any log messages.

registry

Ensure that registry is read-only.

reset_counters()

Set all the values of item_counters to zero.

start_content()

Create a new content buffer.

If a buffer already exists, a warning is issued (even if it is empty), and its contents are discarded.

class imprint.core.state.ReferenceMap

A multi-level mapping that stores references in the values.

Values are accessed through a three-level key (role, attribute, key): For a given role, the type of key is determined by the attribute that names the target. Most tags only support attribute='id', but <segment-ref> also supports attribute='title'. key is the actual value of the attribute that is used to identify the reference.

Reference values can be any object whose __str__ method returns the correct replacement text for the reference.

__contains__(key)

Checks if this mapping has the specified partial key.

Key may be a single string or a tuple with a length between 1 and 3. Checks will be made for the appropriate depth.

__getitem__(key)

Retrieve the value for the specified three-level key.

static __new__(cls, *args, **kwargs)

Ensure that the map is unlocked when it is first created.

This way calling __init__ is not a trick for unlocking the map.

__setitem__(key, value)

If this mapping is not locked, set the attribute for the specified three-level key.

If any of the levels are new, they are created along the way.

__str__(indent=2)

Creates a pretty representation of this map, with indented heading levels.

lock()

Lock this mapping to prevent unintentional modification.

This is a one-time operation. There is no way to unlock. After locking, __setitem__ will raise an error.
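
The behavior described for this class can be illustrated with a toy mapping. The class below is a hypothetical stand-in, not the real ReferenceMap: it uses nested dicts, and the choice of TypeError for writes after locking is an assumption, since the documentation only says that an error is raised.

```python
class SketchReferenceMap:
    """Toy three-level mapping keyed by (role, attribute, key)."""

    def __init__(self):
        self._data = {}
        self._locked = False

    def __getitem__(self, key):
        role, attribute, name = key
        return self._data[role][attribute][name]

    def __setitem__(self, key, value):
        if self._locked:
            raise TypeError('map is locked')
        role, attribute, name = key
        # Missing intermediate levels are created on the way down.
        self._data.setdefault(role, {}).setdefault(attribute, {})[name] = value

    def __contains__(self, key):
        # Accept a bare string or a tuple of one to three elements.
        parts = (key,) if isinstance(key, str) else tuple(key)
        if not 1 <= len(parts) <= 3:
            return False
        level = self._data
        for part in parts:
            if part not in level:
                return False
            level = level[part]
        return True

    def lock(self):
        # One-way switch: there is deliberately no unlock method.
        self._locked = True


refs = SketchReferenceMap()
refs['figure', 'id', 'my_figure'] = 'Figure 3.4-1'
```

With this sketch, 'figure' in refs and ('figure', 'id') in refs both hold, while ('figure', 'title') in refs does not.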

class imprint.core.state.ListType

The type of list numbering to use for <par> tags that require it.

BULLETED = 'bulleted'

Start a new bulleted list.

CONTINUED = 'continued'

Continue with the numbering/bullets of an existing list.

NUMBERED = 'numbered'

Start a new numbered list.

Programs

Imprint comes with a set of command-line entry points to facilitate different tasks. This page is the manual for these programs.

imprint

The main program of Imprint, serving as the entry point to create documents.

Command

The same command can be run on both Linux and Windows systems. The Windows file that provides the executable has a .bat extension and delegates to the extension-less Python file:

imprint configuration
Options
configuration

imprint accepts a single argument, the IPC File to process.

docx2xml

A small utility for extracting text content out of existing Word documents.

Placeholders are inserted for every element that appears to be a table or a figure. No attempt is made to preserve the styles of those elements. Paragraph styles are preserved, as are run styles. An attempt is made to merge as many consecutive runs of the same style as possible.

This program can only operate on .docx files, not on .doc files.

Command

The same command can be run on both Linux and Windows systems. The Windows file that provides the executable has a .bat extension and delegates to the extension-less Python file:

docx2xml input[.docx] [output[.xml]]
Options
input

The input DOCX file to parse. A .docx extension will be appended to the file name if not already present. .doc extensions will only have one letter appended.

output

The output XML file to create. A .xml extension will be appended to the file name if not already present. If the name is missing entirely, the base name of input will be used, with the .docx extension replaced by .xml.
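
The extension rules for both arguments can be sketched as follows. The helper name normalize_names is hypothetical, and the sketch assumes that the default output keeps the directory of input, which the description above does not state explicitly.

```python
def normalize_names(input_name, output_name=None):
    """Apply the docx2xml extension rules to the two file names (sketch)."""
    if input_name.endswith('.doc'):
        # A .doc name only needs the trailing letter.
        input_name += 'x'
    elif not input_name.endswith('.docx'):
        input_name += '.docx'
    if output_name is None:
        # Reuse the input name, swapping .docx for .xml.
        output_name = input_name[:-len('.docx')] + '.xml'
    elif not output_name.endswith('.xml'):
        output_name += '.xml'
    return input_name, output_name


names = normalize_names('report')
```

Here names is ('report.docx', 'report.xml').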

Logging

The program log is one of the outputs of Imprint. It is generated by the engine and plugins. The log provides traceability into the workings of Imprint, including plugins. As an important part of the user interaction on many levels, a separate document to describe the logging facility is merited.

Configuration

Logging is configured through the IPC File. The following keywords are used to configure the logging output:

All keywords are optional. The default is to log WARNING and worse to stdout. If log_stderr is set to True, messages with level ERROR and worse will be sent to stderr instead. In general, when both stdout_level and stderr_level are True, stdout will receive only the messages with levels greater than or equal to stdout_level, but strictly less than stderr_level.

If log_file is set to a non-empty string, all messages will be logged to it regardless of what is written to stdout and stderr.

The logging format can be controlled by log_format, which is the same type of string that can be passed to the format argument of logging.basicConfig or the fmt argument of logging.Formatter. The template is a %-interpolated format string that refers to the attributes of a logging.LogRecord by name.
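
The split between stdout and stderr described above can be reproduced with the standard logging module. The sketch below is not how Imprint wires its handlers. The names BelowLevel and split_streams are hypothetical, and the out and err parameters exist only so the streams can be swapped out for testing.

```python
import logging
import sys


class BelowLevel(logging.Filter):
    """Pass only records strictly below a ceiling level."""

    def __init__(self, ceiling):
        super().__init__()
        self.ceiling = ceiling

    def filter(self, record):
        return record.levelno < self.ceiling


def split_streams(logger, stdout_level=logging.WARNING,
                  stderr_level=logging.ERROR,
                  fmt='%(levelname)s: %(message)s',
                  out=None, err=None):
    """Route [stdout_level, stderr_level) to out and the rest to err."""
    formatter = logging.Formatter(fmt)

    out_handler = logging.StreamHandler(out if out is not None else sys.stdout)
    out_handler.setLevel(stdout_level)
    out_handler.addFilter(BelowLevel(stderr_level))
    out_handler.setFormatter(formatter)

    err_handler = logging.StreamHandler(err if err is not None else sys.stderr)
    err_handler.setLevel(stderr_level)
    err_handler.setFormatter(formatter)

    logger.setLevel(min(stdout_level, stderr_level))
    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger
```

The fmt argument plays the role of log_format: it is a %-style template over logging.LogRecord attributes such as levelname and message.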

Image Logging

If the keyword log_images is truthy, any images that get inserted into the document are also dumped individually to a file. The name of the images is based on the name of the log file (via log_images), or the name of the document if file logging is disabled. The figure, table or string ID is appended after an underscore, and the appropriate extension is added at the end.

Image logging is implemented individually for tags that generate content. It is currently supported for the following tags: <figure>, <latex>, <string> and occasionally for <table>. The strings create small .txt files containing the snippets they generate. Custom tags are expected to respect image logging in a way that makes sense.

Under normal circumstances, the tag descriptor is responsible for logging images. However, in certain cases, the logging can be done by the content handler. Among the built-in tags, this is true for tables, since the variety of input data makes it pointless to generalize the type of logging required (as it is for Figures and Strings).

Logging From Tags

The XML Tag API allows users to process custom tags by implementing a TagDescriptor. Tags should use the engine core’s logging facility, provided by the log method of the EngineState. The reason for using the provided log method instead of the local logger is that it will attach information about the parser’s position in the XML file to every record.

Logging From Plugins

Unlike tag descriptors, plugin handlers are left to their own devices when it comes to logging. All of the XML location information will be available from the surrounding log records provided by the tag, so no real advantage is to be gained from providing location information. On the other hand, plugins can access the convenience methods provided by Python’s logging framework, such as debug and exception.

The standard procedure for the Builtin Plugins is to get a “private” module-level logger, and use that throughout:

import logging
_logger = logging.getLogger(__name__)

Levels

In addition to the normal logging levels provided by the Python logging framework, Imprint sets up the following additional levels:

TRACE
Used to report on the normal activity of a tag processor or plugin that may be irrelevant for any but the most fine-grained debugging. The priority defaults to 5, which is lower than logging.DEBUG but higher than logging.NOTSET.
XTRACE
Similar to TRACE, but includes the current exception information by default. The priority defaults to 2, which is below that of TRACE.

All levels are registered with the logging framework as if they were built-in. The appropriate methods are registered with the currently configured default logger class.
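
Registering extra levels of this kind can be sketched with the standard library alone. The numeric values come from the descriptions above. The method names trace and xtrace, and the way they are attached to the logger class, are assumptions made for illustration.

```python
import logging

TRACE = 5    # below logging.DEBUG (10), above logging.NOTSET (0)
XTRACE = 2   # below TRACE

logging.addLevelName(TRACE, 'TRACE')
logging.addLevelName(XTRACE, 'XTRACE')


def trace(self, msg, *args, **kwargs):
    """Log msg at TRACE level."""
    self.log(TRACE, msg, *args, **kwargs)


def xtrace(self, msg, *args, exc_info=True, **kwargs):
    """Log msg at XTRACE level, attaching exception info by default."""
    self.log(XTRACE, msg, *args, exc_info=exc_info, **kwargs)


# Attach the convenience methods to the currently configured logger class.
logger_class = logging.getLoggerClass()
logger_class.trace = trace
logger_class.xtrace = xtrace
```

After this runs, any logger created through logging.getLogger can call trace and xtrace in the same way as debug or warning, provided its effective level is low enough to let the records through.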

Internal API

The internals of Imprint are implemented in the imprint.core package. Some of the internals are exposed to the user through the XML Tag API in imprint.core.tags and imprint.core.state. The remainder is not normally of interest to the user. However, it may be useful for developers and authors of more complex plugins to have access to the internals of the engine.

Parsers

imprint.core.parsers implements the parsers used to process the XML Template. These parsers make up the heart of the Engine Layer.

There are currently two parsers: ReferenceProcessor and TemplateProcessor. Both are instances of haggis.files.xml.SAXLoggable. The former creates a table of reference names/titles/locations/numbers that are used by the latter.

class imprint.core.parsers.DocxParserBase

Base class that contains common functionality of the XML parsers that make up the Imprint Engine Layer.

This class is only intended to avoid code duplication. It serves no standalone purpose whatsoever.

The XML structure is encoded in the following attributes:

tag_stack

A stack with special methods for entering a tag, exiting a tag, etc, with some structural validation. The current tag is always available via the current property. Each tag is pushed as an object containing the tag name, its (edited) attributes, whether or not it expects content and nested tags, and a flag indicating whether or not a warning has been raised for unexpected text if not. If the tag gets a data configuration, that will be referenced as well.

class imprint.core.parsers.ReferenceProcessor(heading_depth)

The SAX parser that is responsible for pre-computing all the relevant references found within the XML template.

Relevant references are any referenceable tags. This processor maintains its own reference counter based on the occurrence of <figure>, <table> and other tags within <par> tags with Heading styles.

class imprint.core.parsers.TemplateProcessor(keywords, doc, references)

A parser to handle the entire document structure with the assumption that a reference mapping has already been made.

It processes all registered tags, generates all the content, replaces all necessary components such as keywords, strings and references.

Much of the processing is handled by the built-in TagDescriptors and the EngineState. The parser itself performs sanity checking of the XML structure based on the requirements specified in the descriptors. In addition to checking attributes, content and nested tags, it performs a simplistic form of XML validation.

The engine state does not get direct access to the data configuration like it does to the keywords. The data configuration is maintained directly by this class:

data_config

A dict containing all of the data configuration objects (dictionaries) loaded from the appropriate module if keywords contains a 'data_config' key providing the module file name, and None otherwise. Only document setups that actually use data configuration need to provide a configuration module.

Tag Handling
class imprint.core.parsers.RootTag

Implement the Root tag, regardless of its name.

The root tag is special because any spurious text found within it gets stashed in a special paragraph.

class imprint.core.parsers.TagStack

A deque-based stack that does some basic structural checking of the XML.

stack

The actual stack deque, implemented as a read-only property.

current

The current node. This is just the rightmost node in the stack, or None if the stack is empty. Also a read-only property.
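
A deque-backed stack with a read-only view of the current node might look like the sketch below. It stores bare tag names instead of TagStackNode objects, and the open and close method names are hypothetical, since the real methods are not listed in this section.

```python
from collections import deque


class SketchTagStack:
    """Deque-backed stack of open tag names with a basic nesting check."""

    def __init__(self):
        self._stack = deque()

    @property
    def stack(self):
        # Read-only property: the attribute cannot be rebound from outside.
        return self._stack

    @property
    def current(self):
        # The rightmost element is the innermost open tag.
        return self._stack[-1] if self._stack else None

    def open(self, name):       # hypothetical method name
        self._stack.append(name)

    def close(self, name):      # hypothetical method name
        if self.current != name:
            raise ValueError(f'closing <{name}> but <{self.current}> is open')
        return self._stack.pop()


tags = SketchTagStack()
tags.open('imprint-template')
tags.open('par')
```

At this point tags.current is 'par', and attempting tags.close('imprint-template') raises ValueError because the tags would be closed out of order.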

class imprint.core.parsers.TagStackNode(name, attr, descriptor=None, config=None, open_error=False)

A structure for maintaining information about open tags for TemplateProcessor.

All of the attributes except warned are immutable. Because warned must remain mutable, a namedtuple cannot be used, tempting as it is.

All attributes are passed to the constructor in the same order that they are listed here. Only the first two are required.

name

The name of the tag, not normalized in any way.

attr

A plain dict containing the required and optional attributes of the tag. This attribute is mutable and gets passed to both the start and end methods of the tag descriptor. It is not one of the XML library immutable mappings.

descriptor

The TagDescriptor object for this tag. This must always be an actual instance of the class, not a delegate object to be wrapped. Defaults to None.

config

The Data Configuration dictionary, if the descriptor calls for one, None otherwise (the default). If the descriptor has a data_config attribute set but this attribute is None, then open_error must be set to True.

open_error

Lets the closing tag know that a non-fatal error occurred on opening, so the closing tag processor should be ignored. Defaults to False.

warned

Indicates that a text content warning has already been issued for a tag that has a content flag set to False when nested text is found. Otherwise remains False. This attribute can not be set by the user on initialization.

exception imprint.core.parsers.OpenTagError

Used as a goto+label marker when processing opening tags.

As per https://stackoverflow.com/a/41768438/2988730 and https://docs.python.org/3/faq/design.html#why-is-there-no-goto

This error is raised to indicate a non-fatal error that prevents the closing tag from being processed.

Utilities

imprint.core.utilities contains general utilities to help the engine create and process docx files.

The configuration loaders in this module are potentially suitable for inclusion in the haggis library.

imprint.core.utilities.aggressive_strip(string)

Split a string along newlines, strip surrounding whitespace on each line, and recombine with a single space in place of the newlines.
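
The description translates almost directly into a one-line sketch. This version is written from the sentence above and is not copied from the Imprint source, so edge cases such as blank lines may behave differently in the real function.

```python
def aggressive_strip(string):
    """Strip each line and rejoin the lines with single spaces."""
    return ' '.join(line.strip() for line in string.splitlines())


text = aggressive_strip('  first line  \n   second line\n')
```

Here text is 'first line second line'.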

imprint.core.utilities.check_fail_state(fail)

Verify that fail is one of the valid options {'raise', 'warn', 'ignore'}.

Raise a ValueError if it is not.

imprint.core.utilities.trigger_fail_state(fail, msg, error_class=<class 'ValueError'>, warn_class=<class 'UserWarning'>)

React to a failure according to the value of fail:

  • 'ignore': Do nothing
  • 'warn': Raise a warning with message msg and class warn_class (UserWarning by default).
  • 'raise': Raise an error with message msg and class error_class (ValueError by default).

Any other value of fail triggers a ValueError.
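
The two functions above can be sketched together as follows. This is written from the documented behavior and signatures only. The error message for an invalid fail value is invented.

```python
import warnings

_FAIL_STATES = {'raise', 'warn', 'ignore'}


def check_fail_state(fail):
    """Raise ValueError unless fail is one of the accepted options."""
    if fail not in _FAIL_STATES:
        raise ValueError(f'fail must be one of {sorted(_FAIL_STATES)}')


def trigger_fail_state(fail, msg, error_class=ValueError,
                       warn_class=UserWarning):
    """Ignore, warn about, or raise on a failure, depending on fail."""
    check_fail_state(fail)
    if fail == 'warn':
        warnings.warn(msg, warn_class)
    elif fail == 'raise':
        raise error_class(msg)
    # 'ignore' falls through and does nothing.
```

A caller that wants a missing input to be fatal would pass fail='raise', possibly with a more specific error_class such as KeyError.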

imprint.core.utilities.get_handler(handler_name)

Load the named plugin handler.

Handlers are callables that take an object ID and configuration dictionary and generate content for a specific tag like <figure>, <table> or <string>.

If the handler is not found as-is, the imprint.handlers package is prefixed to handler_name since that is where all built-in handlers live.

imprint.core.utilities.load_callable(name, package_prefix=None, magic_module_attribute=<haggis.SentinelType object>, instantiate_class=False)

Retrieve an arbitrary callable from a module.

The input may be one of six things:

  1. A module with a magic_module_attribute that contains the callable.
  2. A callable that implements the correct interface.
  3. The name of a module containing the magic_module_attribute.
  4. The name of a callable.
  5. The name of a module in the package_prefix package.
  6. The name of a callable in the package_prefix package.

The correct thing is identified as leniently as possible and returned. The returned object is not guaranteed to be the correct thing, just to pass very cursory inspection (e.g., modules must have the magic attribute and any other objects must be callable).

Items 1, 3, 5 are not possible if magic_module_attribute is not specified. Items 5, 6 are not possible if package_prefix is not specified.

This method has one special case. If the object found is a class with a no-arg __init__ method and a __call__ method, an instance rather than the class object is returned. Note that class objects themselves are callable, so if you specify a class without a no-arg __init__ method or without a __call__ method, make sure that __init__ has the signature you require and returns the object that you expect.
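
The name-based lookups (items 4 and 6 in the list above) can be sketched with importlib. The function load_by_name is a hypothetical, much reduced stand-in: it handles neither module objects, nor the magic module attribute, nor class instantiation.

```python
import importlib


def load_by_name(name, package_prefix=None):
    """Resolve a dotted name to a callable, retrying under a package prefix."""
    candidates = [name]
    if package_prefix:
        candidates.append(f'{package_prefix}.{name}')
    for candidate in candidates:
        module_name, _, attribute = candidate.rpartition('.')
        if not module_name:
            continue  # a bare name has no module to import from
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        obj = getattr(module, attribute, None)
        if callable(obj):
            return obj
    raise ValueError(f'no callable found for {name!r}')


dumps = load_by_name('dumps', package_prefix='json')   # resolves json.dumps
```

The same fallback order is what get_handler describes: the name is tried as given first, and only then under the prefix package.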

imprint.core.utilities.substitute_headers_and_footers(doc_file_name, keywords)

Perform a keyword replacement on all valid newstyle format strings in the header and footer XML of a word document.

This operation is currently done by treating the XML as if it was a giant string. The assumption is valid but hacky, since format-like strings delimited by ‘{}’ are unlikely to appear anywhere outside <w:t> tags.
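
The string-based replacement can be illustrated on a fragment of header XML. The sketch below is a simplification and not the routine used by Imprint: it only recognizes plain {name} fields without conversion flags or format specifications, and it leaves unknown names untouched. The function name substitute_keywords is hypothetical.

```python
import re

# Matches simple replacement fields such as {serial_number}.
_FIELD = re.compile(r'\{([A-Za-z_][A-Za-z0-9_]*)\}')


def substitute_keywords(xml_text, keywords):
    """Replace {name} fields in raw XML text with keyword values."""
    def replace(match):
        name = match.group(1)
        if name in keywords:
            return str(keywords[name])
        return match.group(0)  # leave unknown fields as they are
    return _FIELD.sub(replace, xml_text)


header = substitute_keywords('<w:t>Report {serial}</w:t>', {'serial': 'A-17'})
```

Here header is '<w:t>Report A-17</w:t>'. A field that Word has split across two runs, such as '{ser' in one <w:t> element and 'ial}' in the next, would not match, which is why such run breaks have to be removed by hand.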

Dependencies

Python

Imprint requires Python version 3.6 or higher.

Core

The core program depends only on three libraries in addition to the built-in Python libraries:

  • python-docx: A library for creating documents in Office Open XML format.
  • lxml: An XML manipulation library that is also a dependency of python-docx.
  • haggis: A suite of Python utilities developed by the author of Imprint to support common functionality across multiple tools, including Imprint itself. All additional dependencies come indirectly from Haggis.

Content-generation plugins generally tend to have a much wider set of dependencies.

Documentation

This documentation is built with sphinx (version >= 1.7.1 required).

The API documentation requires the napoleon extension, which is now bundled with sphinx itself.

The default viewing experience for the documentation is provided by the ReadTheDocs Theme, which is, however, optional. If installed, a version >= 0.4.0 is recommended[1].

Plugins

There is almost no restriction on what Imprint plugin code can depend on. In fact, plugins can use a wide variety of open source tools and libraries for tasks like graphics rendering and file conversion. Both Python libraries and external programs can be dependencies for plugins, since the Python subprocess module supports running arbitrary executables. The lists below show a sample[2] of dependencies used by the Builtin Plugins:

Python Packages

  • numpy: A fast array library for Python. This supports most of the data processing done in Imprint as numpy arrays are virtually ubiquitous in Python. This is a dependency of scipy and pillow.
  • scipy: A scientific computation library for Python. In addition to enhancements to numpy, it supplies interfaces to scientific file formats such as IDL files.
  • matplotlib: A plotting and graphics library for Python. Much of the data visualization is done through this library.
  • pandas: A spreadsheet library for easily manipulating tables.
  • pillow: A graphics file library for Python. Used to import images and convert image files.
  • natsort: A small natural text-sorting algorithm for Python. It provides advanced sorting techniques that are more intuitive than plain lexicographical sorting, e.g., for strings containing both text and numbers.

External Programs

  • ImageMagick: A suite of image conversion programs suitable for almost any reasonable format. Mostly the convert program is used, e.g., to create LaTeX equations for the <latex> tag.
  • Poppler: A library for manipulating PDF files. In particular the pdftoppm program is used to convert PDF files into importable images.
  • GhostScript (gs): Converts PostScript documents into importable images. This is particularly useful for dealing with some of the more flexible backends provided by matplotlib, especially when it comes to LaTeX equations.
  • LaTeX: Some implementation of LaTeX is necessary to support in-text LaTeX equations. texlive and pdflatex are examples of implementations that have been used successfully in testing on Linux systems. Only documents containing the <latex> XML tag require this.
  • dvips: A converter between DVI and PostScript formats is necessary to bridge the formats supported by latex and convert. This is only a dependency for documents that contain <latex> tags. This program is almost always bundled with reasonable LaTeX distributions.

Dependence on external programs generally represents a restriction to portability across platforms. This is often not a major issue because many standard programs are available for Linux and Mac environments, and generally, a particular configuration of Imprint plugins will be used in a fairly static environment.

Footnotes

[1]Versions prior to 0.4.0 had issues with the alignment of line numbers to code in the tutorial examples.
[2]These lists are not exhaustive, but should cover most of the interesting items encountered in general use. All items required for the Builtin Plugins are covered.

Restrictions

While Imprint is an extremely complex and flexible system, there are in fact certain things it can not do. The following list contains the major omissions, with a brief explanation of the underlying reasons for each one:

1. Updating the TOC: each newly generated document requires the user to right-click on the empty table of contents and manually select “Update Table”. This is necessary because calculating the page number of the headers would require a rendering engine equivalent to MS Word.

2. Header and Footer Parseability: in some cases, the XML of the Headers and Footers must be massaged manually to ensure that there are no spurious run-breaks within a keyword-replacement directive. Word will sometimes chunk up text into runs when it is not strictly necessary, resulting in the need for this manual massaging. The root cause is that the python-docx library does not currently have support for headers and footers.

Development

You can contribute to imprint by providing bug reports (or just usage experience), or writing code. Issues can be submitted on GitHub at https://github.com/madphysicist/imprint/issues. To contribute code, fork and clone the repository from https://github.com/madphysicist/imprint. You can modify the code as you wish, and submit a pull request through GitHub.

Branch Structure

Feature branches should be branched from dev. Accepted features should be squashed into a small number of commits. When a sufficient number of commits are made, they will be added to master, the minor version will increment, and a release candidate branch will start.

Installation

Installing the project is not strictly necessary for development. That being said, some features may be better tested when the project is installed. Developers can install their local copy for testing by running the following in the project root:

python setup.py develop

This will symlink the development project to the site packages of the current python environment. It is recommended that this command be run in a dedicated virtual environment.

Coding

Feel free to suggest and/or implement any feature that you feel is useful. The general philosophy is to keep things modular. General purpose functions should be added to the haggis library rather than to Imprint itself.

The documentation should explain how the project works. Introduction to Imprint provides a high-level explanation of the overall structure. The Reference section contains all the references to the individual components.

Testing

At the moment, there is no test package for Imprint. Instead, the Demos in the documentation provide good coverage of almost all available features. If you add a feature that is not already covered, please add a section to the appropriate Tutorials page, and modify the corresponding demo (or add a new one) as necessary.

If you would like to contribute a test package to Imprint, that would be wonderful.

Future Work

This section details some of the prominent features that are currently being proposed or already implemented for Imprint, but are not a part of the main baseline. This is not an exhaustive list, and does not contain any of the minor bug fixes and enhancements that come naturally with any project of this scope.

Further requests and issues should be raised on the GitHub issues page.

Configurable XML Root Tag

The name of the XML root tag can be configured through a key in the *.ipc. If the input_xml_root keyword is missing, the default will remain imprint-template.

Full MathML Support

Imprint will have full MathML support out of the box. At the moment, the details of the interface are being worked out. Currently, a <math> tag simply includes all the XML found inside it verbatim into the OOXML document structure.

Caching of Data

Rather than ensuring that the same loader is used for all datasets, as the current system does, it is better to create a cache of weak references to named datasets, with clear loading instructions by data name rather than handler name. This will improve the speed of Imprint (and is therefore not of prime importance).

User Defaults File

Create a file with user-level defaults. This will be a .imprint file in the user directory on Linux Systems. It will be a mix of default IPC File and options for hard coded default styles, as well as anything else that the user uses consistently as a fallback.

An environment variable, something like IMPRINTDEFS will allow the user to override this option, along with a -D command-line option to imprint.

Clickable Anchors

<figure-ref>, <table-ref> and especially <segment-ref> tags should be replaced with a clickable link-field in the output document. This won’t affect the printed version much, but would be a very nice feature to have.

PowerPoint Presentations

Since the python-pptx library supports a similar low-level interface to python-docx, it is possible to eventually extend Imprint to generate PowerPoint presentations. This is not a high priority because the nature of the PowerPoint medium is such that most presentations tend to be very unique. Word documents tend to be more suitable for cookie cutter generation.

PDF Documents

While this migration/support may be desirable from a portability standpoint, MS Word is fairly ubiquitous, and PDFs are not as editable. This is also a low priority item.

Default DOCX Stub

Given User Defaults File, a default docx stub will be referenced in that file, which will guarantee the existence of all the referenced styles. This allows detailed per-organization or per-project configuration of the styles that get used.

Section Tag

The <section> tag can also specify the page-break type, the margins and the gutter.

Default Plugin Prefix

Add a single default prefix to A) the config file, which would override the B) User Defaults File value. The default-default should be something like imprint.handlers.