Chapter 3

Document Formats - PDF, doc(x), proprietary

What is the document format of your accurate, up-to-date, Gold Standard content?

Before discussing the process of automating suitable documents, it is worth taking some time to consider different document formats and their relationship to document automation. The format of automated documents is important. The organization has invested its resources into the creation of existing documents, and will likely not want to discard all of this simply because the document automation system doesn't work with its types of documents.

In this context, the format of the documents is not necessarily the same as the format that is delivered to the customer, employee, or whoever is the ultimate recipient. It's the format in which the organization maintains its Gold Standard content and its templates; the format it uses for work-in-progress documents. The format with which the organization works on a daily basis is more important than, for example, the ultimate conversion to - and perhaps emailing of the documents as - a PDF file.

Which file formats do you want to use and which formats do you need to use?

The first consideration is the format of existing documents and templates. If these use some version or form of Microsoft Word, an immediate advantage exists. Word is a ubiquitous word processor that allows for the easy formatting of text, insertion of graphics, and text editing. Word is a de facto industry standard. However, if the existing content is in PDF or a proprietary format, automation will be more difficult. PDF and most proprietary formats do not lend themselves to easy automation of more complex documents. The features that can be taken for granted in Word become major obstacles in PDF and most proprietary document formats.

If the document contains more text one time and less text another time (for example, if sometimes it consists of two pages and other times three), Word will have no problem with this. The document re-flows automatically. However, in a PDF-based document this becomes a major problem, as PDF has no notion of paragraphs and paragraph flow. A PDF is effectively a collection of individual characters with a set position on the page. This fact is best illustrated by what happens when one attempts to copy text from a PDF document and paste it elsewhere. In the best-case scenario it becomes a cluster of separate non-formatted paragraphs that don't resemble anything seen in the PDF itself.
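The difference between a flowing document and a fixed-position one can be sketched in a few lines of Python. This is only an illustrative model, not how Word or PDF store content internally: the function names (render_flow, freeze, render_fixed), the 40-character page width, and the sample clauses are all assumptions made for the example. The flow model keeps paragraphs and works out line positions when rendering; the fixed model keeps only text fragments with hard-coded row positions.

```python
import textwrap

WIDTH = 40  # assumed page width in characters, for illustration only


def render_flow(paragraphs, width=WIDTH):
    """Flow model (Word-like): positions are computed at render time."""
    lines = []
    for para in paragraphs:
        lines.extend(textwrap.wrap(para, width=width))
        lines.append("")  # blank line between paragraphs
    return lines[:-1] if lines else lines


def freeze(lines):
    """Fixed model (PDF-like): keep only (row, text) fragments.

    Nothing records which fragments belonged to the same paragraph.
    """
    return [(row, text) for row, text in enumerate(lines) if text]


def render_fixed(fragments):
    """Draw each fragment at its stored row; other rows stay empty."""
    if not fragments:
        return []
    page = [""] * (max(row for row, _ in fragments) + 1)
    for row, text in fragments:
        page[row] = text
    return page


signature = "Signed on behalf of both parties."
paragraphs = [
    "This agreement is made between the parties named below.",
    "An optional clause that only some customers receive.",
    signature,
]

flow_page = render_flow(paragraphs)
fixed_page = freeze(flow_page)

# Remove the optional clause from the flow model: the rest moves up.
flow_without = render_flow([p for p in paragraphs if p != paragraphs[1]])

# Remove the same text from the fixed model: we must first work out which
# fragments made up the clause, and the remaining fragments do not move.
clause_lines = set(textwrap.wrap(paragraphs[1], width=WIDTH))
fixed_without = render_fixed(
    [frag for frag in fixed_page if frag[1] not in clause_lines]
)

print("\n".join(flow_without))
print("-" * WIDTH)
print("\n".join(fixed_without))
```

In the flow model the signature line moves up to close the space left by the removed clause. In the fixed model it stays on its original row, leaving a hole that an automation tool would have to detect and repair itself.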

Page numbering, renumbering of multi-level lists when a portion of the document is removed, table of contents generation and updating, reflowing of paragraphs when a picture is removed, and many other basic tasks become major difficulties in the automation of PDF-based documents and templates.

PDFs do have some advantages. They are more suitable for highly graphical documents that were originally created in a proprietary program such as Adobe InDesign, Adobe Illustrator, or QuarkXPress. This type of software is often favored by designers, because it provides more options for graphics manipulation.

The Microsoft Word document format - or OOXML (Office Open XML) - is an open format that any third-party application can fully support. The format itself supports advanced text and graphics formatting. While no professional graphic designer would create magazines in Word, it would be possible, and there are many examples of beautiful MS Word documents. Often it is easier to create and maintain good-looking documents in Word than it is in professional graphic design software or proprietary document editing software. When a document must be modified and the documents and templates are in Word format, all that is required to complete the task is someone who can operate Microsoft Word.
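To make the openness of the format concrete: a DOCX file is a ZIP package of XML parts, so it can be read and written with general-purpose tools. The sketch below uses only the Python standard library to build a bare package in memory and read the paragraph text back out. The helper names (build_docx, read_paragraphs) and the sample text are assumptions for illustration. The package contains only the three core parts, and whether a particular word processor accepts such a stripped-down file is not verified here; real documents carry many more parts (styles, numbering, settings, and so on).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)


def build_docx(paragraphs):
    """Return the bytes of a bare DOCX-style package (hypothetical helper)."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>'
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


def read_paragraphs(docx_bytes):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        for para in root.iter(f"{{{W_NS}}}p")
    ]


package_bytes = build_docx(["Dear customer,", "Your lease starts on 1 July."])
print(read_paragraphs(package_bytes))
```

Unlike the positioned fragments of a PDF, the XML keeps paragraphs as explicit elements, which is what lets a word processor re-flow the document after an automated change.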

Both PDF and DOCX (along with the rest of the OOXML family, such as DOCM and DOTM) are open and standardized formats that any document automation solution can fully support.

When considering a document automation solution, many organizations retain the document format they already use instead of transitioning to another format. Thus, if an organization already uses Word documents and templates, it may choose a Word-based automation solution. If the organization uses primarily PDF documents, it may choose a PDF-based automation solution because of that existing choice.

Some automation solutions use their own proprietary formats and provide import tools that make it possible to load existing documents into the product. The downside of this approach is that the fidelity of the import is not always acceptable for documents with any degree of rich formatting, dynamic numbered lists, tables of contents, and non-text elements such as floating images, shapes, and diagrams. Upon import, documents typically lose some or all of their rich, dynamic, floating, or positional attributes, along with their margins and other layout features. Hence, there is a massive post-conversion task to identify losses of fidelity and, if the new tool permits it, to make amendments. An additional challenge with proprietary formats is that documents can look different in the template editing environment than they do once generated, whereas WYSIWYG (What You See Is What You Get) is now taken for granted with world-class products like Word.

This doesn't mean that everything about proprietary automation formats is problematic. However, there are inherent differences between the Microsoft Word document format and any proprietary format. Ensuring that an imported document looks the same in another format is nearly impossible. For example, Microsoft has been working on PDF import into Word for years, and even though this feature is now quite usable, the imported documents still lack fidelity and require attention to make them acceptable. Keep in mind that this is the product of a company with nearly unlimited resources and access to some of the best developers in the world.

When an import process results in a real-life document with 100% fidelity, it is a sign that the underlying format is the same before and after import. It may still be a Word document into which automation can now be incorporated, or it may still be a PDF that can now be embellished with automation.

Unless there is a need to switch document formats, the organization should avoid rework by choosing the document automation solution that works with its existing document formats.

In the next section we consider various means of document automation, noting that the organization may already have everything it needs to automate its documents.
