Format overview

Author: Daniel Carrera

An OpenDocument file is a ZIP archive that contains several XML files. The exact files and directories in the archive depend on the content of the document (e.g. images, macros, etc.). A typical document, when unzipped, has the following contents:

content.xml
META-INF/manifest.xml
meta.xml
mimetype
Pictures/
settings.xml
styles.xml
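
A listing like the one above can be produced with any ZIP tool. The following is a minimal sketch using Python's standard zipfile module. The helper name list_package is hypothetical, and the archive is a tiny in-memory stand-in with placeholder contents, not a valid document:

```python
import io
import zipfile

def list_package(data):
    """Return the sorted member names of a ZIP package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return sorted(package.namelist())

# Build a small stand-in archive in memory (placeholder contents only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    package.writestr("content.xml", "<placeholder/>")
    package.writestr("META-INF/manifest.xml", "<placeholder/>")

names = list_package(buffer.getvalue())
print(names)
```

For a real document, the bytes would instead be read from the file on disk.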

We look at these in turn:

content.xml

This is the most important file. It carries the actual content of the document (except for binary data, like images). The base format is inspired by HTML, and though far more complex, it should be reasonably legible to humans:

<text:h text:style-name="Heading_2">
  This is a title
</text:h>
<text:p text:style-name="Text_body"/>
<text:p text:style-name="Text_body">
  This is a paragraph. The formatting (font,
  colour, etc.) is specified in the Text_body
  style. The empty text:p tag above is a blank
  paragraph (i.e. an empty line).
</text:p>
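
Because the markup is plain XML, a fragment like the one above can be read with an ordinary XML parser. The sketch below uses Python's xml.etree.ElementTree. The function name extract_blocks and the office:text wrapper element (needed here only to declare the namespaces) are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Shortened version of the sample, wrapped so the prefixes are declared.
SAMPLE = f"""<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">
  <text:h text:style-name="Heading_2">This is a title</text:h>
  <text:p text:style-name="Text_body"/>
  <text:p text:style-name="Text_body">This is a paragraph.</text:p>
</office:text>"""

def extract_blocks(xml_text):
    """Return (tag, style, text) for each heading and paragraph, in order."""
    wanted = (f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p")
    blocks = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag in wanted:
            local_name = element.tag.split("}")[1]
            style = element.get(f"{{{TEXT_NS}}}style-name")
            text = "".join(element.itertext()).strip()
            blocks.append((local_name, style, text))
    return blocks

blocks = extract_blocks(SAMPLE)
for block in blocks:
    print(block)
```

The empty paragraph comes back with an empty string as its text, which matches its role as a blank line.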

META-INF/manifest.xml

The manifest file contains a list of all the files in the ZIP archive. The contents might look like this:

<manifest:file-entry
 manifest:media-type="image/png" 
 manifest:full-path="Pictures/10ECF14403.png"/>
<manifest:file-entry 
 manifest:media-type="text/xml" 
 manifest:full-path="content.xml"/>
<manifest:file-entry 
 manifest:media-type="text/xml" 
 manifest:full-path="styles.xml"/>

The META-INF directory and its manifest mirror the layout of Java JAR archives, which also keep a manifest under META-INF. This is another example of OpenDocument reusing well-established standards instead of reinventing the wheel.
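
As a rough sketch, the manifest entries shown above can be turned into a path-to-media-type mapping with a few lines of Python. The function name manifest_entries is hypothetical, and the manifest:manifest root element is added here so that the namespace prefix is declared:

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

SAMPLE = f"""<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
  <manifest:file-entry manifest:media-type="image/png"
                       manifest:full-path="Pictures/10ECF14403.png"/>
  <manifest:file-entry manifest:media-type="text/xml"
                       manifest:full-path="content.xml"/>
  <manifest:file-entry manifest:media-type="text/xml"
                       manifest:full-path="styles.xml"/>
</manifest:manifest>"""

def manifest_entries(xml_text):
    """Map each full-path in the manifest to its declared media type."""
    entries = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        entries[path] = entry.get(f"{{{MANIFEST_NS}}}media-type")
    return entries

entries = manifest_entries(SAMPLE)
print(entries)
```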

meta.xml

This file contains the document metadata, for example the author, the "Last modified by" name, and the date of last modification. The contents look somewhat like this:

<meta:creation-date>
  2003-09-10T15:31:11
</meta:creation-date>
<dc:creator>Daniel Carrera</dc:creator>
<dc:date>
  2005-06-29T22:02:06
</dc:date>
<dc:language>es-ES</dc:language>
<meta:document-statistic
 meta:table-count="6" meta:object-count="0"
 meta:page-count="59" meta:paragraph-count="676"
 meta:image-count="2" meta:word-count="16701"
 meta:character-count="98757"/>

The names of the dc: tags are taken from the Dublin Core metadata standard. The dates follow the ISO 8601 standard.
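
Since the dates are in ISO format, they can be parsed directly by standard date libraries. The following sketch reads a shortened version of the sample with Python. The function name read_metadata and the office:meta wrapper element are assumptions made so that the fragment parses on its own:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

DC_NS = "http://purl.org/dc/elements/1.1/"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
NS = {"dc": DC_NS, "meta": META_NS}

SAMPLE = f"""<office:meta xmlns:office="{OFFICE_NS}"
                          xmlns:meta="{META_NS}" xmlns:dc="{DC_NS}">
  <meta:creation-date>2003-09-10T15:31:11</meta:creation-date>
  <dc:creator>Daniel Carrera</dc:creator>
  <dc:date>2005-06-29T22:02:06</dc:date>
  <meta:document-statistic meta:page-count="59" meta:word-count="16701"/>
</office:meta>"""

def read_metadata(xml_text):
    """Pull the creator, last-modified date and word count out of meta.xml."""
    root = ET.fromstring(xml_text)
    stats = root.find("meta:document-statistic", NS)
    modified = root.findtext("dc:date", namespaces=NS).strip()
    return {
        "creator": root.findtext("dc:creator", namespaces=NS).strip(),
        "modified": datetime.fromisoformat(modified),
        "word_count": int(stats.get(f"{{{META_NS}}}word-count")),
    }

info = read_metadata(SAMPLE)
print(info)
```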

mimetype

This is a one-line file containing the mimetype of the file. For a text document that would be:

application/vnd.oasis.opendocument.text
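
Reading this file is a quick way to identify the document type without parsing any XML. A minimal sketch in Python follows. The helper name read_mimetype is hypothetical, and the archive is an in-memory stand-in rather than a complete document:

```python
import io
import zipfile

def read_mimetype(data):
    """Return the contents of the 'mimetype' member of a ZIP package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.read("mimetype").decode("ascii").strip()

# Stand-in archive holding only the mimetype file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("mimetype", "application/vnd.oasis.opendocument.text")

mimetype = read_mimetype(buffer.getvalue())
print(mimetype)
```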

Pictures/

This is a directory that contains images in common image formats such as JPEG and PNG. They are referenced from content.xml in a way similar to the <img> tag in HTML.

<draw:image   
   xlink:href="Pictures/1D67595BF2E.png" 
   xlink:type="simple" xlink:show="embed" 
   xlink:actuate="onLoad"/>
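
The image references can be collected by looking for the xlink:href attribute on draw:image elements. The sketch below does this in Python. The function name image_paths is hypothetical, and the enclosing draw:frame element is included here only so that the namespaces are declared:

```python
import xml.etree.ElementTree as ET

DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

SAMPLE = f"""<draw:frame xmlns:draw="{DRAW_NS}" xmlns:xlink="{XLINK_NS}">
  <draw:image xlink:href="Pictures/1D67595BF2E.png"
              xlink:type="simple" xlink:show="embed"
              xlink:actuate="onLoad"/>
</draw:frame>"""

def image_paths(xml_text):
    """Return the archive paths of all images referenced in the XML."""
    root = ET.fromstring(xml_text)
    return [
        image.get(f"{{{XLINK_NS}}}href")
        for image in root.iter(f"{{{DRAW_NS}}}image")
    ]

paths = image_paths(SAMPLE)
print(paths)
```

Each returned path names a member of the ZIP archive, so the image data itself can then be read from the package.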

settings.xml

This includes settings such as the zoom factor or the cursor position. These are properties that are not content or layout.

styles.xml

This file contains style information. Styles include things like font size, colour, page width, and any kind of formatting.

OpenDocument provides a strong separation between content (in content.xml) and formatting (in styles.xml). The style types include:

  • Paragraph styles.
  • Page styles.
  • Character styles.
  • Frame styles.
  • List styles.

In OpenDocument all formatting is done through styles. Even "manual" formatting is implemented through styles (the application dynamically makes new styles as needed).

Why use a ZIP file?

File size

The most obvious benefit is file size. Because they are compressed, OpenDocument files are normally a lot smaller than equivalent Microsoft .doc files. Furthermore, the longer the document, the greater the benefit from compression, since XML markup is highly repetitive.
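
The effect is easy to observe with repetitive XML. The sketch below compresses a synthetic content.xml with Python's zipfile module and compares the sizes. The sample paragraph and the repeat count of 1000 are arbitrary assumptions, and real documents will compress less dramatically than this artificial input:

```python
import io
import zipfile

# Synthetic, highly repetitive markup standing in for a long document.
paragraph = '<text:p text:style-name="Text_body">Some sample text.</text:p>\n'
raw = (paragraph * 1000).encode("utf-8")

# Store it in a ZIP archive using DEFLATE compression.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
    package.writestr("content.xml", raw)

raw_size = len(raw)
packed_size = len(buffer.getvalue())
print(raw_size, packed_size)
```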

Cleaner XML

You can embed binary data (like images) in a way that keeps the XML code clean and understandable. You can put an image in the Pictures directory and refer to it from inside the document with the syntax:

<draw:image
   xlink:href="Pictures/12010E7CF14403.png"
   xlink:type="simple" xlink:show="embed" 
   xlink:actuate="onLoad"/>

If the format didn't use a ZIP file, it would have to embed the binary data directly into the file. That might look somewhat like this:

<image content="AAAAB3NzaC1kc3MAAACBALhIE5ZbPWD
uB44Qo/+DGECA8u1Jl4QdwubYgiweQQX4ZeD6LduuZk+HMW
bfvGpADeOAzS7Aw1nBPbp1F7AKo9LpGBwv/70dX0HE5hm5X
2JKXhzom4M2IPtv9BV7qKXvqdibltAPX6kTWS7Bp/o3krNL
zNsV6zkuMEETFz3Rmt2hAAAAFQCfLFFL0ouPHx3wKtgyeL5
aUO8W+QAAAIBf7MGYYn8ylLOdAs4LX00pQpaAEuwjYalnxy
ZUMpBnBhwjOkY0OH10m3hASu/jnTvbJchm43NK0YyvW1zCa
YGKrUllFrfh4pamr4Ov3zmoL7BUK0zGwowrD6ILd3OroNch
pAetV0YMJ2FkSfPlDuBaddMhymtWoFDLQ9QEzkbaTgAAAIA
sxyNNHT7MN8VaF8GZWLUaq+dl9rj2wgSPYHkDV/EvGqgB0e
GNEXzty3X2GqAg9z10Qj1W2Ua7FnC57kUdSG6B68Ei1Qkv/
N0yGFSZ3xbnP9hxFa2H/2DDPAftuBJT8MUZJHKttXVZ6jJ/
3aXEMBcL8eXw6kWroOR/L7NUxHpvyw"/>

(In reality the encoded data would be much longer.)
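
Embedding binary data as text also makes it larger. Assuming the data were encoded as Base64, a common choice for binary data in XML, every 3 bytes become 4 characters. The following sketch measures this in Python with arbitrary sample bytes:

```python
import base64

# 3072 arbitrary bytes standing in for image data.
binary = bytes(range(256)) * 12

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(binary)

print(len(binary), len(encoded))
```

Keeping the image as a separate file in the archive avoids both the size overhead and the unreadable block in the middle of the XML.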