Tech Tip: Exploring the Export XML Archive Format

By Nigel Cheshire

When I was a kid, I developed an unfortunate habit of taking things apart to try and find out how they worked. Toys, transistor radios, vacuum cleaners, pretty much anything mechanical or electrical had to come apart for inspection. I say “unfortunate” because, although I learned a lot, and could usually get them back together again, they didn’t always work quite as well after the disassembly/reassembly cycle. And sometimes, much to my parents’ chagrin, I couldn’t get them back together at all. If that sounds even slightly like you, you may be interested in this post.

If you’re a user of Teamstudio Export, you’ll know that the first step in the process of creating read-only, HTML-format archives of your HCL/IBM Notes and Domino databases is to create an archive of the data in XML format. If your primary objective is to create the stand-alone HTML archives to allow users to continue to access their Notes data in perpetuity, then you may not have given those XML archives much thought. But it’s helpful to understand that, because the XML archives contain everything that was in the Notes database, and the format is unlikely to change much (if at all) in the future, you probably want to keep those XML archives around for the foreseeable future.

To illustrate the point, consider this. In the roughly 18 months since we shipped the first version of Export, we have made many, many improvements to the HTML export process, each of which has added new features to the HTML archive. (We have a major new release in the works which will seriously improve the usability of the archives, but we’re keeping that under wraps for now.) In the same time period, we’ve made almost no changes to the format of the XML archives: they are what they are. So, you could take an archive that you created with Export 1.0 in February 2018 and use it to create a fully functional HTML archive with Export 2.2, including all the latest bells and whistles, with no problem.

So what is hidden within those mysterious “.tse” files that Export produces and that contain the XML archive? In fact, the single .tse file that holds the entire contents of a Notes database is nothing more than a ZIP archive of many individual XML files. Because the XML files are all (of course) text based, and they don’t contain the view indexes that tend to bloat the database in its NSF format, they zip up pretty small. For example, the Domino Designer 9.0 Help database weighs in at 7.8 MB in NSF format, the raw XML files measure 4.1 MB, and the zipped TSE archive comes down to 683 KB.

Because the TSE file is nothing more than a ZIP archive, if you want to nose around inside, all you have to do is add a .zip suffix to the filename and you can decompress it like any other ZIP file. If you take a peek inside any archive file, at the top level you’ll find four folders:

1. Data Folder

The data folder contains a file for every document in the database. These are encoded using the Notes/Domino standard DXL format, which is mostly pretty easy to decipher just by looking at it. As of the time of writing, the document type definition for DXL is located here, at least for now. (Keep in mind that the sale of Notes and Domino to HCL happened about a week ago, so that content may very well move to somewhere on HCL’s site in the future!) The name of each file is based on the note id of the corresponding document.

2. Design Folder

As you might guess, the design folder holds a file for each design element, encoded in DXL and named by note id.

3. Profile Folder

The profile folder contains one file for each profile document in the database, also encoded in DXL.

4. Views Folder

The views folder contains one XML file for each view and each folder in the database. There is no standard DXL format for view data, and so we have defined our own format, which is documented in the Export online doc.
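If you would rather not rename anything, a few lines of Python can peek at the layout directly, since the standard zipfile module does not care about the file extension. The sketch below is only an illustration: the function name summarize_archive and the path example.tse are made up for this post, and the code deliberately avoids assuming the exact folder names by grouping entries on whatever top-level name it finds.

```python
import zipfile
from collections import Counter


def summarize_archive(path):
    """Count entries per top-level name and total the raw/zipped sizes.

    `path` may be a filename or a binary file-like object holding a
    ZIP archive (which is all a .tse file is).
    """
    counts = Counter()
    raw_bytes = 0
    zipped_bytes = 0
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            # ZIP member names always use "/" as the separator.
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
            raw_bytes += info.file_size
            zipped_bytes += info.compress_size
    return counts, raw_bytes, zipped_bytes


if __name__ == "__main__":
    # "example.tse" is a hypothetical path; point it at one of your archives.
    counts, raw_bytes, zipped_bytes = summarize_archive("example.tse")
    for name, count in sorted(counts.items()):
        print(f"{name}: {count} file(s)")
    print(f"raw: {raw_bytes} bytes, zipped: {zipped_bytes} bytes")
```

Files that sit at the top level of the archive show up under their own name with a count of one, so the same listing covers both the folders and the loose files.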

In addition to these folders, you will find five files at the top level of the directory tree:

1. acl.dxl - a file containing the ACL information in DXL format;

2. db.dxl - this captures database level information, such as the replica id;

3. log.txt - a plain text file containing any errors or warnings that occurred during the archive process;

4. meta.xml - this is an XML format file containing metadata which is used by Export primarily to maintain the UI;

5. unidindex.txt - this is a plain text CSV file that maps NoteIDs to UniversalNoteIDs (UNIDs) and allows Export to convert between the two during the HTML export process.
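To make the unidindex.txt mapping a little more concrete, here is a rough Python sketch that reads it straight out of an archive into a dictionary. Treat the details as assumptions rather than documented behavior: I am assuming each row holds the NoteID in the first column and the UNID in the second, with no header row and UTF-8 encoding, and the function name load_unid_index and the path example.tse are invented for the example. Check a real file before relying on any of this.

```python
import csv
import io
import zipfile


def load_unid_index(path, index_name="unidindex.txt"):
    """Return a dict mapping NoteID -> UNID read from the archive.

    Assumed layout (not confirmed by the Export documentation):
    one CSV row per document, NoteID first, UNID second, no header.
    """
    note_to_unid = {}
    with zipfile.ZipFile(path) as archive:
        with archive.open(index_name) as raw:
            # Wrap the binary member so the csv module can read it as text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                if len(row) < 2:
                    continue  # skip blank or malformed rows
                note_to_unid[row[0].strip()] = row[1].strip()
    return note_to_unid


if __name__ == "__main__":
    # "example.tse" is a hypothetical path; substitute a real archive.
    index = load_unid_index("example.tse")
    print(f"{len(index)} documents indexed")
```

If the column order turns out to be the other way around in your archives, swapping row[0] and row[1] is the only change needed.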

And that’s it. If you’re curious, unzip an archive and take a look. Will you ever need to know any of this information? Possibly not, but if you’re anything like me, you want to know how something works as much as how to operate it. And in this case, you don’t even need to take anything apart.