Webucator Blog

Converting an HTML table to WordProcessingML with XSLT

This tutorial shows how to convert an HTML table to WordProcessingML using XSLT. I’ll start with a simple table with the same number of cells in each row. Then I’ll address cell merging, which makes everything much more complicated.

Simple Tables

A Simple HTML Table


[Code listing: an HTML table with the caption “My Caption”, a header row containing Heading 1 through Heading 4, and four body rows of four “data” cells each.]
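
Markup for a table like this might look roughly as follows. This is a sketch rebuilt from the cell text alone; the use of thead and tbody is an assumption, though the stylesheet later in the post does mention row grouping elements.

```xml
<table>
  <caption>My Caption</caption>
  <thead>
    <tr>
      <th>Heading 1</th>
      <th>Heading 2</th>
      <th>Heading 3</th>
      <th>Heading 4</th>
    </tr>
  </thead>
  <tbody>
    <!-- Four identical body rows, each with four cells -->
    <tr><td>data</td><td>data</td><td>data</td><td>data</td></tr>
    <tr><td>data</td><td>data</td><td>data</td><td>data</td></tr>
    <tr><td>data</td><td>data</td><td>data</td><td>data</td></tr>
    <tr><td>data</td><td>data</td><td>data</td><td>data</td></tr>
  </tbody>
</table>
```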

The result with some styling to show borders:

[Figure: Simple HTML Table]

A Simple WordProcessingML Table

[Code listing: the equivalent WordProcessingML, consisting of a caption paragraph (“My Caption”) followed by a table with a header row (Heading 1 through Heading 4) and four rows of four “Data” cells each.]
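
To give a feel for the structure, here is an abbreviated sketch of such a document. Several details are assumptions rather than the original markup: the namespace URI (the one used by Word 2007 and later .docx files), the w:document and w:body wrapper, and the "Caption" paragraph style. Only the first data row is written out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <!-- The caption is an ordinary paragraph placed before the table -->
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Caption"/> <!-- assumed style name -->
        <w:keepNext/>
      </w:pPr>
      <w:r><w:t>My Caption</w:t></w:r>
    </w:p>
    <w:tbl>
      <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
      <w:tblGrid/> <!-- left empty; Word works out the columns -->
      <!-- Header row: centered, bold text -->
      <w:tr>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p>
            <w:pPr><w:jc w:val="center"/></w:pPr>
            <w:r><w:rPr><w:b/></w:rPr><w:t>Heading 1</w:t></w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p>
            <w:pPr><w:jc w:val="center"/></w:pPr>
            <w:r><w:rPr><w:b/></w:rPr><w:t>Heading 2</w:t></w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p>
            <w:pPr><w:jc w:val="center"/></w:pPr>
            <w:r><w:rPr><w:b/></w:rPr><w:t>Heading 3</w:t></w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p>
            <w:pPr><w:jc w:val="center"/></w:pPr>
            <w:r><w:rPr><w:b/></w:rPr><w:t>Heading 4</w:t></w:r>
          </w:p>
        </w:tc>
      </w:tr>
      <!-- First data row; the other three rows repeat this pattern -->
      <w:tr>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p><w:r><w:t>Data</w:t></w:r></w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p><w:r><w:t>Data</w:t></w:r></w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p><w:r><w:t>Data</w:t></w:r></w:p>
        </w:tc>
        <w:tc>
          <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
          <w:p><w:r><w:t>Data</w:t></w:r></w:p>
        </w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>
```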

The result with some basic Word 2010 styling:

[Figure: Simple WordProcessingML Table]

Comparing the HTML and WordProcessingML Tables

HTML | WordProcessingML | Description
<table> | <w:tbl> | “Root” table element
<caption> | Use <w:p> before <w:tbl> | Table caption
<thead>, <tbody>, <tfoot> | No equivalent | Row grouping elements
No equivalent | <w:tblPr> | Table properties
No equivalent | <w:tblGrid> | Defines the columns
<tr> | <w:tr> | Table row
<th>, <td> | <w:tc> | Table cells
No equivalent | <w:tcPr> | Table cell properties

The XSLT (No Merging)

The XSLT for converting the above HTML table to a WordProcessingML table is not that complicated. Here it is:

[Code listing: the XSLT stylesheet for the simple table.]
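
A stylesheet along these lines could do the job. It is a sketch, not the original listing: it assumes XSLT 1.0, well-formed HTML input with no namespace, the .docx namespace URI, a w:document/w:body wrapper, and a "Caption" paragraph style, none of which are confirmed by the post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap the converted tables in a document body (assumed wrapper) -->
  <xsl:template match="/">
    <w:document>
      <w:body>
        <xsl:apply-templates select="//table"/>
      </w:body>
    </w:document>
  </xsl:template>

  <xsl:template match="table">
    <!-- Pull the caption out and emit it as a paragraph before the table -->
    <xsl:if test="caption">
      <w:p>
        <w:pPr><w:pStyle w:val="Caption"/><w:keepNext/></w:pPr>
        <w:r><w:t><xsl:value-of select="caption"/></w:t></w:r>
      </w:p>
    </xsl:if>
    <w:tbl>
      <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
      <w:tblGrid/>
      <!-- Process everything except the caption, which is already output -->
      <xsl:apply-templates select="*[not(self::caption)]"/>
    </w:tbl>
  </xsl:template>

  <!-- Row groups have no equivalent, so just process their rows -->
  <xsl:template match="thead | tbody | tfoot">
    <xsl:apply-templates select="tr"/>
  </xsl:template>

  <xsl:template match="tr">
    <w:tr><xsl:apply-templates select="th | td"/></w:tr>
  </xsl:template>

  <!-- Header cells: centered and bold -->
  <xsl:template match="th">
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p>
        <w:pPr><w:jc w:val="center"/></w:pPr>
        <w:r><w:rPr><w:b/></w:rPr><w:t><xsl:value-of select="."/></w:t></w:r>
      </w:p>
    </w:tc>
  </xsl:template>

  <!-- Data cells: plain text -->
  <xsl:template match="td">
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p><w:r><w:t><xsl:value-of select="."/></w:t></w:r></w:p>
    </w:tc>
  </xsl:template>
</xsl:stylesheet>
```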

Things to note:

  1. The caption in HTML is nested within the table, while in WordProcessingML the caption comes before the start of the table. So we need to pull the caption out of the table in the table template and ignore the caption element when applying further templates.
  2. In WordProcessingML, table header cells are handled with basic formatting elements, e.g., <w:jc w:val="center"/> (for centering) and <w:b/> (for bold).
  3. The <w:tblGrid> element would normally have nested <w:gridCol> elements to define the width of the table columns. However, WordProcessingML allows the element to be empty, and Microsoft Word 2010 is able to work out the number of columns from the table itself and automatically adjust the column widths to fit the content (note the <w:tcW w:w="0" w:type="auto"/> elements nested within the <w:tcPr> elements).

Merging Cells

HTML and WordProcessingML use very different models for merging cells, so cell merging makes the XSLT much more complicated.

HTML Model for Merging Cells

In HTML, cell merging is handled with the colspan and rowspan attributes, which can be applied to th and td elements.

An HTML Table with Cell Merging


	
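[Code listing: an HTML table with the caption “My Caption” and a header row (Heading 1 through Heading 4). Its four body rows contain, in order: “Long column” and “Long row top”; three “data” cells; “data” and “2 cols merged”; and “Long row bottom”.]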
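
Markup for this table might look like the sketch below. The rowspan and colspan values are inferred from the cell text and from the four-column header, so treat them as assumptions.

```xml
<table>
  <caption>My Caption</caption>
  <thead>
    <tr>
      <th>Heading 1</th>
      <th>Heading 2</th>
      <th>Heading 3</th>
      <th>Heading 4</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <!-- Spans all four body rows in column 1 -->
      <td rowspan="4">Long column</td>
      <!-- Spans columns 2 to 4 -->
      <td colspan="3">Long row top</td>
    </tr>
    <tr>
      <td>data</td>
      <td>data</td>
      <td>data</td>
    </tr>
    <tr>
      <td>data</td>
      <td colspan="2">2 cols merged</td>
    </tr>
    <tr>
      <td colspan="3">Long row bottom</td>
    </tr>
  </tbody>
</table>
```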

The result with some styling to show borders:

[Figure: HTML Table with Cell Merging]

A WordProcessingML Table with Cell Merging

Note the vMerge and gridSpan tags.

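[Code listing: the equivalent WordProcessingML, with the caption “My Caption”, a header row (Heading 1 through Heading 4), and body cells “Long column”, “Long row top”, three “Data” cells, “Data”, “2 cols merged”, and “Long row bottom”.]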
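
The body rows of such a table could be written roughly as follows. This is an abbreviated sketch under the same assumptions as before (the .docx namespace URI, an empty w:tblGrid); the caption paragraph and header row are left out, and in a real document the w:tbl element would sit inside w:body.

```xml
<w:tbl xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
  <w:tblGrid/>
  <!-- Header row omitted; it matches the simple table -->
  <w:tr>
    <w:tc>
      <!-- Top cell of the vertical merge -->
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:vMerge w:val="restart"/></w:tcPr>
      <w:p><w:r><w:t>Long column</w:t></w:r></w:p>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:gridSpan w:val="3"/></w:tcPr>
      <w:p><w:r><w:t>Long row top</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <!-- Continuation cell: no text, but a paragraph is still required -->
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:vMerge/></w:tcPr>
      <w:p/>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p><w:r><w:t>Data</w:t></w:r></w:p>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p><w:r><w:t>Data</w:t></w:r></w:p>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p><w:r><w:t>Data</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:vMerge/></w:tcPr>
      <w:p/>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/></w:tcPr>
      <w:p><w:r><w:t>Data</w:t></w:r></w:p>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:gridSpan w:val="2"/></w:tcPr>
      <w:p><w:r><w:t>2 cols merged</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
  <w:tr>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:vMerge/></w:tcPr>
      <w:p/>
    </w:tc>
    <w:tc>
      <w:tcPr><w:tcW w:w="0" w:type="auto"/><w:gridSpan w:val="3"/></w:tcPr>
      <w:p><w:r><w:t>Long row bottom</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
</w:tbl>
```

Every body row accounts for all four grid columns, either through separate cells or through gridSpan, which is the property the XSLT has to reproduce.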

The result with some basic Word 2010 styling:

[Figure: WordProcessingML Table with Cell Merging]

The XSLT (Cell Merging)

Dealing with horizontal merging (merging across columns) is pretty straightforward. Where HTML uses the colspan attribute, WordProcessingML uses a gridSpan tag with a val attribute to specify the number of merged columns. So…

<td colspan="2"> would be <w:gridSpan w:val="2"/> (nested within the <w:tcPr> element)

To handle this in the XSLT, we add the following to our td and th templates:

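[Code listing: an XSLT fragment that outputs a gridSpan element when the cell has a colspan attribute.]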

The Challenge

The challenge of using XSLT to convert an HTML table with merged cells to WordProcessingML is that the models for vertical cell merging are so different.

HTML uses the rowspan attribute, which is analogous to the colspan attribute.

WordProcessingML, however, uses a <w:vMerge/> element, which must appear in every cell that is part of the vertical merge. The <w:vMerge/> element in the topmost cell (w:tc) must include a w:val attribute with a value of “restart”. The subsequent <w:vMerge/> elements do not require a w:val attribute, but if they have one, the value should be “continue”.

So in WordProcessingML, each row has the same number of child cell elements. In HTML, however, there are no elements in subsequent rows that match up with the cell containing the rowspan attribute. The whole merged cell is handled with a single element.

That leaves us with the challenge of creating something from nothing. Specifically, we need to determine for each row (after the first):

  1. Does the current cell contain a rowspan attribute?
  2. Do any of the previous rows contain a cell with a rowspan attribute?
  3. If so, does the rowspan reach down to the current row?
  4. If so, in which position should we insert a cell?

The first problem is easy enough. We just add this bit of code to the td and th templates:

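[Code listing: an XSLT fragment that outputs a vMerge element when the cell has a rowspan attribute.]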

Since we’ll need to add a lot of logic to the th and td templates, let’s merge them like this (note the tests that restrict the centering and bold formatting to th cells):

[Code listing: the combined td | th template.]
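
A combined template of this kind might look like the sketch below. The self::th tests and the "greater than 1" checks on colspan and rowspan are illustrative choices, not necessarily what the original listing used, and the namespace URI is again the assumed .docx one. The stylesheet wrapper is there only so the fragment is well-formed on its own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <!-- One template for both header and data cells -->
  <xsl:template match="td | th">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="0" w:type="auto"/>
        <!-- colspan="n" becomes <w:gridSpan w:val="n"/> -->
        <xsl:if test="@colspan &gt; 1">
          <w:gridSpan w:val="{@colspan}"/>
        </xsl:if>
        <!-- A cell with a rowspan is the top of a vertical merge -->
        <xsl:if test="@rowspan &gt; 1">
          <w:vMerge w:val="restart"/>
        </xsl:if>
      </w:tcPr>
      <w:p>
        <w:pPr>
          <!-- Center header cells only -->
          <xsl:if test="self::th">
            <w:jc w:val="center"/>
          </xsl:if>
        </w:pPr>
        <w:r>
          <w:rPr>
            <!-- Bold header cells only -->
            <xsl:if test="self::th">
              <w:b/>
            </xsl:if>
          </w:rPr>
          <w:t><xsl:value-of select="."/></w:t>
        </w:r>
      </w:p>
    </w:tc>
  </xsl:template>
</xsl:stylesheet>
```

Note that gridSpan is emitted before vMerge, which is the order the WordProcessingML schema expects inside w:tcPr.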

Questions 2-4 above are tougher to answer. I took the approach of creating a pseudo-lookup-table to indicate where we need to insert the vMerge cells. The pseudo-lookup-table (vMergeLookUp) is a string formatted as follows:

|cellPos:rowPos||cellPos:rowPos||cellPos:rowPos||cellPos:rowPos|

The idea is that we should be able to look at any cell position in the table grid and determine from the vMergeLookUp whether we need to insert a vMerge cell in that position.

To illustrate, let’s look at the following relatively simple table:

Row 1: data (spans 3 rows) | data | data (spans 3 rows)
Row 2: (merged) | data | (merged)
Row 3: (merged) | data | (merged)

The XSLT processor will hit five td elements. Our code needs to tell it to insert four vMerge cells: one before and one after each of the lone td elements in the second and third rows. So our vMergeLookUp string would look like this:

|1:2||1:3||3:2||3:3|
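
To make the numbering concrete, here is markup that would produce a table like this one, with comments showing which vMergeLookUp entries each merged cell gives rise to. The markup is a reconstruction: the rowspan values are inferred from the description of five td elements and four inserted cells.

```xml
<table>
  <tbody>
    <tr>
      <!-- Column 1, rows 1 to 3: needs vMerge cells at 1:2 and 1:3 -->
      <td rowspan="3">data</td>
      <td>data</td>
      <!-- Column 3, rows 1 to 3: needs vMerge cells at 3:2 and 3:3 -->
      <td rowspan="3">data</td>
    </tr>
    <tr>
      <!-- Row 2: insert a vMerge cell before (1:2) and after (3:2) this cell -->
      <td>data</td>
    </tr>
    <tr>
      <!-- Row 3: insert a vMerge cell before (1:3) and after (3:3) this cell -->
      <td>data</td>
    </tr>
  </tbody>
</table>
```

Wrapping each entry in pipes means a substring test for |1:2| cannot accidentally match a longer entry such as |11:2| or |1:25|.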

I created the vMergeLookUp using a called template, which makes use of another called template. The code is a bit complex as it involves recursively looking through preceding siblings for colspan attributes and adding their values to interpret the position at which we need to insert a vMerge cell. I’m not going to go through it here (phew, that’s a relief!).

I first create the vMergeLookUp at the thead/tbody/tfoot level and then pass it on to other templates using xsl:with-param. If your HTML table doesn’t use these row grouping elements, you can move the code to the table template.

I make use of the vMergeLookUp in two places:

  1. In the tr template, I use it to see if we need to insert a vMerge cell before the first cell.
  2. In the td | th template, I use it to see if we need to insert a vMerge cell after the current cell.

Here’s the code for the tr template (note the block that inserts a vMerge cell before the first cell when the lookup calls for one):

[Code listing: the tr template.]
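
The first of these two checks could be sketched as follows. Several things here are assumptions made for illustration: the lookup string is hard-coded for the merged example table rather than built by a called template, rowPos is counted within the row group, and the cell template is a simplified stand-in. The sketch only covers merges that start in the first column; it does not attempt the second check (inserting cells after the current cell).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <xsl:template match="table">
    <w:tbl>
      <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
      <w:tblGrid/>
      <xsl:apply-templates select="thead | tbody | tfoot"/>
    </w:tbl>
  </xsl:template>

  <xsl:template match="thead | tfoot">
    <xsl:apply-templates select="tr"/>
  </xsl:template>

  <!-- ASSUMPTION: hard-coded lookup for the merged example table
       ("Long column" continues into rows 2, 3 and 4 of column 1) -->
  <xsl:template match="tbody">
    <xsl:apply-templates select="tr">
      <xsl:with-param name="vMergeLookUp" select="'|1:2||1:3||1:4|'"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="tr">
    <!-- Defaults to an empty string when no lookup is passed in -->
    <xsl:param name="vMergeLookUp"/>
    <!-- Position of this row within its row group -->
    <xsl:variable name="rowPos" select="count(preceding-sibling::tr) + 1"/>
    <w:tr>
      <!-- If the lookup lists column 1 of this row, start the row
           with an empty cell that continues the vertical merge -->
      <xsl:if test="contains($vMergeLookUp, concat('|1:', $rowPos, '|'))">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="0" w:type="auto"/>
            <w:vMerge/>
          </w:tcPr>
          <w:p/>
        </w:tc>
      </xsl:if>
      <xsl:apply-templates select="td | th"/>
    </w:tr>
  </xsl:template>

  <!-- Simplified cell template (no header formatting) -->
  <xsl:template match="td | th">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="0" w:type="auto"/>
        <xsl:if test="@colspan &gt; 1"><w:gridSpan w:val="{@colspan}"/></xsl:if>
        <xsl:if test="@rowspan &gt; 1"><w:vMerge w:val="restart"/></xsl:if>
      </w:tcPr>
      <w:p><w:r><w:t><xsl:value-of select="."/></w:t></w:r></w:p>
    </w:tc>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the merged example table, this should start rows 2 to 4 of the body with a continuation cell, so each row again covers all four grid columns.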

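And here’s the code for the td | th template (note the parameters and position calculation at the top, and the block at the bottom that inserts a vMerge cell after the current cell):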

[Code listing: the td | th template.]

Here is the complete code, in case you are dealing with this same issue (which you must be or you never would have made it this far):

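[Code listing: the complete stylesheet, including the called templates that build vMergeLookUp. Each entry is assembled from the literals “|”, “:”, and “|” around the cell and row positions.]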

I had a tough time with this one. If anyone has a simpler way of tackling this problem, I’d love to hear about it.