Creating MS Word reports with Java

If your application stores a lot of user content, offering some kind of template mechanism that your customers can use to create reports seems like an obvious idea. The user gets a printable, styled and easy-to-process version of the interesting information out of your system, designed by himself and filled with data with just one click. Of course, creating and styling those templates should be easy enough for your customers to do it themselves… so, what about using the well-known MS Word application for it? The user creates a document with his beloved office suite from Redmond, containing formatted text, images and charts, and finally adds some placeholders. Afterwards he passes this document to your application and, whenever he wants to create a report, you magically fill the placeholders with the actual data from your system. Reporting can be so easy… if you can find a way to manipulate the Word document correctly with the tools your platform offers!

For our ECM software REWOO Scope we also decided to add such Word template support, but what seemed like a small feature turned out to be much more work than expected, because we didn’t find any working example of doing this job with a Java / Groovy solution on the web. Maybe we can improve the situation for others by presenting our solutions here.

POI doesn’t rock for classic doc

The first thing you have to decide is which Word file format you want to support. We started with the classic binary doc format (also called the “MS Word 97 – 2003” format) and tried to use the Apache POI Scratchpad for this. As long as you only define plain text placeholders in a certain format (e.g. ${YOUR_PLACEHOLDER}), the task seems pretty easy to accomplish:

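import java.util.regex.Matcher
import java.util.regex.Pattern

import org.apache.log4j.Logger // assumption: log4j 1.x, which matches Logger.getLogger(Class)
import org.apache.poi.hwpf.HWPFDocument
import org.apache.poi.hwpf.usermodel.Range
import org.apache.poi.poifs.filesystem.POIFSFileSystem

/**
 * This class can be used to create a new Word file out of an existing template.
 * Placeholders must have the format ${ENTRY_NAME} and must be inserted as plain text into a classic "doc" file.
 */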
class ClassicWordTemplateRenderer {
    private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile('\\$\\{([^\\}]*)\\}');
    private final Logger logger = Logger.getLogger(ClassicWordTemplateRenderer.class)

    public byte[] execute(File templateFile, SomeDAOClass someDAO) {
        // Temporarily created file input stream will be closed inside POIFSFileSystem
        POIFSFileSystem fileSystem = new POIFSFileSystem(new FileInputStream(templateFile))
        HWPFDocument template = new HWPFDocument(fileSystem)

        List placeholders = findPlaceholders(template)
        if (placeholders.size() == 0) {
            // Just copy the file if there are no placeholders in it
            return org.apache.commons.io.FileUtils.readFileToByteArray(templateFile)
        }

        ByteArrayInputStream templateCopy = copyTemplateToMemory(templateFile)
        fileSystem = new POIFSFileSystem(templateCopy)
        HWPFDocument targetDocument = new HWPFDocument(fileSystem)
        Range range = targetDocument.getRange()

        Map placeholdersToValues = mapPlaceholdersToValues(placeholders, someDAO)
        for (Map.Entry placeholderToValue : placeholdersToValues) {
            range.replaceText(placeholderToValue.key, placeholderToValue.value)
        }

        ByteArrayOutputStream outputStream = null
        try {
            outputStream = new ByteArrayOutputStream()
            targetDocument.write(outputStream)
        } finally {
            outputStream.close()
        }
        return outputStream.toByteArray()
    }

    private Map mapPlaceholdersToValues(List placeholders, SomeDAOClass someDAO) {
        Map result = [:]
        for (String placeholder : placeholders) {
            // Trim the leading ${ and the trailing } characters
            String barePlaceholder = placeholder[2..-2]
            String value = someDAO.getValue(barePlaceholder)
            if (value == null) {
                logger.info("Did not find any value for placeholder ${placeholder}. Skip it!")
            } else {
                // Use the stored value as it is; apply your own formatting here if needed
                result[placeholder] = value
            }
        }
        return result
    }

    private List findPlaceholders(HWPFDocument document) {
        org.apache.poi.hwpf.usermodel.Range range = document.getRange();
        String documentContent = range.text()

        Matcher matcher = PLACEHOLDER_PATTERN.matcher(documentContent)
        List placeholders = []
        while (matcher.find()) {
            placeholders.add(matcher.group(0))
        }
        return placeholders
    }

    private ByteArrayInputStream copyTemplateToMemory(File templateFile) {
        return new ByteArrayInputStream(org.apache.commons.io.FileUtils.readFileToByteArray(templateFile))
    }
}

After opening the Word file, we search for all placeholders, collect our internally stored values for them and use the built-in replace functionality of POI’s Range object. The downside of this solution appears when you pass more complex documents (e.g. ones containing images) to the execute method. In this case, the rendered output file will not contain any of the images of the source file. This is because POI must create a new file and doesn’t support adding pictures yet. Of course, everything else that POI does not support will also be missing from the file.
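
The search-and-replace idea itself can be tried out without POI. The following is only a minimal sketch in plain Java that works on a String instead of a Range; the class and method names (PlaceholderDemo, findPlaceholderNames, fill) are made up for illustration and are not part of our renderer. Like the renderer above, it leaves placeholders without a value untouched.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
final class PlaceholderDemo {
    // Same idea as PLACEHOLDER_PATTERN above: "${", any characters except "}", then "}"
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]*)\\}");

    /** Collects the bare placeholder names (without "${" and "}") in document order. */
    static Set<String> findPlaceholderNames(String text) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = PLACEHOLDER.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    /** Replaces every placeholder that has a value; unknown ones are left as they are. */
    static String fill(String text, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            // quoteReplacement keeps "$" and "\" in the value from being read as group references
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String template = "Dear ${NAME}, your order ${ORDER_ID} ships on ${DATE}.";
        // prints [NAME, ORDER_ID, DATE]
        System.out.println(findPlaceholderNames(template));
        // prints Dear Ada, your order 4711 ships on ${DATE}.
        System.out.println(fill(template, Map.of("NAME", "Ada", "ORDER_ID", "4711")));
    }
}
```

The sketch needs Java 9 or newer (Map.of and appendReplacement with a StringBuilder). With POI you would hand the found placeholders and their values to Range.replaceText instead of building the result string yourself.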

docx4j makes docx okay

So, this solution did not satisfy us: when users have the possibility to pass a self-designed Word file to your application, you can be pretty sure that they will add most of the fancy stuff the WYSIWYG interface offers them. So we looked for another solution. After some unsuccessful attempts with the docx API of Apache POI (we did not find a proper way to replace text without screwing up the output file) we finally ended up with a library called docx4j. It covers the new WordML format exclusively, so it can be used for docx files only. The API offers a utility class called MailMerger (which has nothing to do with e-mails, but with the classic mail merge functionality of a word processor), which can be used to replace placeholder fields added to the Word document (via “Insert” -> “Quick parts” -> “Field” -> “Merge Field”). This leads us to the following solution:

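import java.util.regex.Matcher
import java.util.regex.Pattern

import org.apache.log4j.Logger // assumption: log4j 1.x, which matches Logger.getLogger(Class)
import org.docx4j.model.fields.FieldsPreprocessor
import org.docx4j.model.fields.merge.DataFieldName
import org.docx4j.model.fields.merge.MailMerger
import org.docx4j.model.structure.DocumentModel
import org.docx4j.model.structure.SectionWrapper
import org.docx4j.openpackaging.io.LoadFromZipNG
import org.docx4j.openpackaging.io.SaveToZipFile
import org.docx4j.openpackaging.packages.WordprocessingMLPackage
import org.docx4j.openpackaging.parts.JaxbXmlPart
import org.docx4j.openpackaging.parts.JaxbXmlPartXPathAware
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart
import org.docx4j.wml.CTRel
import org.docx4j.wml.CTSimpleField
import org.docx4j.wml.SectPr
import org.docx4j.wml.Text

class XMLWordTemplateRenderer {
    // Merge fields can either be inside w:fldSimple tags or can be "complex fields" (with a special rsid definition).
    // In the latter case the field definition should be within an "instrText" tag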
    private static final String XPATH = '//w:instrText[starts-with(text(),\" MERGEFIELD \")] | //w:fldSimple[starts-with(@w:instr, \" MERGEFIELD \")]';
    private static final Pattern MERGEFIELD_PATTERN = Pattern.compile(' MERGEFIELD ([^\\\\]*)');

    private final Logger logger = Logger.getLogger(XMLWordTemplateRenderer.class)

    public byte[] execute(File templateFile, SomeDAOClass someDAO) {
        LoadFromZipNG loader = new LoadFromZipNG();
        WordprocessingMLPackage wordprocessingMLPackage = (WordprocessingMLPackage)loader.get(new FileInputStream(templateFile));
        MainDocumentPart documentPart = wordprocessingMLPackage.getMainDocumentPart();

        List allMergeFieldNames = []
        allMergeFieldNames.addAll(detectMergeFieldsWithinHeaderAndFooter(documentPart, wordprocessingMLPackage.getDocumentModel()))
        allMergeFieldNames.addAll(detectMainDocumentMergeFields(documentPart))
        Map<DataFieldName, String> mergeFieldsToValues = [:]
        for (String mergeFieldName : allMergeFieldNames) {
            String value = someDAO.getValue(mergeFieldName.trim())
            if (value) {
                mergeFieldsToValues.put(new DataFieldName(mergeFieldName), value)
            } else {
                mergeFieldsToValues.put(new DataFieldName(mergeFieldName), "UNDEFINED!")
            }
        }
        MailMerger.performMerge(wordprocessingMLPackage, mergeFieldsToValues, true)

        SaveToZipFile saver = new SaveToZipFile(wordprocessingMLPackage)
        ByteArrayOutputStream result = new ByteArrayOutputStream()
        try {
            saver.save(result)
        } finally {
            result.close()
        }
        return result.toByteArray()
    }

    private List detectMainDocumentMergeFields(MainDocumentPart mainDocumentPart) {
        List mergeFieldNames = []

        for (Object jaxbNode : mainDocumentPart.getJAXBNodesViaXPath(XPATH, false)) {
            String mergeFieldName = detectMergeFieldName(jaxbNode)
            if (mergeFieldName) {
                mergeFieldNames.add(mergeFieldName)
            }
        }
        return mergeFieldNames;
    }

    private List detectMergeFieldsWithinHeaderAndFooter(MainDocumentPart mainDocumentPart, DocumentModel documentModel) {
        if (documentModel.getSections().size() == 0) {
            return []
        }
        List mergeFieldNames = []

        SectionWrapper sectionWrapper = documentModel.getSections().get(0)
        SectPr sectPr = sectionWrapper.getSectPr()
        for (CTRel rel : sectPr.getEGHdrFtrReferences()) {
            String relId = rel.getId();
            JaxbXmlPart part = (JaxbXmlPart) mainDocumentPart.getRelationshipsPart().getPart(relId);
            FieldsPreprocessor.complexifyFields(part);
            if (part instanceof JaxbXmlPartXPathAware) {
                for (Object jaxbNode : part.getJAXBNodesViaXPath(XPATH, false)) {
                    String mergeFieldName = detectMergeFieldName(jaxbNode)
                    if (mergeFieldName) {
                        mergeFieldNames.add(mergeFieldName)
                    }
                }
            }
        }
        return mergeFieldNames
    }

    private String detectMergeFieldName(Object jaxbNode) {
        String mergeFieldName = null
        String textToParse = null
        if (jaxbNode instanceof Text) {
            textToParse = ((Text) jaxbNode).getValue()
        }
        if (jaxbNode instanceof CTSimpleField) {
            CTSimpleField simpleField = (CTSimpleField) jaxbNode
            textToParse = simpleField.getInstr()
        }
        if (textToParse != null) {
            Matcher matcher = MERGEFIELD_PATTERN.matcher(textToParse)
            if (matcher.find() && matcher.groupCount() == 1) {
                mergeFieldName = matcher.group(1)
                mergeFieldName = removeTrailingOrLeadingWhitespaces(mergeFieldName)
                mergeFieldName = handleWordsWithInnerWhitespaces(mergeFieldName)
            }
        }
        return mergeFieldName
    }

    private String removeTrailingOrLeadingWhitespaces(String mergeFieldName) {
        return mergeFieldName.trim()
    }

    private String handleWordsWithInnerWhitespaces(String mergeFieldName) {
        int words = mergeFieldName.split(" ").findAll { it.length() > 0 }.size()
        if (words == 0 || words == 1) {
            return mergeFieldName
        }
        // Remove leading / trailing "-character
        mergeFieldName = mergeFieldName.replaceFirst('\\"', '')
        mergeFieldName = mergeFieldName.reverse().replaceFirst('\\"', '').reverse()
        return mergeFieldName
    }
}

What exactly are we doing here? The first few lines are really just boilerplate code to extract the content of the WordML document (the docx file itself is just a zipped folder containing the many XML files and resources the document is made of). Afterwards we extract the merge fields from the document by using an XPath expression. Normally we would expect the API to offer some helper method to do exactly this job, but if there is any built-in possibility, we did not find it. So we decided to search for the merge fields manually. The problem is that there are two flavors of merge fields within the XML, and the expression must detect BOTH of them. The first and easy one is a tag called “simple field” and looks like this (for a merge field named SomeField):

<w:fldSimple w:instr=" MERGEFIELD SomeField \* MERGEFORMAT ">
<w:r>
	<w:rPr>
		<w:noProof/>
	</w:rPr>
	<w:t>«SomeField»</w:t>
</w:r>
</w:fldSimple>

The second one, called “complex field”, is just that: complex! But in the end, somewhere inside the huge tag cloud there is a little section like the following, which contains exactly the information we need:

<w:instrText xml:space="preserve"> MERGEFIELD  SomeField  \* MERGEFORMAT </w:instrText>
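
Both flavors carry the same kind of instruction string, so the name extraction can be tried in isolation. The following is only a simplified sketch in plain Java of what detectMergeFieldName does with that string; the class and method names (MergeFieldNameDemo, extractName) are hypothetical, and unlike the Groovy code above it strips one surrounding pair of double quotes whenever the name is quoted, not only when it consists of several words.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
final class MergeFieldNameDemo {
    // Same expression as MERGEFIELD_PATTERN above: everything after " MERGEFIELD "
    // up to the first backslash, i.e. up to the first switch such as "\* MERGEFORMAT"
    private static final Pattern MERGEFIELD = Pattern.compile(" MERGEFIELD ([^\\\\]*)");

    /** Returns the merge field name, or an empty Optional if this is not a merge field instruction. */
    static Optional<String> extractName(String instruction) {
        Matcher matcher = MERGEFIELD.matcher(instruction);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String name = matcher.group(1).trim();
        // Names containing spaces are wrapped in double quotes: strip one surrounding pair
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            name = name.substring(1, name.length() - 1);
        }
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    public static void main(String[] args) {
        // Instruction as found in the w:instr attribute of a simple field: prints SomeField
        System.out.println(extractName(" MERGEFIELD SomeField \\* MERGEFORMAT ").orElse("<none>"));
        // Instruction as found in a w:instrText element of a complex field: prints SomeField
        System.out.println(extractName(" MERGEFIELD  SomeField  \\* MERGEFORMAT ").orElse("<none>"));
        // Quoted name with an inner space and without a format switch: prints Some Field
        System.out.println(extractName(" MERGEFIELD \"Some Field\" ").orElse("<none>"));
        // A different field type is ignored: prints <none>
        System.out.println(extractName(" PAGE ").orElse("<none>"));
    }
}
```

Because the expression stops at the first backslash or at the end of the string, it works whether or not a format switch follows the field name.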

Afterwards we just add the detected placeholders and the associated values to a map and call the MailMerger class. This will create a new document, containing the values instead of the placeholders. Finally, we save this report into an OutputStream.
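
If you want to see the zipped XML structure mentioned above for yourself, the JDK is enough. The following is a small sketch in plain Java that lists the entries of a zip archive held in memory; the class and method names (DocxZipDemo, listEntries, sampleArchive) are made up for illustration, and the sample archive only imitates typical docx entry names, it is not a valid Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only
final class DocxZipDemo {
    /** Lists the entry names of a zip archive (such as a docx file) held in memory. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive with the given entry names and dummy XML content. */
    static byte[] sampleArchive(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Entry names chosen to resemble a docx file; a real file contains more parts
        byte[] archive = sampleArchive("[Content_Types].xml", "word/document.xml", "word/header1.xml");
        listEntries(archive).forEach(System.out::println);
    }
}
```

To inspect a real template, you could pass the result of java.nio.file.Files.readAllBytes for your docx file to listEntries instead of the sample archive.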

Some important information about docx4j version and the dependencies

You shouldn’t use the latest “release” version of docx4j for this solution, because it still has some bugs (especially regarding merge fields with spaces in their names). Instead, you should use the latest nightly build, which can be found here. Also make sure that you are not using the reference JAXB implementation from Sun / Oracle: every time we tried to apply our XPath expression, it crashed with an exception. So, use MOXy instead, which is included in EclipseLink. To use it, you also need to add docx4j-MOXy-JAXBContext-3.0.0.jar to your classpath, which you can also find in the docx4j download area.

Summary

Creating MS Word reports with Java / Groovy libraries is still not that easy. The classic MS Word support of Apache POI is just too weak to be used in production systems, and even its docx API does not appear to be that stable. But docx4j is in a pretty usable state, as long as you are willing to use a “nightly build” and to dig a bit into the XML the docx file is made of. What is your opinion about the MS Word support of POI, and what do you think about docx4j in general and the solutions we’ve presented here? We would be glad to hear from you!

7 thoughts on “Creating MS Word reports with Java”

  1. Be aware that the XML which describes the merge field may or may not contain a specific merge format, so “\* MERGEFORMAT” is optional. I’ve adapted the regexp to reflect this.

    • Thank you very much for the mention of Templater. I tried it out and it was exactly what I needed. Thanks!

  2. Pingback: Reporting Tips - Conclusions and Recommendations

    • Hello,

      be aware that this is Groovy-Code (not Java). The “it” parameter within the Closure passed to findAll is the implicit parameter (which stands for the “current” element while iterating over the collection). You don’t need to declare it.
