Improving performance of Jenkins builds

For building our software we use Jenkins CI as our continuous integration build server. Last week we ran into problems with the build duration. After analyzing the build process we found some starting points for improving the build performance.

Bottlenecks during compiling

The first starting point was the time spent compiling the source code. Before we added a new HDD, the overall build time was 11 minutes. Afterwards it increased to 15 minutes! Another build job, which only compiled an artifact without executing any tests, needed 3 minutes more than with the older configuration.
So we analyzed the complete build process step by step and found a bottleneck. Our software uses Grails during the build. While compiling the source code into Java bytecode, Jenkins writes into the .grails directory in its home directory. We moved the complete home directory onto an SSD and got the lost 3 minutes back.
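
A move like this could look roughly as follows – a minimal sketch, assuming the SSD is already mounted at /mnt/ssd, Jenkins runs as the user jenkins with its home directory at /var/lib/jenkins and is controlled by a SysV init script (all paths are examples, adjust them to your setup):

# stop Jenkins, copy its home directory onto the SSD and bind-mount it back
/etc/init.d/jenkins stop
rsync -a /var/lib/jenkins/ /mnt/ssd/jenkins-home/
mount --bind /mnt/ssd/jenkins-home /var/lib/jenkins
/etc/init.d/jenkins start
# add the bind mount to /etc/fstab to make it permanent:
# /mnt/ssd/jenkins-home  /var/lib/jenkins  none  bind  0  0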

Bottlenecks during testing

Another bottleneck is the working directory in Jenkins called “workspace”. Each job gets its own workspace. While generating the artifacts, the complete source code repository is checked out into this workspace. Every test result, compiled class and the artifact itself are buffered in the workspace directory. After a successful build the artifacts and the test results are copied into the job's directory. If you can mount the workspace of a build job on an SSD, you will get noticeably better performance.
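
One simple way to do this – a sketch assuming the SSD is mounted at /mnt/ssd and the workspaces live under /var/lib/jenkins/workspace; my-job is a placeholder for the job name – is to replace the workspace directory with a symlink while the job is idle:

# the workspace only contains a fresh checkout and build output,
# so it can safely be recreated on the SSD while the job is idle
mkdir -p /mnt/ssd/workspace/my-job
rm -rf /var/lib/jenkins/workspace/my-job
ln -s /mnt/ssd/workspace/my-job /var/lib/jenkins/workspace/my-job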

The myth of SSD vs. tmpfs

Maybe you think it would be even better to mount the workspace, the .grails directory and any other heavily used directories on a tmpfs. We tried this option, and the result is: yes, it decreases the build time, but not as much as you might think. Mounting the workspace and the .grails directory on a tmpfs decreased the build time by at most 10 seconds compared to an SSD. The real benefit of mounting these directories on a tmpfs is the reduction of write cycles on the SSD, which increases its lifetime.
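
For comparison, a tmpfs mount for such a directory could look like the following /etc/fstab entry (size, paths and the ownership handling are examples; keep in mind that a tmpfs loses its content on every reboot):

# /etc/fstab – mount a job workspace on a tmpfs with a maximum size of 2 GB
tmpfs  /var/lib/jenkins/workspace/my-job  tmpfs  size=2g,mode=0755  0  0
# after mounting, hand the directory over to the jenkins user, e.g.:
# chown jenkins:jenkins /var/lib/jenkins/workspace/my-job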

Summary

It is possible to improve the duration of a Jenkins build. The first step is to find out in which directories Jenkins writes, deletes and copies files. In our example we improved the build time by 33% by moving the workspace and the other heavily used directories onto an SSD.
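
If you are not sure where Jenkins writes, deletes and copies files, watching the file system during a build is a quick way to find out. A sketch using inotifywait from the inotify-tools package (the Jenkins home path is an example):

# log every file event below the Jenkins home while a build is running
# (stop it with Ctrl-C once the build has finished)
inotifywait -m -r -e create,modify,delete --format '%w' /var/lib/jenkins >> /tmp/jenkins_io.log
# afterwards: show the most frequently touched directories
sort /tmp/jenkins_io.log | uniq -c | sort -rn | head
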
Another way to reduce the build time could be a faster CPU. This option would require a build server (maybe a Jenkins slave) with a desktop CPU, because the Intel Xeons are much slower than most high-end desktop CPUs. We have not tested this, but one of our developer workstations with an Intel i7 CPU only needs 80% of the time for building the same artifact.

Lampengeist

Our research and development department uses a little tool called “Lampengeist” to indicate whether a build on the CI server was successful or not. All we needed to implement this feature is the Jenkins plugin Hudson Post build task, a small shell script and, of course, a programmable power connector. We chose the Energenie EG-PMS-LAN by Gembird because it is manageable via the network.

Copy script

The script is very simple. It only needs the tools curl and logger. Attention: this script uses advanced scripting features of the Bash shell, so it may not work with other shells such as dash. Take the script below and copy it into the directory /usr/local/bin/ on your build server.

#!/bin/bash

# This script toggles a socket on
# "energenie LAN programmable power connector"

# hostname(s) or IP(s) and password(s) of the manageable power connector(s)
typeset -a connectors=( pms.host.lan )
typeset -a passwords=( p4ssw0rd )
typeset toggle="" state_dir="/var/tmp" lock_file="/var/lock/toggle_lamp"
typeset -i i=0 socket=1

typeset -r connectors passwords state_dir

while [ -e ${lock_file} ]; do
    if [ $i -gt 10 ]; then
        logger -p user.error -t `basename $0` -s -- "Could not execute - lock file detected."
        echo "Please contact administrator if problem exists for longer time." >&2
        exit 3
    fi
    i=`expr $i + 1`
    sleep 2
done

touch $lock_file

################# FUNCTIONS ###################

usage() {
    cat << EOF
You called ${0} with unsupported option(s).
Usage: ${0} [1|2|3|4] <on|off>
Numbers 1 to 4 stand for the socket number. If no socket is given, socket 1
will be toggled by default.
Please try again.
EOF
}

get_states() {
# get states of sockets
    if [ $# -ne 1 ]; then
        return 1
    else
        srv=$1
    fi
    states=( $(curl -f http://${srv}/status.html 2>/dev/null | sed -r "s/(.*)((ctl.*)([0|1]),([0|1]),([0|1]),([0|1]))(.*)/\4 \5 \6 \7/") )
}

toggle() {
    local server="" str_state=""
    local -i i=0 state sckt

    if [ $# -ne 3 ]; then
        return 1
    fi

    while [ $# -gt 0 ]; do
        case $1 in
            1|2|3|4)
                sckt=${1}
                shift
                ;;
            "on")
                str_state="${1}"
                state=1
                shift
                ;;
            "off")
                str_state="${1}"
                state=0
                shift
                ;;
            *)
                server="${1}"
                shift
                ;;
        esac
    done

    # poll status and toggle only if needed
    get_states ${server}
    if [ ${state} -ne ${states[$( expr ${sckt} - 1 )]} ]; then
        curl -f -d "ctl${sckt}=${state}" http://${server}/ &>/dev/null
        logger -p user.info -t `basename $0` -- "state of ${server} socket ${sckt} toggled ${str_state} by ${LOGNAME}"
    fi
}

persist() {
# for cron job use only
# saves state of sockets

    local state_file
    local -i i=0 j=0
    while [ ${i} -lt ${#connectors[*]} ]; do
        state_file=${state_dir}/${connectors[$i]}

        if (curl -f -d "pw=${passwords[$i]}" http://${connectors[$i]}/login.html 2>/dev/null | grep -q 'status.html'); then
            logger -p user.info -t `basename $0` -- "Save states of ${connectors[$i]} to file ${state_file}"
            # get states
            get_states ${connectors[$i]}
            echo "SavedStates=( ${states[*]} )" > ${state_file}

            j=0
            while [ $j -lt ${#states[*]} ]; do
                j=`expr ${j} + 1`
                toggle ${j} off ${connectors[$i]}
                sleep 1
            done

            curl -f http://${connectors[$i]}/login.html &>/dev/null
            logger -p user.info -t `basename $0` -- "States saved and all sockets switched off"
        else
            logger -p user.error -t `basename $0` -s -- "Login to ${connectors[$i]} failed."
        fi

        i=`expr ${i} + 1`
        typeset +r state_file
    done
}

recover() {
# recovers states from state file

    local state_file new_state
    local -a SavedStates
    local -i i=0 j=0

    i=0
    while [ ${i} -lt ${#connectors[*]} ]; do
        state_file=${state_dir}/${connectors[$i]}

        if [ -r ${state_file} ]; then

            source ${state_file}
            if (curl -f -d "pw=${passwords[$i]}" http://${connectors[$i]}/login.html 2>/dev/null | grep -q 'status.html'); then

                logger -p user.info -t `basename $0` -- "Restore socket states from ${state_file} to ${connectors[$i]}"
                j=0
                while [ ${j} -lt ${#SavedStates[*]} ]; do
                    if [ ${SavedStates[$j]} -eq 0 ]; then
                        new_state="off"
                    else
                        new_state="on"
                    fi
                    j=`expr ${j} + 1`
                    toggle ${j} ${new_state} ${connectors[$i]}
                    sleep 1
                done

                curl -f http://${connectors[$i]}/login.html &>/dev/null
                logger -p user.info -t `basename $0` -- "Socket states restored and switched on if needed."
            else
                logger -p user.error -t `basename $0` -s -- "Login to ${connectors[$i]} failed."
            fi
            rm ${state_file}

        else
            logger -p user.error -t `basename $0` -s -- "Could not read file ${state_file}"
        fi

        i=`expr ${i} + 1`
    done
}

common() {
# common mode

local -i i=0

while  [ ${i} -lt ${#connectors[*]} ]; do
    state_file=${state_dir}/${connectors[$i]}
    if [ -e ${state_file} ]; then
        # state file exists -> do not toggle live, change it in the state file only
        if [ ${new_state} = "on" ]; then
            new_state=1
        elif [ ${new_state} = "off" ]; then
            new_state=0
        fi
        socket=`expr ${socket} - 1`

        source $state_file
        if [ ${SavedStates[${socket}]} -ne ${new_state} ]; then
            SavedStates[${socket}]=$new_state
            echo "SavedStates=( ${SavedStates[*]} )" > ${state_file}
            logger -p user.info -t `basename $0` -- "Toggled state of socket ${socket} to ${new_state} in state file by ${LOGNAME}"
        fi

    else
        if (curl -f -d "pw=${passwords[$i]}" http://${connectors[$i]}/login.html 2>/dev/null | grep -q 'status.html'); then
            toggle ${socket} ${new_state} ${connectors[$i]}
#            curl -f -d "ctl${socket}=${new_state}" http://${connectors[$i]}/ &>/dev/null
            curl -f http://${connectors[$i]}/login.html &>/dev/null
        else
            logger -p user.error -t `basename $0` -s -- "Login to ${connectors[$i]} failed."
        fi
    fi
    i=$( expr $i + 1 )
done

}
############# END FUNCTIONS ##################

typeset -r curl_bin="$(which curl | head -n 1)"

if [ -z "${curl_bin}" ]; then
    echo "Tool curl not found. Please install it."
    exit 1
fi

if [ $# -lt 1 ]; then
    echo "No action provided. What should I do?"
    exit 1
fi

while [ $# -ge 1 ]; do
    case ${1} in
        "on"|"off")
            new_state=${1}
            mode="common"
            ;;
        1|2|3|4)
            socket=${1}
            mode="common"
            ;;
        "-r"|"--recover")
            mode="recover"
            ;;
        "-s"|"--save")
            mode="save"
            ;;
        *)
            usage
            rm $lock_file && exit 2
            ;;
    esac

    shift
done

case ${mode} in
    "recover")
        recover
        ;;
    "save")
        persist
        ;;
    "common")
        common
        ;;
esac

rm $lock_file && exit 0 || exit 1

You can name the file however you want, but the file name of the script is used as the “tag” by the logger utility. This means the file name appears in the 4th field of the syslog file. The name of the user executing the script is posted to syslog, too, and finally the action – toggle socket X on|off – is posted as well. Messages like these will appear in your system log file:

Mar 19 15:00:13 hostname toggle_lamp: state of pms.host.lan socket 1 toggled on by jenkins
Mar 19 15:19:08 hostname toggle_lamp: state of pms.host.lan socket 1 toggled off by jenkins

After copying the script onto the build server, you should make it executable (for the jenkins user). Go to /usr/local/bin and execute chmod 755 toggle_lamp. Then edit the file and change pms.host.lan and p4ssw0rd to the hostname or IP address and the password of your manageable power connector.

Configure post build task

Now you can configure your project to switch sockets on or off. Go to the Jenkins start page and choose the project/job which should toggle your lamp. Click the link Configure and scroll to the end of the configuration page. In the section Post-Build-Actions you will find a new option called Post build task. Activate this option and it will expand. The Hudson post build task plugin scans the log file generated by Jenkins.

In our environment we scan the build log for ABORTED, BUILD SUCCESSFUL and BUILD FAILED. If the build finished successfully, we call /usr/local/bin/toggle_lamp 1 off to switch off the lamp, and we call /usr/local/bin/toggle_lamp on if the build failed or was aborted. Once you have defined your tasks, scroll to the end of the page and click the button Save.
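
Our post build tasks roughly look like this (the plugin's field labels may vary slightly between versions; the script falls back to socket 1 when no socket number is given):

Log text: BUILD SUCCESSFUL    Script: /usr/local/bin/toggle_lamp 1 off
Log text: BUILD FAILED        Script: /usr/local/bin/toggle_lamp on
Log text: ABORTED             Script: /usr/local/bin/toggle_lamp on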

Power-saving cron jobs

The script posted above implements an additional feature. You can create a cron job which reads the state of all sockets of your manageable power connector, persists the state in a file and then switches off all sockets. In the morning a second cron job reads the state file and restores the state of the sockets. The user calling the script via cron needs write permissions (create and delete files) on the directory /var/tmp/.

Examples:

# save state of manageable power connector and switch off all sockets
30 20 * * 1-5	test -x /usr/local/bin/toggle_lamp && /usr/local/bin/toggle_lamp --save
# restore state of manageable power connector
30 06 * * 1-5	test -x /usr/local/bin/toggle_lamp && /usr/local/bin/toggle_lamp --recover

The cron jobs above save the states to the file at 20:30 every day from Monday to Friday and then switch all sockets off; on Saturday and Sunday the sockets simply stay off. The saved states are restored to the power connector sockets at 06:30 in the morning.

Executing Grails tests in parallel

As our Grails backend continues to grow, our test count increases, too. With the build times on our Hudson continuous integration (CI) server reaching around 13 minutes, we decided to spend some resources on making the builds faster. Some speedup was achieved by upgrading the hardware and optimizing specific integration test and production code that slowed the builds down. Another big chunk of build time was cut off by upgrading Grails from 1.3.2 to 1.3.5, which provided a speedup of approximately 25%. All these measures together brought the build time down to around 9 minutes. The speedup was nice, but the feedback cycle still felt too long for us. The integration tests remained the biggest factor, at around 6 minutes.

So we had a look into how to run the integration tests in parallel instead of leaving some cores of the build server idle. Ted Naleid provided a nice starting point for our efforts, but since our build is based on Apache Ant and we have a big runtime discrepancy between our unit and integration tests, his approach was not enough for us.

We first translated his solution to our build.xml and then separated the integration tests into two groups with similar runtime. We start the unit tests, integration test group 1 and integration test group 2 in parallel. This currently means three compiles into different directories plus parallel test execution. On our quad core server the build time is significantly improved despite the extra work; the integration tests are heavily CPU bound in our case. We moved some static code analysis and test coverage reports to a nightly build to improve the feedback cycle even further. The end result is a build time of around 6-7 minutes, which may not be lightning fast but is still way better than the 13 minutes before our efforts. See the relevant Ant targets below.

<target name="-call-grails-with-testdir">
  <chmod file="${grails}" perm="u+x"/>
  <property name="target.dir" value="target/test-${testdir}" />
  <exec dir="${basedir}" executable="${grails}" failonerror="true">
    <arg value="-Dgrails.work.dir=${target.dir}"/>
    <arg value="-Dgrails.project.class.dir=${target.dir}/classes"/>
    <arg value="-Dgrails.project.test.class.dir=${target.dir}/test-classes"/>
    <arg value="-Dgrails.project.test.reports.dir=target/test-reports"/>
    <arg value="${grails.task}"/>
    <arg value="${task.param}"/>
    <env key="GRAILS_HOME" value="${grails.home}"/>
  </exec>
</target>

<target name="parallel_test" description="--> Run tests in parallel">
  <path id="slowtests.path">
    <fileset dir="${basedir}/test/integration">
      <include name="SlowTests.groovy" />
      <include name="AnotherSlowTest.groovy" />
      <include name="MoreSlowTests.groovy" />
    </fileset>
  </path>
  <path id="normal_tests.path">
    <fileset dir="${basedir}/test/integration">
      <include name="**/*Tests.groovy" />
      <include name="**/*Test.groovy" />
      <exclude name="SlowTests.groovy" />
      <exclude name="AnotherSlowTest.groovy" />
      <exclude name="MoreSlowTests.groovy" />
    </fileset>
  </path>
  <pathconvert property="slowtests" pathsep=" ">
    <path refid="slowtests.path" />
    <chainedmapper>
      <flattenmapper />
      <compositemapper>
        <globmapper from="*Test.groovy" to="*" />
        <globmapper from="*Tests.groovy" to="*" />
      </compositemapper>
    </chainedmapper>
  </pathconvert>
  <pathconvert property="normaltests" pathsep=" ">
    <path refid="normal_tests.path" />
    <chainedmapper>
      <flattenmapper />
      <compositemapper>
        <globmapper from="*Tests.groovy" to="*" />
        <globmapper from="*Test.groovy" to="*" />
      </compositemapper>
    </chainedmapper>
  </pathconvert>
  <parallel>
    <antcall target="-call-grails-with-testdir">
      <param name="grails.task" value="test-app"/>
      <param name="task.param" value="-unit -xml" />
      <param name="testdir" value="unit" />
    </antcall>
    <antcall target="-call-grails-with-testdir">
      <param name="grails.task" value="test-app"/>
      <param name="task.param" value="-integration -xml ${slowtests}" />
      <param name="testdir" value="integration" />
    </antcall>
    <antcall target="-call-grails-with-testdir">
      <param name="grails.task" value="test-app"/>
      <param name="task.param" value="-integration -xml ${normaltests}" />
      <param name="testdir" value="integration2" />
    </antcall>
  </parallel>
</target>
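
The parallel target can then be called from the CI job like any other Ant target, for example (the build file name is just an assumption):

ant -f build.xml parallel_test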

Conclusion

Although Grails does not provide native support for parallel execution of tests (especially integration tests), it can pay off to do it manually using your build system. In our case it chopped away a good part of our build time and brought the test execution duration down from around 6 to 4 minutes, in spite of the extra compilation runs. We may refine our approach further, but we hope that Grails will ultimately support parallel test execution natively sometime soon.