v1.0, Sep 2013

Abstract

In January of 2013, it was reported in Launchpad that "staff client patron search results continuously eats memory".
[https://bugs.launchpad.net/evergreen/+bug/1110817]
The report was triaged and given a High importance, but was not further acted upon. To observe the problem, you simply execute a patron search and then let the staff client idle. Memory usage will slowly increase. But so far we’ve found that the memory consumption may not be unbounded. We have observed a cyclical pattern where memory usage increases and decreases over a period of roughly 4 hours. It is not yet clear if any memory is actually being leaked during these cycles.

Plan

Given the unpredictable nature of troubleshooting and debugging software, and memory leaks in particular, this high-level plan of attack was developed to cover the process for this project:

  1. use AutoHotKey
    [http://www.autohotkey.com/]
    or similar tool to construct a test sequence of mouse and keyboard clicks for a reproducible scenario

  2. include use of ProcessExplorer
    [http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx]
    and Irfanview
    [http://www.irfanview.com/]
    in this scenario for snapshotting memory usage

  3. use this sequence for an initial set of measurements prior to any changes to the staff client

  4. start a process of elimination by removing wholesale whole features/widgets/behavior from the UI but in a way that will still allow the test sequence to run (for example, start with the actual list widget itself)

    1. determine whether the code removed affects the amount of memory leaked

    2. if not, restore the code and try a different area. if so, restore the code and then remove a smaller more targeted part of it.

    3. Repeat step 4 as needed until a small enough section of code is identified that can be changed. If such code cannot be found, we’ll need to revise the test sequence and test the parts of the UI that we were previously relying on (the input fields and the search button). We may need to add new code to supply the input data if we remove the input widgets.

  5. similar to step 4, but instead of eliminating code, we’re tweaking code with alternative implementations

    1. determine whether the code changed affects the amount of memory leaked

    2. repeat as necessary. If necessary, back-track and re-implement larger sections of code. The worst case scenario would be essentially re-implementing the entire interface, which should only be necessary if the underlying problem has multiple variables that are interrelated (like the multiple inputs on helicopter controls, versus the relative simplicity of fixed-wing aircraft)

Execution

Prerequisites

We installed the following applications on a virtual image of Windows XP, service pack 3:

For the staff client, we registered it against a server running a matching version of master with stock test data, and logged in to set the default (sticky) window sizes to something less than fullscreen for better screenshots. With Irfanview, we made sure the defaults for screen capture included Control+F11 as the hot key.

AutoHotKey

The following script was produced for AutoHotKey. It makes certain assumptions that will need to be adjusted if moved to a different environment. In particular, the window title for the staff client main window will include the workstation name, and the window title for Process Explorer will include the machine name and the Windows user. The script also references full paths for the location of the programs it invokes, and it references a particular location for Irfanview to use when saving screenshots. Login credentials for the Evergreen server are also encoded into the script. The script also assumes that Wordpad is running with an untitled/unsaved document, serving as a textual preface for a given series of screenshots.

RecordMemory(title)
{
    WinActivate %title%
    WinWaitActive %title%
    Sleep 5000
    Send ^{F11}
    Sleep 5000
    WinActivate Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx]
    WinWaitActive Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx]
    Sleep 5000
    Send ^{F11}
    Sleep 5000
    WinActivate %title%
    WinWaitActive %title%
    Sleep 5000
}

#space::
Run "C:\Program Files\IrfanView\i_view32.exe"
WinWaitActive IrfanView
Send c!3!s{Tab}{Tab}{Tab}E:\gpls\{Tab}{Tab}bppppp{Enter}
Run "C:\Documents and Settings\phasefx\My Documents\Downloads\ProcessExplorer\procexp.exe"
WinWaitActive Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx]
Send {Down}{Left}
RecordMemory("Document - WordPad")
Run "C:\Program Files\Evergreen Staff Client Master\evergreen.exe"
WinWaitActive Evergreen Staff Client - 0_new331
RecordMemory("Evergreen Staff Client - 0_new331")
Send !uadmin{Enter}demo123{Enter}
WinWait 1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com
RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com")
Send {F4}
RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com")
Send Smith{Enter}
RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com")
Loop 60 {
    Sleep 60000
    RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com")
}
MsgBox Finished
return

The above snippet was added to the default script for AutoHotKey, and is invoked by pressing the Windows Key and the spacebar together. A longer running version of this script adapted for a different environment using the same prerequisites produced 1210 screenshots, found here:

The Data

For any given process, Process Explorer reports on two memory values that we are interested in: Working Set and Virtual Size. Virtual Size is the size of the virtual address space the process has allocated for use. This memory may be found in RAM or on disk, and the process may not actually be using all of it. Working Set refers to a subset of a process' virtual address space that is physically resident in RAM. Some of the working set memory may actually be shared between processes, for example, if it’s holding shared libraries (DLL’s). Our AutoHotKey script periodically takes snapshots of Process Explorer, and it’s the evergreen.exe process being listed and its columns for Working Set and Virtual Size that we are looking at.

Here’s a compressed timeline of events:

Time Working Set Virtual Size Comments

5:26pm

77,780 K

167,104 K

Patron Search interface loaded but no search entered.

5:26pm

81,976 K

167,788 K

Search for "Smith" executed.

5:27pm

82,136 K

166,764 K

Idle.

5:29pm

82,504 K

166,764 K

Idle.

5:30pm

82,976 K

165,740 K

Idle.

5:45pm

87,652 K

162,668 K

Idle.

6:00pm

92,232 K

162,668 K

Idle.

6:15pm

96,256 K

162,668 K

Idle.

6:30pm

101,968 K

165,740 K

Idle.

6:45pm

105,856 K

169,836 K

Idle.

7:00pm

110,900 K

174,956 K

Idle.

7:13pm

115,372 K

180,076 K

Recorded peak of Working Set. We’re snapshotting roughly every minute, so it may have went above this before dropping.

7:15pm

82,156 K

169,836 K

Drop of over 30,000 K; but don’t know if it was all at once or a steady drop within the minute between recordings.

7:16pm

80,960 K

164,716 K

Recorded bottom of Working Set for this dip.

7:18pm

81,436 K

164,716 K

Idle.

7:29pm

85,244 K

164,716 K

Idle.

7:45pm

90,480 K

164,716 K

Idle.

8:01pm

96,312 K

164,716 K

Idle.

8:15pm

100,052 K

164,716 K

Idle.

8:31pm

105,032 K

169,836 K

Idle.

8:45pm

109,700 K

173,932 K

Idle.

9:04pm

116,740 K

181,100 K

Peak.

9:05pm

82,328 K

165,740 K

Bottom.

10:52pm

115,576 K

180,076 K

Peak.

10:54pm

82,672 K

167,788 K

Bottom.

12:41am

115,500 K

180,076 K

Peak.

12:44am

80,972 K

165,740 K

Bottom.

2:30am

115,496 K

180,076 K

Peak.

2:33am

82,792 K

166,764 K

Bottom.

4:18am

115,600 K

180,076 K

Peak.

4:21am

81,080 K

165,740 K

Bottom.

6:07am

115,464 K

180,076 K

Peak.

6:09am

81,768 K

166,764 K

Bottom.

7:44am

111,472 K

175,980 K

End of experiment.

Preliminary Analysis

The peaks and valleys for this memory usage pattern do not appear to be trending upwards. I suspect it’s coming from something deeply internal to xulrunner dealing with garbage collection. Because of this and the likelihood that the test scenario as constructed does not happen often in real life, we don’t recommend that we put effort into disrupting or mitigating this pattern.

It may be productive, however, to use this method to look for memory leaks under different test scenarios. We have the beginnings of a UI regression and performance testing suite here, and could, for example, create tools to compare the UI screenshots between runs. These screenshots would as a matter of course change as Evergreen and Windows evolves, and the tests will need manual intervention to set a new baseline for expected/correct behavior whenever this occurs, but between these disruptions image diffing could help us find discrepencies in UI behavior and performance. Finding a command line tool for capturing process information to use instead of Process Explorer could allow for memory usage comparisons between runs.