Patron Search Memory Fix Project ================================ Jason Etheridge Equinox Software Inc. v1.0, Sep 2013 Abstract -------- In January of 2013, it was reported in Launchpad that "staff client patron search results continuously eats memory". footnote:[https://bugs.launchpad.net/evergreen/+bug/1110817] The report was triaged and given a High importance, but was not further acted upon. To observe the problem, you simply execute a patron search and then let the staff client idle. Memory usage will slowly increase. But so far we've found that the memory consumption may not be _unbounded_. We have observed a cyclical pattern where memory usage increases and decreases over a period of roughly 4 hours. It is not yet clear if any memory is actually being _leaked_ during these cycles. Plan ---- Given the unpredictable nature of troubleshooting and debugging software, and memory leaks in particular, this high-level plan of attack was developed to cover the process for this project: 1. use AutoHotKey footnote:[http://www.autohotkey.com/] or similar tool to construct a test sequence of mouse and keyboard clicks for a reproducible scenario 2. include use of ProcessExplorer footnote:[http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx] and Irfanview footnote:[http://www.irfanview.com/] in this scenario for snapshotting memory usage 3. use this sequence for an initial set of measurements prior to any changes to the staff client 4. start a process of elimination by removing wholesale whole features/widgets/behavior from the UI but in a way that will still allow the test sequence to run (for example, start with the actual list widget itself) a. determine whether the code removed affects the amount of memory leaked b. if not, restore the code and try a different area. if so, restore the code and then remove a smaller more targeted part of it. c. Repeat step 4 as needed until a small enough section of code is identified that can be changed. If such code cannot be found, we’ll need to revise the test sequence and test the parts of the UI that we were previously relying on (the input fields and the search button). We may need to add new code to supply the input data if we remove the input widgets. 5. similar to step 4, but instead of eliminating code, we’re tweaking code with alternative implementations a. determine whether the code changed affects the amount of memory leaked b. repeat as necessary. If necessary, back-track and re-implement larger sections of code. The worst case scenario would be essentially re-implementing the entire interface, which should only be necessary if the underlying problem has multiple variables that are interrelated (like the multiple inputs on helicopter controls, versus the relative simplicity of fixed-wing aircraft) Execution --------- Prerequisites ~~~~~~~~~~~~~ We installed the following applications on a virtual image of Windows XP, service pack 3: * AutoHotKey - version 1.1.13.00 * Process Explorer - version 15.40 * Irfanview - version 4.36 * Evergreen Staff Client - built from master with tip 64a503f12cd1094636818d4dd5725d73905517bd footnote:[http://git.evergreen-ils.org/?p=Evergreen.git;a=commit;h=64a503f12cd1094636818d4dd5725d73905517bd] and running with xulrunner version 14.0.1.4577 For the staff client, we registered it against a server running a matching version of master with stock test data, and logged in to set the default (sticky) window sizes to something less than fullscreen for better screenshots. With Irfanview, we made sure the defaults for screen capture included Control+F11 as the hot key. AutoHotKey ~~~~~~~~~~ The following script was produced for AutoHotKey. It makes certain assumptions that will need to be adjusted if moved to a different environment. In particular, the window title for the staff client main window will include the workstation name, and the window title for Process Explorer will include the machine name and the Windows user. The script also references full paths for the location of the programs it invokes, and it references a particular location for Irfanview to use when saving screenshots. Login credentials for the Evergreen server are also encoded into the script. The script also assumes that Wordpad is running with an untitled/unsaved document, serving as a textual preface for a given series of screenshots. [source,Javascript] --------------------------------------------- RecordMemory(title) { WinActivate %title% WinWaitActive %title% Sleep 5000 Send ^{F11} Sleep 5000 WinActivate Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx] WinWaitActive Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx] Sleep 5000 Send ^{F11} Sleep 5000 WinActivate %title% WinWaitActive %title% Sleep 5000 } #space:: Run "C:\Program Files\IrfanView\i_view32.exe" WinWaitActive IrfanView Send c!3!s{Tab}{Tab}{Tab}E:\gpls\{Tab}{Tab}bppppp{Enter} Run "C:\Documents and Settings\phasefx\My Documents\Downloads\ProcessExplorer\procexp.exe" WinWaitActive Process Explorer - Sysinternals: www.sysinternals.com [WINXPVMFX\phasefx] Send {Down}{Left} RecordMemory("Document - WordPad") Run "C:\Program Files\Evergreen Staff Client Master\evergreen.exe" WinWaitActive Evergreen Staff Client - 0_new331 RecordMemory("Evergreen Staff Client - 0_new331") Send !uadmin{Enter}demo123{Enter} WinWait 1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com") Send {F4} RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com") Send Smith{Enter} RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com") Loop 60 { Sleep 60000 RecordMemory("1: admin@BR1-winxpvmfx.phasefx.evergreencatalog.com") } MsgBox Finished return --------------------------------------------- The above snippet was added to the default script for AutoHotKey, and is invoked by pressing the Windows Key and the spacebar together. A longer running version of this script adapted for a different environment using the same prerequisites produced 1210 screenshots, found here: http://nox.esilibrary.com/~jason/patron_search/screenshots/ The Data ~~~~~~~~ For any given process, Process Explorer reports on two memory values that we are interested in: _Working Set_ and _Virtual Size_. Virtual Size is the size of the virtual address space the process has allocated for use. This memory may be found in RAM or on disk, and the process may not actually be using all of it. Working Set refers to a subset of a process' virtual address space that is physically resident in RAM. Some of the working set memory may actually be shared between processes, for example, if it's holding shared libraries (DLL's). Our AutoHotKey script periodically takes snapshots of Process Explorer, and it's the evergreen.exe process being listed and its columns for Working Set and Virtual Size that we are looking at. Here's a compressed timeline of events: [width="50%",options="header"] |================================ |Time | Working Set | Virtual Size | Comments |5:26pm | 77,780 K | 167,104 K | Patron Search interface loaded but no search entered. |5:26pm | 81,976 K | 167,788 K | Search for "Smith" executed. |5:27pm | 82,136 K | 166,764 K | Idle. |5:29pm | 82,504 K | 166,764 K | Idle. |5:30pm | 82,976 K | 165,740 K | Idle. |5:45pm | 87,652 K | 162,668 K | Idle. |6:00pm | 92,232 K | 162,668 K | Idle. |6:15pm | 96,256 K | 162,668 K | Idle. |6:30pm | 101,968 K | 165,740 K | Idle. |6:45pm | 105,856 K | 169,836 K | Idle. |7:00pm | 110,900 K | 174,956 K | Idle. |7:13pm | 115,372 K | 180,076 K | Recorded _peak_ of Working Set. We're snapshotting roughly every minute, so it may have went above this before dropping. |7:15pm | 82,156 K | 169,836 K | Drop of over 30,000 K; but don't know if it was all at once or a steady drop within the minute between recordings. |7:16pm | 80,960 K | 164,716 K | Recorded _bottom_ of Working Set for this dip. |7:18pm | 81,436 K | 164,716 K | Idle. |7:29pm | 85,244 K | 164,716 K | Idle. |7:45pm | 90,480 K | 164,716 K | Idle. |8:01pm | 96,312 K | 164,716 K | Idle. |8:15pm | 100,052 K | 164,716 K | Idle. |8:31pm | 105,032 K | 169,836 K | Idle. |8:45pm | 109,700 K | 173,932 K | Idle. |9:04pm | 116,740 K | 181,100 K | _Peak_. |9:05pm | 82,328 K | 165,740 K | _Bottom_. |10:52pm | 115,576 K | 180,076 K | _Peak_. |10:54pm | 82,672 K | 167,788 K | _Bottom_. |12:41am | 115,500 K | 180,076 K | _Peak_. |12:44am | 80,972 K | 165,740 K | _Bottom_. |2:30am | 115,496 K | 180,076 K | _Peak_. |2:33am | 82,792 K | 166,764 K | _Bottom_. |4:18am | 115,600 K | 180,076 K | _Peak_. |4:21am | 81,080 K | 165,740 K | _Bottom_. |6:07am | 115,464 K | 180,076 K | _Peak_. |6:09am | 81,768 K | 166,764 K | _Bottom_. |7:44am | 111,472 K | 175,980 K | End of experiment. |================================ Preliminary Analysis ~~~~~~~~~~~~~~~~~~~~ The peaks and valleys for this memory usage pattern do not appear to be trending upwards. I suspect it's coming from something deeply internal to xulrunner dealing with garbage collection. Because of this and the likelihood that the test scenario as constructed does not happen often in real life, we don't recommend that we put effort into disrupting or mitigating this pattern. It may be productive, however, to use this method to look for memory leaks under different test scenarios. We have the beginnings of a UI regression and performance testing suite here, and could, for example, create tools to compare the UI screenshots between runs. These screenshots would as a matter of course change as Evergreen and Windows evolves, and the tests will need manual intervention to set a new baseline for expected/correct behavior whenever this occurs, but between these disruptions image diffing could help us find discrepencies in UI behavior and performance. Finding a command line tool for capturing process information to use instead of Process Explorer could allow for memory usage comparisons between runs.