Overview of the problem
Well, it seems whenever a lot of publishing is going on, python eventually SEGVs somewhere in garbage collection and/or heap allocation. Obviously something in the heap is corrupted. As of 13/12/05, I have tried:
1. Running with valgrind. No errors detected apart from a bunch of Addr4 errors in PyObject_Free() and PyObject_Realloc(), these are apparently normal. See http://pxr.openlook.org/pxr/source/Misc/README.valgrind for an explanation.
2. Building an Aug 30 2005 version of Timba. Things seemed stable then, but no longer.
3. Added tests of conversion on the kernel side, but nothing fails (apart from the regular valgrind errors mentioned in 1.)
4. Went through the entire OCTOPython code and cleaned up and commented all reference handling. Found a potential leak and a potential under-ref problem with None, but that didn't help. Commit of 13/12/05.
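A leak or under-ref of the kind hunted in step 4 can sometimes be spotted from the Python side by watching an object's reference count across repeated calls. The following is a minimal sketch, not the actual OCTOPython check: `refcount_drift`, `leaky` and `clean` are hypothetical names used only for illustration, and `sys.getrefcount` is specific to CPython. It probes an ordinary object rather than None, since recent CPython versions treat None as immortal.

```python
import sys


def refcount_drift(func, obj, calls=1000):
    """Return the change in obj's reference count after calling func(obj) repeatedly.

    A positive result suggests func keeps extra references (a possible leak).
    A negative result would point at an under-ref in extension code.
    """
    before = sys.getrefcount(obj)
    for _ in range(calls):
        func(obj)
    after = sys.getrefcount(obj)
    return after - before


# Hypothetical stand-ins for a conversion routine under test.
_kept = []


def leaky(value):
    _kept.append(value)  # holds one extra reference per call


def clean(value):
    return repr(value)  # uses the value but keeps no reference to it


if __name__ == "__main__":
    probe = object()
    print("leaky drift:", refcount_drift(leaky, probe, calls=100))
    print("clean drift:", refcount_drift(clean, probe, calls=100))
```

In practice `func` would be whatever Python-visible call exercises the conversion code; a drift that grows with `calls` is the signal to look for.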
13/12/05: Rebuilding the toolchain
As of 13/12/05: I'm going to rebuild the entire toolchain from scratch, with gcc-3.4. See ./RebuildingPythonNotes for config/build details.
13/12/05: didn't help. Will now rebuild Python with debugging support, and try to valgrind the problem.
13/12/05: crashes in sip. Going back to Qt-3.3.2; sip-4.1.1 and pyqt-3.13.
14/12/05: with the above setup the browser lasts longer but still falls over eventually. Will try to create a full-debug build of the toolchain (Qt-3.3.5, sip-4.3.2, pyqt-3.15.1):
- python configured with --with-pydebug --without-pymalloc --prefix=/usr/local/python-2.3.5-debug. Now, make sure 'python' is aliased to this interpreter, and LD_LIBRARY_PATH contains /usr/local/python-2.3.5-debug/lib.
Modified OCTOPython and PyApps/src/Makefile.am (in ~/alt/LOFAR) to INCLUDE the right Python.h (doing all this in the alt/LOFAR dir). Now, the browser can be run with debugging using
    $ export LD_LIBRARY_PATH=/usr/local/python-2.3.5-debug/lib
    $ export PYTHONPATH=~/alt/LOFAR/installed/symlinked/libexec/python
    $ /usr/local/python-2.3.5-debug/bin/python \
        ~/alt/LOFAR/installed/symlinked/bin/meqbrowser.py
This aborts when a kernel connects, see ./AbortLog. QCustomEvent() is up in the stack somewhere, will look into it later.
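With several interpreters installed side by side, it is easy to end up running the wrong one. A quick sanity check is to ask the interpreter itself what it is. This is a small sketch under one known fact: `sys.gettotalrefcount` exists only in CPython builds configured with --with-pydebug. The function name `interpreter_report` is made up for this example.

```python
import sys


def interpreter_report():
    """Collect a few facts identifying which interpreter is running."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": sys.version.split()[0],
        # sys.gettotalrefcount is present only in --with-pydebug builds.
        "pydebug": hasattr(sys, "gettotalrefcount"),
    }


if __name__ == "__main__":
    for key, value in sorted(interpreter_report().items()):
        print("%-10s %s" % (key, value))
```

If `pydebug` comes out False, or `prefix` is not the debug install directory, the alias or LD_LIBRARY_PATH setup is probably not taking effect.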
Now, let's try a debug build with Qt-3.3.2, sip-4.1.1 and PyQt-3.13.
- no more crashes on startup
Valgrind detects some errors in scintilla when loading a document, and when it has finished compiling a TDL script. See ./ScintillaError, Error 1 and 2. Mostly mismatched new/new [] and delete/delete [].
Further along, I finally see a SEGV in sip/PyQt: ./PyQtErrors. Unfortunately, being an older version of sip, this is not a bug I can report...
Next step: trying the same with QScintilla-1.65 built from source:
Run /usr/local/qt-x.y.z/bin/qmake qscintilla.pro to configure, then make && make install.
Reconfigured PyQt without the "-n" and "-o" options; it found the new qscintilla by itself.
Valgrind no longer reports any errors listed in ScintillaError, so that's good. Still waiting for that SEGV, gave up under valgrind. Will try a stress test instead.
OK, just completed a whole phase solution with a LOT of publishing and the browser survived. Will now install QScintilla-1.65 globally, reinstall PyQt, and see if it works with the current Debian-stable qt.
- No, there's still a SEGV (on a second run done straight after the first), but this time somewhere deep in std::~hashtable called from DMI::~Record. Appears to be a different bug entirely (NB: see below!).
Further steps: three things to explore.
1: Try a global install of the new qscintilla:
    ### build qscintilla
    # export QTDIR=/usr/share/qt3
    # cd /usr/local/src/qscintilla-1.65/qt
    # qmake qscintilla.pro
    # make clean && make -j8 && make install
    ### build sip
    # apt-get remove sip4 python2.3-sip4-{dev,qt3} python-sip4-dev
    # cd /usr/local/src/sip-4.3.2
    # python configure.py
    # make clean && make -j8 && make install
    ### build PyQt
    # cd /usr/local/src/PyQt-3.15.1
    # python configure.py
    # make clean && make -j8 && make install
    ### build PyQwt
    # cd /usr/local/src/PyQwt-4.2
    # python configure.py
    # make clean && make -j8 && make install
    ### add library dir to /etc/ld.so.conf
    # echo /usr/share/qt3/lib >> /etc/ld.so.conf
    # ldconfig
Stress test result: falls over quickly (after one or two minutes running a phase solution). Thing to do now:
2.1. Run browser with stock newer sip/qt against valgrind addrcheck with python-related suppressions enabled. Result: found error in python from within OCTOPython, see ./CreatePyObject! Damn, this appears to look the same as the original crash, a corrupt python heap. So much for that, giving up on the newer Qts and sips for now.
2.2. Reinstall older sip/pyqt (Qt-3.3.2, sip-4.1.1, PyQt-3.13, follow same procedure as above) and stress-test.
Result: SEGV in hashtable again! See stack here: ./HashTableSegv. May be a gcc-lib bug, but I think it's in CountedRef -- fixed now.
Stress-test again. Result: no, crashes in the same place again. On further consideration, there may be more subtle mt problems in ref -- see http://lofar9.astron.nl/bugzilla/show_bug.cgi?id=300 -- but I just don't see these scenarios arising here.
Next step: rebuild with the standard allocator and run with valgrind. Result: no errors, stress test successful. May indicate a bug in the g++ mt_alloc implementation. Will switch to standard allocator and try again, perhaps switch everything back to standard allocator full-time?
2: Run browser build against python-2.3.5-debug and older sip/pyqt with valgrind (addrcheck tool) and look where the heap is corrupted. Result: no crash yet, but it runs too slowly. If the bug in 2.2 is the cause, then valgrind may upset the thread timing so much that it will never hit the bug...
3: Run browser build against python-2.3.5-debug and newer sip/pyqt with valgrind, and try to figure out what the pyqt problem is, so that we can at least report it. Check if maybe it is only present with Debian's qt-3.3.4 and not qt-3.3.5??
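The valgrind runs in the steps above all follow the same pattern, so the command line can be assembled by a small helper. This is only a sketch: `valgrind_command` is a hypothetical helper, the suppressions file name assumes a local copy of the Misc/valgrind-python.supp file shipped with the Python source, and the script name is a placeholder. The tool is left as a parameter, defaulting to memcheck, since the tools available depend on the installed valgrind version. The helper only builds the argument list and does not run anything.

```python
def valgrind_command(python, script, suppressions=None, tool="memcheck", log_file=None):
    """Build an argv list that runs a Python script under valgrind."""
    cmd = ["valgrind", "--tool=" + tool]
    if suppressions:
        # Hides the known PyObject_Free/PyObject_Realloc reports.
        cmd.append("--suppressions=" + suppressions)
    if log_file:
        cmd.append("--log-file=" + log_file)
    cmd.extend([python, script])
    return cmd


if __name__ == "__main__":
    cmd = valgrind_command(
        "/usr/local/python-2.3.5-debug/bin/python",
        "meqbrowser.py",  # placeholder: use the full path to the script
        suppressions="valgrind-python.supp",  # assumed local copy
    )
    print(" ".join(cmd))
```

The resulting list can be passed to whatever process launcher is in use, with LD_LIBRARY_PATH and PYTHONPATH set as in the debug-run instructions.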
