My recent task was optimizing a parallel event driven system, called ContextLattice. I’ve done some decent job to make it processes events faster Specifically, the runtime performance went from ~10,000 events/s to ~200,000 events per second*, giving 20x improvement. That’s incredible looking back, so I’d like to share the experience around.
*: data obtained from a 8-core, Intel Xeon 2.0 Ghz machine.
So what did I do? General rule of thumb:
- Use gprof. Early optimization is the root of all evils. Never attempt to optimize something until you run it. gprof makes it clear what functions or data structures take up most of the execution time. There are other performance analyzing tools, but gprof is quite well known and I did not use others.
- Replace global locks with finer-grained ones. In a multi-thread program, a global lock is typically what constrains a system’s performance, and unfortunately, I had several global locks in the original system. If possible, replace a global lock by several non-global locks. Of course, that depends on the program’s semantics.
- Change data structures. Data structures impact performance a lot. Sometimes it’s hard to tell which data structure is the best (what’s the best container for std::queue? std::deque or std::list?**). Just experiment with all candidate data structures. I use C++’s standard library and boost library data structures for most part, which enable easy substitution.
- Use memory pool. If there are a lot of dynamic memory allocation/deallocation, you can reduce those allocation by recycle unused memory blocks. Boost has a memory pool implementation, but I ended up created one myself, because I’ve had no experience with the Boost’s memory pool. What I did was a simple FIFO queue of pointers, and that proves to be useful. I’d like to hear if any one has experience with Boost memory pool.
- Reduce the work in the non-parallelizable code. Ignore those functions that run in parallel with others, as they don’t impact performance a lot. Focus on those code protected by the global lock. Relocate the code to parallelizable functions if possible.
- Reduce unnecessary memory copy. ContextLattice came from Mace. I used Mace APIs for some implementation, but it turned out some of them did not fit very well and I had to make some special object constructors for special purposes to reduce memory copy.
**: 1. std::queue uses std::deque as the internal container by default. I found using std::list as the container is faster, as least in the situation I encountered.
2. Also, std::map is faster than std::tr1::unordered_map(which is a hash map), in the situation I encountered ( A map has only a few items, but there are many of these maps created) Similarily std::set is faster than std::tr1::unordered_set (which is a hash set).
3. std::map has a feature that the iterator iterates the structure in sorted order (for example, if the index type is integer, the iterator iterates from the item with the smallest index value), so you can actually use it as a priority queue. But std::priority_queue is still faster if you only want to access the front of the structure.
There are other optimization experiences specific to ContextLattice. They are also interesting but I can’t explain them without introducing the internals of the system. So I’ll probably find some time to talk about it in details.