Uploaded my 1200th changeset. This was 9 months after the 1100th changeset. I was mostly focused on defining the research problem, so not much actual code writing. In the 100 changeset, there were 111 files modified and around 4000 lines of new code.
I checked in the 1100th changest in mace-fullcontext this week. Most of the recent changesets were focusing on runtime optimization. The runtime can process nearly 250,000 events per second! This is 10 times faster than 6 month ago. Also, it can process nearly 100,000 network message events per second, which is twice than a month ago. (network message events involves network sockets and serialization, so the throughput is lower than non-network-message events)
My recent task was optimizing a parallel event driven system, called ContextLattice. I’ve done some decent job to make it processes events faster Specifically, the runtime performance went from ~10,000 events/s to ~200,000 events per second*, giving 20x improvement. That’s incredible looking back, so I’d like to share the experience around.
*: data obtained from a 8-core, Intel Xeon 2.0 Ghz machine.
So what did I do? General rule of thumb:
- Use gprof. Early optimization is the root of all evils. Never attempt to optimize something until you run it. gprof makes it clear what functions or data structures take up most of the execution time. There are other performance analyzing tools, but gprof is quite well known and I did not use others.
- Replace global locks with finer-grained ones. In a multi-thread program, a global lock is typically what constrains a system’s performance, and unfortunately, I had several global locks in the original system. If possible, replace a global lock by several non-global locks. Of course, that depends on the program’s semantics.
- Change data structures. Data structures impact performance a lot. Sometimes it’s hard to tell which data structure is the best (what’s the best container for std::queue? std::deque or std::list?**). Just experiment with all candidate data structures. I use C++’s standard library and boost library data structures for most part, which enable easy substitution.
- Use memory pool. If there are a lot of dynamic memory allocation/deallocation, you can reduce those allocation by recycle unused memory blocks. Boost has a memory pool implementation, but I ended up created one myself, because I’ve had no experience with the Boost’s memory pool. What I did was a simple FIFO queue of pointers, and that proves to be useful. I’d like to hear if any one has experience with Boost memory pool.
- Reduce the work in the non-parallelizable code. Ignore those functions that run in parallel with others, as they don’t impact performance a lot. Focus on those code protected by the global lock. Relocate the code to parallelizable functions if possible.
- Reduce unnecessary memory copy. ContextLattice came from Mace. I used Mace APIs for some implementation, but it turned out some of them did not fit very well and I had to make some special object constructors for special purposes to reduce memory copy.
**: 1. std::queue uses std::deque as the internal container by default. I found using std::list as the container is faster, as least in the situation I encountered.
2. Also, std::map is faster than std::tr1::unordered_map(which is a hash map), in the situation I encountered ( A map has only a few items, but there are many of these maps created) Similarily std::set is faster than std::tr1::unordered_set (which is a hash set).
3. std::map has a feature that the iterator iterates the structure in sorted order (for example, if the index type is integer, the iterator iterates from the item with the smallest index value), so you can actually use it as a priority queue. But std::priority_queue is still faster if you only want to access the front of the structure.
There are other optimization experiences specific to ContextLattice. They are also interesting but I can’t explain them without introducing the internals of the system. So I’ll probably find some time to talk about it in details.
Throughout the upper half of 2013, I was mainly working on two things: performance optimization and API.
Performance has been improved by nearly 20 fold since last September. We were able to diagnose the performance bugs in the original Mace code, and also found a better architecture of the fullcontext runtime. Maybe I would write a paper about the optimization some time.
I am recently working on a better fullcontext C++ API. The goal is, without impacting the performance, make the fullcontext API much easier so that writing a fullcontext application in C++ is straightforward and does not require much effort. In particular, previously most of the features were generated from the perl compiler, which translates the Mace syntax into corresponding (simple) C++ constructs. Because C++ is such a feature-rich language, it makes sense to implement most of features in C++ constructs. After several iteration of reimplementation, it is now possible to directly exploit C++ language features: overloading, overriding, inheritance, templates, template metaprogramming to provide an API straightforward to C++ programmers.
GCC 4.8.0 is released recently and I’ve been testing it against Mace.
Interestingly starting from gcc 4.8.0, g++ uses a new error reporting format similar to LLVM/Clang which shows exactly where the error comes from (see: Expressive Diagnostics in Clang), unlike in the previous versions where the compiler only indicates which line it is. In the case where the error is inside a macro, it also expands the macro and shows where exactly the error resides. (nice feature)
However, gcc 4.8.0 added a new warning: -Werror=unused-local-typedefs for typedefs that are locally defined but not used (which is annoying). This new warning breaks several well known systems, in particular V8 engine and Boost library. Because it emits warning when compiling boost header files and mace regards warnings as errors, gcc 4.8.0 does not compile mace. A temporary workaround is to disable the warning (cmake -D CMAKE_CXX_FLAGS=-Wno-unused-local-typedefs)
The latest boost repo eliminated some of the unused local typedefs, but not all. I made a patch to eliminate the rest of the unused local typedefs and submitted to boost. Hope this patch will appear in the next boost revision! https://svn.boost.org/trac/boost/ticket/8546
I pushed up the 1000th changeset to mace-fullcontext yesterday!
This is a huge milestone to this project. Since December 2011 we started this project, it’s been 16 months already.
This is a huge huge project that relies on so many people’s contribution.
In these 1000 changesets, we made 40401 lines of code of change in 351 files.
That contains 5700 lines of perl, 15700 lines of C++, 17900 lines of Mace code and some shell scripts.
Contribution was made by 7 members in 16 months.
We submitted a paper yesterday that describes our system. It’s anonymized submission so I can’t talk about the details.
But let me talk about the engineering effort.
We’ve wrote 40 thousand lines of code in the past one and half years, and we’ve checked in almost 1000 changesets in this period. I am the main contributor and wrote about 60% of all the code. But this is a large project. We had six people working on this project, either contributing the code, testing the system or providing suggestions, so that’s equivalent to more than a hundred man-months!
All of the effort is simply for publishing one research paper. Phew! This is not easy!
I just pushed up the 900th ContextLattice code into repository!
This is another milestone, and 1000th changeset is just around the corner. Our system has been getting more feature-rich, robust and efficient everyday!
I have transitioned ContextLattice repository to BitBucket.
We used to host our project repository at our macesystems.org. But having used BitBucket for a while (Initially I was just using BB as a backup — Mercurial makes it easy to clone a repo to another places as backup), and it turns out to be quite reliable and high quality.
It features wiki, issue-tracking, code-review among other features. But it’s actually the unlimited academic free account that attracts me the most. There are several Mercurial repository services out there, apart from BitBucket, Google Code is an alternative I would possibly use, but Google Code can only host public code which is an alas to me.
As a side note, I recently registered a domain name ContextLattice.org, but I haven’t had chance to do some decent web design, so I’ll just leave it there. I expect to release our system & code when our paper is accepted at some major conferences.
We have just submitted our latest work to PLDI (Programming Language Design and Implementation), a top conference in the programming language area today!
Thanks for every one’s contribution, and it’s great to have six people working on the same paper (i.e. on average, every one just need to write less than two pages) We had some last-minute issue last night, but thankfully we were able to workaround and get the result in time.
Also, we have the 700th changeset committed by Rui Gu today just before the paper submission!