Category Archives: Software Development

Google Compute Engine Free Trial

Google Compute Engine is Google's public cloud platform, and it is currently open for a free trial. You can sign up here:


Google Compute Engine bills virtual machines by the minute, which is convenient for users who are still trying it out. Amazon EC2, by contrast, bills by the hour, so frequently starting and stopping instances gets expensive. That said, Google Compute Engine started later than EC2 and its feature set is not yet as complete, but it is perfectly fine for experimenting.

Job Hunting

Time to seek a real job.

I love the life at Purdue, but it’s time to explore the world and put what I learned at Purdue to better use.

For any recruiters passing by: I plan to graduate in summer 2015, and I am looking for a software engineering position in systems, although I would accept any offer that pays better than the graduate stipend. :/ Here’s my resume.

I have a public repository at Bitbucket, so anyone interested in my work can take a look.

Optimization (My Experience)

My recent task was optimizing a parallel event-driven system called ContextLattice. I did a decent job of making it process events faster: the runtime performance went from ~10,000 events/s to ~200,000 events per second*, a 20x improvement. Looking back, that’s incredible, so I’d like to share the experience.

*: data obtained from an 8-core, Intel Xeon 2.0 GHz machine.

So what did I do? General rule of thumb:

  1. Use gprof. Premature optimization is the root of all evil: never attempt to optimize something until you have profiled it. gprof makes it clear which functions and data structures take up most of the execution time. There are other performance-analysis tools, but gprof is well known and the only one I used.
  2. Replace global locks with finer-grained ones. In a multi-threaded program, a global lock is typically what constrains the system’s performance, and unfortunately, the original system had several global locks. If possible, replace a global lock with several non-global locks. Whether that is possible, of course, depends on the program’s semantics.
  3. Change data structures. Data structures impact performance a lot. Sometimes it’s hard to tell which one is best (what’s the best underlying container for std::queue: std::deque or std::list?**), so just experiment with all the candidates. I use the C++ standard library and Boost data structures for the most part, which makes substitution easy.
  4. Use a memory pool. If there is a lot of dynamic memory allocation and deallocation, you can reduce it by recycling unused memory blocks. Boost has a memory pool implementation, but I ended up writing one myself because I had no experience with Boost’s. Mine is a simple FIFO queue of pointers, and it proved useful. I’d like to hear from anyone who has experience with the Boost memory pool.
  5. Reduce the work in non-parallelizable code. Ignore the functions that run in parallel with others, as they don’t impact performance much. Focus on the code protected by the global lock, and relocate it to parallelizable functions if possible.
  6. Reduce unnecessary memory copies. ContextLattice came from Mace. I used Mace APIs for some of the implementation, but it turned out some of them did not fit well, and I had to write special-purpose object constructors to reduce memory copying.
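Point 2 can be sketched roughly as follows. The names (ShardedCounter, kShards) are illustrative, not from ContextLattice; the idea is that instead of one global mutex guarding the whole table, each shard gets its own lock, so threads touching different shards don’t contend:

```cpp
#include <array>
#include <mutex>
#include <unordered_map>

// Illustrative sketch: a counter table sharded into N buckets, each with
// its own lock, instead of a single global mutex guarding everything.
class ShardedCounter {
public:
    void increment(int key) {
        Shard& s = shard(key);
        std::lock_guard<std::mutex> guard(s.lock);  // only this shard is locked
        ++s.counts[key];
    }
    int get(int key) {
        Shard& s = shard(key);
        std::lock_guard<std::mutex> guard(s.lock);
        return s.counts[key];
    }
private:
    static const int kShards = 16;
    struct Shard {
        std::mutex lock;
        std::unordered_map<int, int> counts;
    };
    Shard& shard(int key) { return shards_[static_cast<unsigned>(key) % kShards]; }
    std::array<Shard, kShards> shards_;
};
```

Two threads updating keys that hash to different shards now proceed without blocking each other; whether your program’s semantics tolerate this partitioning is exactly the caveat in point 2.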
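The memory pool of point 4, a FIFO queue of recycled pointers, might look roughly like this. This is a sketch under my reading of the description, not the actual ContextLattice code:

```cpp
#include <cstddef>
#include <queue>

// Sketch of a simple memory pool: a FIFO queue of recycled blocks.
// allocate() reuses a freed block when one is available; release()
// returns a block to the pool instead of freeing it.
template <typename T>
class SimplePool {
public:
    ~SimplePool() {
        while (!free_.empty()) {           // actually free blocks at shutdown
            ::operator delete(free_.front());
            free_.pop();
        }
    }
    T* allocate() {
        if (free_.empty())
            return static_cast<T*>(::operator new(sizeof(T)));
        T* p = static_cast<T*>(free_.front());
        free_.pop();                        // reuse the oldest recycled block
        return p;
    }
    void release(T* p) { free_.push(p); }   // recycle instead of deleting
private:
    std::queue<void*> free_;
};
```

A caller would typically construct the object in the returned block with placement new and call the destructor explicitly before release(); the pool itself only manages raw blocks.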

**: 1. std::queue uses std::deque as its underlying container by default. I found that using std::list as the container is faster, at least in the situation I encountered.

2. Also, std::map was faster than std::tr1::unordered_map (a hash map) in the situation I encountered (each map holds only a few items, but many such maps are created). Similarly, std::set was faster than std::tr1::unordered_set (a hash set).

3. std::map has the feature that its iterator traverses the structure in sorted key order (for example, if the key type is an integer, iteration starts from the item with the smallest key), so you can actually use it as a priority queue. But std::priority_queue is still faster if you only need to access the front of the structure.
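The substitutions in these footnotes are one-line changes thanks to the containers’ template parameters, which is what makes this kind of experimentation cheap. A minimal illustration (function names are mine, for demonstration only):

```cpp
#include <list>
#include <map>
#include <queue>

// std::queue defaults to std::deque as its container; swapping in
// std::list is a one-line change, so benchmarking both is cheap.
int front_via_list_queue() {
    std::queue<int, std::list<int>> q;  // std::queue<int> would use std::deque
    q.push(1);
    q.push(2);
    return q.front();
}

// std::map iterates in sorted key order, so begin() always points at
// the smallest key -- usable as a simple priority queue.
int smallest_key() {
    std::map<int, char> m;
    m[3] = 'c';
    m[1] = 'a';
    m[2] = 'b';
    return m.begin()->first;
}
```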

There are other optimizations specific to ContextLattice. They are also interesting, but I can’t explain them without introducing the system’s internals, so I’ll probably find some time to discuss them in detail.

GCC 4.8.1, C++11 and C++1y

The latest minor update of GCC, 4.8.1, was released a few days ago. This release makes GCC the first compiler to implement all major features of the C++11 specification.

See the C++11 support in gcc page for more information.

Incidentally, Clang/LLVM is also busy working toward a full implementation of C++11 in its next revision, 3.3, which is due to be released this summer. See the C++11 support in Clang page.

It’s also worth mentioning that the C++ standards committee is gearing up for the next C++ standards: C++14 (or C++1y) and C++17. Both GCC and Clang/LLVM have limited experimental support for C++1y.

Concluding the First Half of 2013

Throughout the first half of 2013, I was mainly working on two things: performance optimization and the API.

Performance has improved nearly 20-fold since last September. We were able to diagnose the performance bugs in the original Mace code and also found a better architecture for the fullcontext runtime. Maybe I’ll write a paper about the optimization some time.

I am currently working on a better fullcontext C++ API. The goal is to make the fullcontext API much easier, without impacting performance, so that writing a fullcontext application in C++ is straightforward and requires little effort. Previously, most of the features were generated by the Perl compiler, which translates Mace syntax into corresponding (simple) C++ constructs. But C++ is such a feature-rich language that it makes sense to implement most of the features directly in C++. After several iterations of reimplementation, the API now directly exploits C++ language features — overloading, overriding, inheritance, templates, and template metaprogramming — to provide an interface that is natural for C++ programmers.

My next goal is porting the fullcontext API to JavaScript, specifically the V8 JavaScript engine. It would be really awesome if the V8 engine could run fullcontext applications!


Backward

Backward is a useful tool for C++ developers for diagnosing bugs.

In a nutshell, when a program terminates abnormally, Backward prints the line of source code that caused the trouble. It does this by catching signals, relying on either the elfutils or GNU binutils libraries to read the debugging symbols in the executable.

Using it is as simple as 1, 2, 3:

  1. include backward.hpp in the source file
  2. compile the source code with debugging symbols, i.e., with the -g flag
  3. link with the elfutils or GNU binutils library
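Assuming the elfutils (libdw) backend, the three steps might translate to a command line like the following; the binutils backend uses different flags (-lbfd). This is a sketch from Backward’s documented options, with a made-up file name:

```shell
# Step 2: compile with debugging symbols (-g).
# Step 3: tell backward.hpp to use libdw and link against it
#         (use -lbfd and BACKWARD_HAS_BFD=1 for the binutils backend).
g++ -g -DBACKWARD_HAS_DW=1 main.cpp -o main -ldw
```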

GCC 4.8.0

GCC 4.8.0 was released recently, and I’ve been testing it against Mace.

Interestingly, starting from GCC 4.8.0, g++ uses a new error-reporting format similar to LLVM/Clang’s, which shows exactly where in the line the error occurs (see Expressive Diagnostics in Clang), unlike previous versions, where the compiler only indicated which line it was. When the error is inside a macro, it also expands the macro and shows exactly where the error resides. (A nice feature.)

However, GCC 4.8.0 added a new warning, -Wunused-local-typedefs, for typedefs that are defined locally but never used (which is annoying). This new warning breaks several well-known systems, notably the V8 engine and the Boost library. Because it emits warnings when compiling Boost header files, and Mace treats warnings as errors, GCC 4.8.0 does not compile Mace. A temporary workaround is to disable the warning (cmake -D CMAKE_CXX_FLAGS=-Wno-unused-local-typedefs).
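For reference, the kind of code that trips this warning is harmless-looking. A made-up minimal example:

```cpp
// Minimal example of code that triggers -Wunused-local-typedefs under
// GCC 4.8: the typedef is declared inside the function but never
// referenced. Boost headers contained patterns like this.
int square(int x) {
    typedef int unused_local_t;  // warning: typedef locally defined but not used
    return x * x;
}
```

The code is perfectly valid C++ and compiles to the same binary either way; the warning only becomes a build breaker when combined with -Werror.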

The latest Boost repository has eliminated some of the unused local typedefs, but not all. I made a patch to eliminate the rest and submitted it to Boost. I hope the patch appears in the next Boost release!

Kernel Bug

Spent a whole night tracing a strange timer bug: the Mace timer did not fire at the correct time on some machines. Specifically, on those machines the timer went off one second earlier than requested. E.g., if you ask it to fire after 2 seconds, it actually fires after 1 second. Strangely, it doesn’t happen on every machine.

Initially I thought it was a problem in the Mace code, so I spent the night digging into the timer code. Finally, there was one function to blame: pthread_cond_timedwait(). It did not behave correctly.

According to this StackOverflow post, the bug is triggered by a Linux kernel bug related to the leap second. The problem goes away after rebooting the machine, and indeed the problematic machine hadn’t been rebooted for more than a year. Well, that was unexpected.