Uploaded my 1200th changeset. This was 9 months after the 1100th changeset. I was mostly focused on defining the research problem, so not much actual code writing. In the 100 changeset, there were 111 files modified and around 4000 lines of new code.
EventWave: Programming Model and Runtime Support for Tightly-Coupled Elastic Cloud Applications (PDF) (Slides)
Wei-Chiu Chuang, Bo Sang, Sunghwan Yoo, and Rui Gu (Department of Computer Science, Purdue University), Charles Killian (Purdue University and Google), and Milind Kulkarni (School of Electrical and Computer Engineering, Purdue University)
An attractive approach to leveraging the ability of cloud computing platforms to provide resources on demand is to build elastic applications, which can dynamically scale up or down based on resource requirements. To ease the development of elastic applications, it is useful for programmers to write applications with simple sequential semantics, without considering elasticity, and rely on runtime support to provide that elasticity. While this approach has been useful in restricted domains, such as MapReduce, existing programming models for general distributed applications do not expose enough information about their inherent organization of state and computation to provide such transparent elasticity.
We introduce EVENTWAVE, an event-driven programming model that allows developers to design elastic programs with inelastic semantics while naturally exposing isolated state and computation with programmatic parallelism. In addition, we describe the runtime mechanism which takes the exposed parallelism to provide elasticity. Finally, we evaluate our implementation through microbenchmarks and case studies to demonstrate that EVENTWAVE can provide efficient, scalable, transparent elasticity for applications run in the cloud.
We are preparing to release the code publicly.
My recent task was optimizing a parallel event driven system, called ContextLattice. I’ve done some decent job to make it processes events faster Specifically, the runtime performance went from ~10,000 events/s to ~200,000 events per second*, giving 20x improvement. That’s incredible looking back, so I’d like to share the experience around.
*: data obtained from a 8-core, Intel Xeon 2.0 Ghz machine.
So what did I do? General rule of thumb:
- Use gprof. Early optimization is the root of all evils. Never attempt to optimize something until you run it. gprof makes it clear what functions or data structures take up most of the execution time. There are other performance analyzing tools, but gprof is quite well known and I did not use others.
- Replace global locks with finer-grained ones. In a multi-thread program, a global lock is typically what constrains a system’s performance, and unfortunately, I had several global locks in the original system. If possible, replace a global lock by several non-global locks. Of course, that depends on the program’s semantics.
- Change data structures. Data structures impact performance a lot. Sometimes it’s hard to tell which data structure is the best (what’s the best container for std::queue? std::deque or std::list?**). Just experiment with all candidate data structures. I use C++’s standard library and boost library data structures for most part, which enable easy substitution.
- Use memory pool. If there are a lot of dynamic memory allocation/deallocation, you can reduce those allocation by recycle unused memory blocks. Boost has a memory pool implementation, but I ended up created one myself, because I’ve had no experience with the Boost’s memory pool. What I did was a simple FIFO queue of pointers, and that proves to be useful. I’d like to hear if any one has experience with Boost memory pool.
- Reduce the work in the non-parallelizable code. Ignore those functions that run in parallel with others, as they don’t impact performance a lot. Focus on those code protected by the global lock. Relocate the code to parallelizable functions if possible.
- Reduce unnecessary memory copy. ContextLattice came from Mace. I used Mace APIs for some implementation, but it turned out some of them did not fit very well and I had to make some special object constructors for special purposes to reduce memory copy.
**: 1. std::queue uses std::deque as the internal container by default. I found using std::list as the container is faster, as least in the situation I encountered.
2. Also, std::map is faster than std::tr1::unordered_map(which is a hash map), in the situation I encountered ( A map has only a few items, but there are many of these maps created) Similarily std::set is faster than std::tr1::unordered_set (which is a hash set).
3. std::map has a feature that the iterator iterates the structure in sorted order (for example, if the index type is integer, the iterator iterates from the item with the smallest index value), so you can actually use it as a priority queue. But std::priority_queue is still faster if you only want to access the front of the structure.
There are other optimization experiences specific to ContextLattice. They are also interesting but I can’t explain them without introducing the internals of the system. So I’ll probably find some time to talk about it in details.
I just pushed up the 900th ContextLattice code into repository!
This is another milestone, and 1000th changeset is just around the corner. Our system has been getting more feature-rich, robust and efficient everyday!
I have transitioned ContextLattice repository to BitBucket.
We used to host our project repository at our macesystems.org. But having used BitBucket for a while (Initially I was just using BB as a backup — Mercurial makes it easy to clone a repo to another places as backup), and it turns out to be quite reliable and high quality.
It features wiki, issue-tracking, code-review among other features. But it’s actually the unlimited academic free account that attracts me the most. There are several Mercurial repository services out there, apart from BitBucket, Google Code is an alternative I would possibly use, but Google Code can only host public code which is an alas to me.
As a side note, I recently registered a domain name ContextLattice.org, but I haven’t had chance to do some decent web design, so I’ll just leave it there. I expect to release our system & code when our paper is accepted at some major conferences.
We have just submitted our latest work to PLDI (Programming Language Design and Implementation), a top conference in the programming language area today!
Thanks for every one’s contribution, and it’s great to have six people working on the same paper (i.e. on average, every one just need to write less than two pages) We had some last-minute issue last night, but thankfully we were able to workaround and get the result in time.
Also, we have the 700th changeset committed by Rui Gu today just before the paper submission!
After 10 months of hardwork, the 600th changeset is committed! Also, since last December, we’ve contributed more than 30,000 lines of code. This is a really huge project!
Here’s an introduction to our HotDep’12 paper
While the word ‘cloud-computing’ has become a cliché, the real strength of cloud is not yet undermined.
Elasticity, the ability to scale up or scale down the application on demand, is not understood clearly to date. There are a few exception though, such as MapReduce or bag-of-tasks type of applications. However, These programming paradigms are simple, usually stateless, and therefore easy to perform dynamic provisioning and so on. For more general type of applications, there has been very little research on how to enable elasticity to them.
We propose a programming model which enables elasticity. Moreover, it makes the programmer easy to reason about the execution of the application in elastic environment. In the model, an application runs in a “logical node.” A logical node is composed of multiple “physical nodes.” Regardless of the number of physical nodes involves in a logical node, an event executed in a logical node is logically equivalent.
However, elasticity induces a new fault-tolerance issue. A naive elasticity scheme would increase failure rate by allowing a physical node failure to induce the entire logical node failure.
Interesting, we found the mechanisms of the programming model naturally provides a solution to the fault-tolerance problem. Specifically, our programming model can recover failures at per-physical-node basis, and thereby transparently masks the physical node failures. More details of the programming model and the failure tolerance mechanisms can be found in the paper.
Congratulations to all of the contributors: Bo, Sunghwan, Rui, Yu-Chen, Chip and Milind. I almost forgot to mention that our 500th commit of Mace-Fullcontext has been pushed up to the repository! I would like to thank all of you who have contributed to this project.
We officially committed the first changeset on last December 21, and it took less than 9 months to reach the 500th. Since that time, we have committed more than 22,000 lines of code into repository.
The mace fullcontext project is a follow-up project of Mace-incontext, a parallel execution event driven system, led by Sunghwan Yoo, and we started the discussion since about last October. So it’s been almost a year ago!