Paper Accepted!

Our paper “Programming Model and Runtime Support for Elastic, Tightly-Coupled Cloud Applications” has been accepted at SOCC (ACM Symposium on Cloud Computing 2013)! I will be giving a talk as well as presenting a poster at the venue.

The conference is quite competitive: out of 115 submissions, only 23 were accepted, a 20% acceptance rate.

The conference will take place at Santa Clara University on October 1-3.

Here’s the abstract of the paper:

An attractive approach to leveraging the ability of cloud-computing platforms to provide resources on demand is to build elastic applications, which can dynamically scale up or down based on resource requirements. To ease the development of elastic applications, it is useful for programmers to write applications with simple, inelastic semantics and rely on runtime systems to provide transparent elasticity.

While this approach has been useful in restricted domains, such as MapReduce, existing programming models for general distributed applications do not expose enough information about their inherent organization of state and computation.
We introduce EventWave, an event-driven programming model that allows developers to design elastic programs with inelastic semantics while naturally exposing isolated state and computation with programmatic parallelism. In addition, we describe the runtime mechanism which takes the exposed parallelism to provide elasticity. Finally, we evaluate our implementation through microbenchmarks and case studies to demonstrate that EventWave can provide efficient, scalable, transparent elasticity for applications run in the cloud.


Amy

For some reason, my academic profile in Purdue’s system lists Amy Ingram as my advisor, though she is obviously not my advisor.

Amy was a secretary in the CS department until three years ago, when she died in an accident. She was such a kind lady and helped me out in all kinds of situations. I will never forget her. In her memory, she will forever be my advisor in my academic profile, until I graduate. Rest in peace.

[US] Credit Cards with No Foreign Transaction Fees

Using a US credit card in other countries incurs a foreign transaction fee; this should be common knowledge. The fee typically ranges from 0% to 3%. Visa/Master/AmEx/Discover take a cut first; Visa, for example, charges an “International Service Assessment (ISA)” fee of 0.15% to 1% (source), and the issuing bank may add its own fee of a varying amount on top. Fortunately, some credit cards advertise no foreign transaction fees!

While wandering around the web, I found that someone has already compiled a list of 36 US credit cards with no foreign transaction fees [English site]. Something this good is of course worth sharing directly. The data was last updated in July 2012, so some of it may have changed; use with discretion.

In addition, Chase’s website also lists several credit cards with no foreign transaction fees.

Finally, for fellow Purdue students: the foreign transaction fee on PEFCU’s credit/debit cards is 0.8%/1.0%. If the transaction is denominated in US dollars, the fee is 0.8%; if it is in a foreign currency, it is 1.0% (for example, a US$100 purchase priced in a foreign currency incurs a US$1.00 fee). Source

Eating Vegetarian in the US

Let me share how to conveniently eat vegetarian in the US.

For vegetarians, this experience is practical whether you are studying abroad or traveling on your own in the US! First, if you are in California, you can basically skip this post: the major Californian cities have large Chinese communities, so it is easy to find vegetarian Chinese food, and even vegetarian Taiwanese food.

This post is for friends living in the Midwest, where there is no decent Chinese restaurant within a hundred miles.

Continue reading: Eating Vegetarian in the US

1100th Changeset in Mace-fullcontext

I checked in the 1100th changeset in mace-fullcontext this week. Most of the recent changesets have focused on runtime optimization. The runtime can now process nearly 250,000 events per second! That is 10 times faster than six months ago. It can also process nearly 100,000 network message events per second, twice as fast as a month ago. (Network message events involve network sockets and serialization, so their throughput is lower than that of non-network-message events.)

Optimization (My Experience)

My recent task was optimizing a parallel event-driven system called ContextLattice. I’ve done a decent job of making it process events faster. Specifically, the runtime throughput went from ~10,000 events/s to ~200,000 events/s*, a 20x improvement. Looking back, that’s incredible, so I’d like to share the experience.

*: data obtained from an 8-core, Intel Xeon 2.0 GHz machine.

So what did I do? General rules of thumb:

  1. Use gprof. Premature optimization is the root of all evil; never attempt to optimize something until you have profiled it. gprof (compile and link with -pg, run the program, then run gprof on the resulting gmon.out) makes it clear which functions or data structures take up most of the execution time. There are other performance-analysis tools, but gprof is quite well known, and I did not use the others.
  2. Replace global locks with finer-grained ones. In a multi-threaded program, a global lock is typically what constrains a system’s performance, and unfortunately, I had several global locks in the original system. If possible, replace a global lock with several non-global locks (see the first sketch after this list). Of course, whether that is possible depends on the program’s semantics.
  3. Change data structures. Data structures impact performance a lot. Sometimes it’s hard to tell which data structure is best (what’s the best underlying container for std::queue: std::deque or std::list?**). Just experiment with all candidate data structures. I use the C++ standard library and Boost data structures for the most part, which makes substitution easy.
  4. Use a memory pool. If there is a lot of dynamic memory allocation/deallocation, you can reduce those allocations by recycling unused memory blocks (see the second sketch after this list). Boost has a memory pool implementation, but I ended up creating one myself, because I had no experience with Boost’s memory pool. What I did was a simple FIFO queue of pointers, and that proved to be useful. I’d like to hear if anyone has experience with the Boost memory pool.
  5. Reduce the work in non-parallelizable code. Ignore the functions that run in parallel with others, as they don’t impact performance much. Focus on the code protected by the global lock, and relocate it to parallelizable functions if possible.
  6. Reduce unnecessary memory copies. ContextLattice came from Mace. I used Mace APIs for some of the implementation, but it turned out some of them did not fit very well, and I had to write special-purpose object constructors to reduce memory copying.
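
To make rule 2 concrete, here is a minimal sketch of per-context locking. The names (Context, update) are mine for illustration, not the actual ContextLattice code; the point is only that independent contexts no longer contend on one global mutex.

```cpp
#include <pthread.h>
#include <map>
#include <string>

// Sketch of rule 2: one lock per context instead of a single global
// lock, so events touching different contexts can run in parallel.
struct Context {
    Context()  { pthread_mutex_init(&lock, NULL); }
    ~Context() { pthread_mutex_destroy(&lock); }
    pthread_mutex_t lock;
    std::map<std::string, int> state;
};

void update(Context& ctx, const std::string& key, int value) {
    pthread_mutex_lock(&ctx.lock);   // serializes only this context
    ctx.state[key] = value;
    pthread_mutex_unlock(&ctx.lock);
}
```

And here is a sketch of rule 4 in the spirit of the FIFO-queue-of-pointers approach described above. This is my reconstruction under the assumption of fixed-size blocks, not the actual implementation; a real pool in a multi-threaded runtime would also need its own (non-global) lock.

```cpp
#include <new>
#include <queue>

// Sketch of rule 4: recycle fixed-size blocks through a FIFO queue
// of pointers instead of hitting the allocator on every event.
template <typename T>
class SimplePool {
public:
    ~SimplePool() {
        while (!free_.empty()) {
            ::operator delete(free_.front());
            free_.pop();
        }
    }
    void* allocate() {
        if (free_.empty()) return ::operator new(sizeof(T));
        void* p = free_.front();
        free_.pop();
        return p;
    }
    void release(void* p) { free_.push(p); }  // recycle, don't free
private:
    std::queue<void*> free_;
};

// Usage: construct with placement new, destroy explicitly, recycle.
//   SimplePool<Event> pool;
//   Event* e = new (pool.allocate()) Event();
//   e->~Event();
//   pool.release(e);
```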

**: 1. std::queue uses std::deque as the underlying container by default. I found that using std::list as the container is faster, at least in the situation I encountered.

2. Also, std::map was faster than std::tr1::unordered_map (a hash map) in the situation I encountered (each map held only a few items, but many such maps were created). Similarly, std::set was faster than std::tr1::unordered_set (a hash set).

3. std::map has the property that its iterator traverses the structure in sorted order (for example, if the key type is an integer, iteration starts from the item with the smallest key), so you can actually use it as a priority queue. But std::priority_queue is still faster if you only need to access the front of the structure. A small sketch of both substitutions follows below.
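
Here is a small sketch of the two substitutions above; the variable names are mine, for illustration only.

```cpp
#include <list>
#include <map>
#include <queue>
#include <string>

int main() {
    // Footnote 1: std::queue's second template argument picks the
    // underlying container; std::deque is the default.
    std::queue<int>                  default_q;  // backed by std::deque
    std::queue<int, std::list<int> > list_q;     // backed by std::list

    // Footnote 3: a std::map iterates in key order, so begin()
    // is always the smallest key, like the top of a min-heap.
    std::map<int, std::string> m;
    m[3] = "c"; m[1] = "a"; m[2] = "b";
    int smallest = m.begin()->first;             // 1
    (void)smallest;
    return 0;
}
```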

There are other optimizations specific to ContextLattice. They are also interesting, but I can’t explain them without introducing the internals of the system, so I’ll probably find some time to talk about them in detail.