Ceph Async Messenger

This post discusses the high-level architecture of Ceph network layer focusing on Async Messenger code flow. Ceph currently has three messenger implementations – Simple, Async, xio. Async Messenger by far is the most efficient messenger over the others. It can handle different transport types like posix, rdma, dpdk. It uses a limited thread pool for connections (based on number of replicas or EC chunks) and polling system to achieve high-concurrency. It supports epoll, kqueue and select based on system’s environment and availability.


Async Messenger comprises mainly of three components:

  1. Network Stack – initializes stack based on Posix, RDMA, DPDK options
  2. Processor – handles socket layer communication
  3. AsyncConnection – maintains connection state machine

To understand the code flow, let’s look at the client-server model i.e. from the perspective of how a connection is established between Rados client and OSD. Starting with the OSD, three types of messengers are created – public, cluster and heartbeat:

int main() {
 Messenger *ms_public = Messenger::create(g_ceph_context, public_msg_type,..)
 Messenger *ms_cluster = Messenger::create(g_ceph_context, cluster_msg_type,..)
 Messenger *ms_hb_back_client = Messenger::create(g_ceph_context, cluster_msg_type,..)

Network Stack

On creation of an AsyncMessenger instance, it initiates the network stack. Based on the messenger type parameter, Posix (/rdma/dpdk) transport type stack is created. Heap memory is allocated here for Processors (more on this below) based on number of workers.

NetworkStack::NetworkStack(CephContext *c, const string &t)
: type(t), started(false), cct(c) {
  for (unsigned i = 0; i center.process_events(EventMaxWaitUs, &dur);

Number of Worker threads created is dependent on ms_async_op_threads config value. EventCenter class which is an abstraction to handle various event drivers such as epoll, kqueue, select is initiated for all the workers. Each worker thread then creates an epoll (/kqueue/select) instance (line#7) and starts waiting in order to process epoll events inside each thread (line#20).

Once this is all set up, OSD needs to bind the port and, listen in on the socket for incoming connections.

int main(){
  if (ms_public->bindv(public_addrs) bindv(cluster_addrs) start();
  // start osd
  err = osd->init();

Processor & AsyncConnection

Socket processing operations such as bind, listen, accept are handled by Processor class . This class is initiated in AsyncMessenger’s constructor. line#4 and line#7 call into Processor’s bind method. After the address and port are bound, it starts ‘listening’ on the socket and waits for incoming connections.

int Processor::bind(const entity_addrvec_t &bind_addrs,...){
 if (listen_addr.get_port()) {
                           [this, k, &listen_addr, &opts, &r]() {
        r = worker->listen(listen_addr, opts, &listen_sockets[k]);
      }, false);
int PosixWorker::listen(entity_addr_t &sa, const SocketOptions &opt,
                        ServerSocket *sock){
  r = ::bind(listen_sd, sa.get_sockaddr(), sa.get_sockaddr_len());
  r = ::listen(listen_sd, cct->conf->ms_tcp_listen_backlog);

After bind and listen, OSD is started in osd->init()  call (in ceph_osd.cc). This will call  into the Processor’s start method which adds the listen file descriptor to epoll with a callback to Processor::accept. Now every time there’s  a new connection, ‘EventEpoll’ is ready to process.

Processor::Processor(AsyncMessenger *r, Worker *w, CephContext *c)
   :  msgr(r), net(c), worker(w),
     listen_handler(new C_processor_accept(this)) {}

void Processor::start()
  // start thread
  worker->center.submit_to(worker->center.get_id(), [this]() {
       for (auto& l : listen_sockets) {
         if (l) {
          worker->center.create_file_event(l.fd(), EVENT_READABLE,
                                           listen_handler); }
     }, false);

Network connections between client and servers are maintained via AsyncConnection state machine.  On connection acceptance in Processor, it calls into AsyncConnection::accept which assigns an initial state of START_ACCEPTING. Further,  AsyncConnection read_handler which is dispatched as an external event, begins processing the connection:

void AsyncConnection::accept(ConnectedSocket socket, entity_addr_t &addr){

class C_handle_read : public EventCallback {
  AsyncConnectionRef conn;

  explicit C_handle_read(AsyncConnectionRef c): conn(c) {}
  void do_request(uint64_t fd_or_id) override {

void AsyncConnection::process() {

On the client side, RadosClient only initiates a messenger. An actual connection with an OSD is only established when an operation is submitted to OSD in Objecter module.

int librados::RadosClient::connect(){

void Objecter::_op_submit(Op *op, shunique_lock& sul, ceph_tid_t *ptid){
  // Try to get a session, including a retry if we need to take write lock
  int r = _get_session(op->target.osd, &s, sul);

int Objecter::_get_session(...){
  s->con = messenger->connect_to_osd(osdmap->get_addrs(osd));

This completes the connection establishment between Rados client and an OSD server. We did not delve into the finer details like policy parameters, heartbeat clusters, EventCenter code and more – it’s worth looking into those and piecing all this information together for broader understanding. Perhaps in another post!

OpenStack Tokyo Summit

OpenStack summit is a biannual conference for developers, users and admins of OpenStack cloud software. This being my second conference, I was better equipped with the knowledge of what to expect and how to better plan out my schedule around sessions. It is quite easy to get overwhelmed with the scale of the event, number of sessions/speakers and may feel like information overload if not planned well.

There are two main tracks of conference: 1) Main Conference 2) Design Summit. Main conference is for general attendees that includes keynotes, breakout tracks, hands-on labs and some sponsored sessions. Design summit is for developers and operators who contribute code and feedback for the next cycle. It does not reflect the classic tracks with speakers and sessions, and are not recorded at the moment.

My day 1 schedule included keynotes by Jonathan Bryce who was later joined by Egle Sigler, Lachlan Evenson and Takuya Ito who talked about OpenStack Yahoo! Japan usecase. My other sessions in the schedule are largely related to Swift – OpenStack Object Storage. Swift makes an ideal storage solution for web applications that need to store large volumes of data. Building web-applications using OpenStack Swift talk covers overview of swift with focus on it’s features and how they are useful for web application developers. It includes good code examples based off popular web frameworks AngularJS and Django. Shingled Magnetic Recording (SMR) Drives and Swift Object Storage was another great session on how SMR drives have the potential to reduce storage costs with data from public cloud access patterns observed at SoftLayer. This talk discusses SMR drives background, how to improve Swift performance with SMR, and potential swift changes that would enable better usage of SMR drives.

Swift associates user defined metadata attributes and values with containers and objects. Metadata search is gaining significant interest in the community and is currently not available in Swift. This talk on Boosting the Power of Swift Using Metadata Search describes components of design and implementation of a metadata search capability integrated with swift. Discusses enabling some key applications that were previously not possible. Another talk of value was on Erasure Coding in Swift. EC is moved out of beta and now supported in Swift. Swift Erasure Code Performance vs Replication: Analysis and Recommendations session discusses on how and where EC would be useful basing off performance study carried out on multiple clusters with various CPU, network, memory configurations. This helps in analyzing how different parameters affect performance per policy in Swift. At this summit, I had the opportunity to conduct a workshop on Git and Gerrit with Amy Marrich and Tamara Johnston. Git and Gerrit are some of the essential tools in getting started with OpenStack software development. This workshop walks-through the process of patch submission in OpenStack using a test repository.

For rest of the week my schedule was Design Summit. This part of the summit mainly focuses on software development discussions for the Mitaka cycle, existing barriers, new proposals and also feedback from operators. Generally the schedules for each day of the design summit are laid out in an etherpad prior to summit. Focus for swift community was on exciting new features like data-at-rest encryption, container sync, ring placement, symlinks, container sharding, fast-post and more. My current code contributions to Swift are on Encryption feature. We were able to make significant progress on pending design decisions, discuss on blockades and come up with measurable goals for the M cycle. It turned out to be highly productive and the discussions gave good insights into other upcoming features in swift.

Going to the summit, to be able to meet people in person, interact and discuss on-going work was highly valuable to me. You’d finally know persons behind IRC nicks and it is a sure way to enhance further online communications on IRC, email etc.

Forgot your vm password?

I run a couple of virtual machines on my work laptop and a couple more on my personal laptop. I didn’t open one of them for quite sometime and naturally forgot the password (reminder to keep easy/handy ones!).

I use Ubuntu vms created using virtual box. Here is what worked for me to reset vm password:

  •  Right after you boot your vm, hold on the Shift key, let it load the grub menu
  •  Select Ubuntu Recover mode option
  •  Scroll down and click ‘drop to root shell prompt’

At this point you could change your password by doing:

passwd {username}

If it threw the below error:

passwd: Authentication token manipulation error
passwd: password unchanged

It means, the filesytem is mounted in ready-only mode and hence is preventing from changing password. Make it read-write by running this:

mount -rw -o remount /

Resetting password with this should now work: passwd {username}

Mocking, Python

I have used Python mock and patch to do some testing for the OPTIONS call and also the server type check in Swift. unittest.mock is a library for testing in Python. It can be used to replace parts of the code, return data and then make assertions about them.

class unittest.mock.Mock(side_effect=None, return_value=DEFAULT, **kwargs)

side_effect: A function to be called whenever the Mock is called. It is useful to raise exceptions.
return_value: The value returned when the mock is called.

Mock provides a patching module that could be used to patch a specific function, class, class level attributes in the scope of test. For instance,

def test_foo(self, mock_urlopen):
def getheader(name):
d = {‘Server’: ‘server-type’}
return d.get(name)
mock_urlopen.return_value.info.return_value.getheader = getheader

In the above function that I used a mock patch to mock urlopen and it’s return value. To test the OPTIONS call in swift-recon, this simulation will faciliate me by returning header value of “Server” as “server-type”. Further, I could assert the return of the actual value returned by the scout_server_type method in recon (which calls OPTIONS), like this:

self.assertEqual(content, ‘server-type’)

Content here is the return value of OPTIONS call in recon.py.

To test the validate servers method (server_type_check) call, I could have created a fake class for scout_server_type or mock the method itself. I have done the latter:

def mock_scout_server_type(app, host):
url = ‘http://%s:%s/’ % (host[0], host[1])
response = responses[host[1]]
status = 200
return url, response, status

patches = [
mock.patch(‘sys.stdout’, new=stdout),

This mock, will allow me to replace the scout_server_type call and instead call “mock_scout_server_type” and returns those method’s values for the scope of the test.

with nested(*patches):

With the above, when server_type_check is called, the internal call to scout_server_type will be replaced by mock_scout_server_type.


In continuation with the last post – the recent development is I have build some tests to test the base class. Also added a Server header for OPTIONS. Server is a standard header to inform the client about the type of software used by origin server.[1]

The goal is to add a feature to swift-recon cli to validate servers in the ring. Swift-recon cli is a command line utility to obtain various metrics and telemetry from the servers. To validate servers, I need to get the Server type by calling OPTIONS. This support is added in swift-recon to call OPTIONS (only GET is supported until now). The new command to validate ring will be


The server_type_check method which I newly created will be called when the command is given. This fetches all the hosts and loops through them by calling scout_server_type (in Scout) which in turn calls OPTIONS on the host:

for url, response in self.pool.imap(recon.scout_server_type, hosts)

imap is a itertool [2] which creates iterator for looping. In the above line, “recon.scout_server_type” is called for each of the “hosts”.

The method responds to the arguments – object/container/account and verify if the servers in the ring indeed match to the argument given. For instance,

swift-recon object –validate-servers

For the above command, it will verify if all of the servers are of object type (by passing the host name to OPTIONS and getting the Server header). If it is not, it displays what they are – container/account.

This needs tests to verify. This post will be continued in the next by detailing the tests and/or any obstacles.

[1] http://tools.ietf.org/html/rfc7231#section-7.4.2
[2] https://docs.python.org/2/library/itertools.html

OPTIONS for Swift

Lately i have been working on implementing OPTIONS (1) verb for the storage nodes in Swift. The proxy server already implements this. The use of OPTIONS is to find out information on the server or other resources. In Swift’s case, OPTIONS is to return the publicly available HTTP methods of the server i.e. allowed methods, and also the server type. For this implementation, a base class was created (this was inherited by all the storage servers – Account, Container, Object) and all the public methods of the server were called by using:

inspect.getmembers(self, predicate=method)

For the returned methods from the server, the attributes ‘publicly_accessible’ is checked and added to a list and the sorted list is returned when OPTIONS is called. But these allowed methods change if the storage node cofig sets the server to be a replication server. To allow this check I extended the base class to include replication server check. The __call__(2) method was making this check earlier. I modified __call__ to access the methods from the base class.

I proceeded to add tests for it and everything went fine – almost. One of the existing Swift tests started failing. It was failing because, __call__ method is being tested by using a Magic Mock(3) method. One of the community members pointed the fact that magic mock method is callable but not a method. __call__ makes replication server check by calling the allowed methods from the base class. And the base class uses the predicate method. So it is modified to use predicate callable.

inspect.getmembers(self, predicate=callable)

That solved the issue for the test and I went on to add further tests to assert the OPTIONS functionality and also the replication server check in __call__ method.


(1)OPTIONS method requests information on communication options available for a specific resource, either at the origin or an intermediary. In Swift’s case, it is at origin which is the proxy server.

(2) __call__ If this method is implemented for a class, it makes a class instance callable. For example:

class foo:
def __init__(self, a, b)

def __call__(self, a, b)

x = foo(a, b)     -> calls __init__ method
x(a, b)               -> calls __call__ method

(3) Magic Mock: unittest.mock is Python’s test library. MagicMock is a subclass of Mock with default implementations of most of the magic methods. (This is a very interesting concept and it deserves an entire post about it along with examples.)

OpenStack Swift

From this post to the next few ones, I will write about my current contributions to OpenStack-Swift as part of the Outreach Program by Gnome foundation.

Swift is an object storage engine written in Python. I am working to make sure clusters are set up with the correct configuration. To do so OPTIONS * verb is used to get the information about servers which then the swift client can access and make sure the configuration is done right.

Before diving into the details of the implementation, this post will brief over swift architecture.

Swift object storage allows you to store and retrieve files. It is a distributed storage platform (API accessible) for static data. Data is stored in a structured three level – account, container, object.

Account server – Account storage contains metadata descriptive information about itself and the list of containers in the account.
Container server – Container storage area has metadata about itself (the container) and the list of objects in it.
Object server – Object storage location is where the data object and its metadata will be stored.

Swift Architecture

Storage node is a machine that is running swift services. A cluster is a collection of one or more nodes. Clusters can be distributed across different regions. Swift architecture stores by default three replicas (of partitions) for durability and resilience to failure.

The Auth system:

TempAuth – In this, the authentication part can be an external system or a subsystem with in Swift. The user passes an auth token to Swift and swift validates it with the auth system (external or within). If valid, the auth system passes back an expiration date, and Swift stores the expiration part in its cache.

Keystone Auth – Swift can authenticate against OpenStack Keystone system.

Extending Auth – This can be done by writing a new wsgi middleware just like Keystone project is implementing.

Proxy server – Proxy server lets you interact with the rest of the architecture. It takes incoming requests, looks up the location of account, container or object in the ring.

The ring – A ring defines mapping between entities stored on the disk and their physical location. There are different rings for accounts, containers and one object ring per storage policy (more on this below). To determine a location of any account, container or object, we need to interact with the ring. The ring is also responsible for determining which devices are used for handoff in failure scenarios.

Storage policies – Storage policies provide a way to give different feature levels and services in the way a object is stored. For instance, some of the containers might have default 3x replication, the new containers could be using 2x replication. Once a container is created with a storage policy, all the objects in it will also be created with the same policy.

Details of these will be explored as I move ahead with my work.

Sources: A lot of my understanding and this briefing comes by reading the docs and by making contributions to the code.