Ceph Async Messenger

This post discusses the high-level architecture of Ceph's network layer, focusing on the Async Messenger code flow. Ceph currently has three messenger implementations – Simple, Async and XIO. Async Messenger is by far the most efficient of the three. It can handle different transport types: POSIX, RDMA and DPDK. It uses a limited thread pool for connections (sized based on the number of replicas or EC chunks) and a polling system to achieve high concurrency. It supports epoll, kqueue and select, chosen based on the system's environment and availability.


Async Messenger comprises three main components:

  1. Network Stack – initializes the stack for the POSIX, RDMA or DPDK option
  2. Processor – handles socket-layer communication
  3. AsyncConnection – maintains the connection state machine

To understand the code flow, let's look at the client-server model, i.e. from the perspective of how a connection is established between a RADOS client and an OSD. Starting with the OSD, three types of messengers are created – public, cluster and heartbeat:

int main() {
  Messenger *ms_public = Messenger::create(g_ceph_context, public_msg_type, ...);
  Messenger *ms_cluster = Messenger::create(g_ceph_context, cluster_msg_type, ...);
  Messenger *ms_hb_back_client = Messenger::create(g_ceph_context, cluster_msg_type, ...);

Network Stack

On creation of an AsyncMessenger instance, it initializes the network stack. Based on the messenger type parameter, a POSIX (or RDMA/DPDK) transport stack is created. Heap memory is allocated here for Processors (more on this below) based on the number of workers.

NetworkStack::NetworkStack(CephContext *c, const string &t)
  : type(t), started(false), cct(c) {
  for (unsigned i = 0; i < num_workers; ++i) {
    Worker *w = create_worker(cct, type, i);
    ...
  }
}

// each worker thread then loops on:
center.process_events(EventMaxWaitUs, &dur);

The number of worker threads created depends on the ms_async_op_threads config value. The EventCenter class, an abstraction over the various event drivers (epoll, kqueue, select), is initialized for each worker. Each worker thread then creates an epoll (or kqueue/select) instance and waits inside its own thread to process events.

Once this is all set up, the OSD needs to bind the port and listen on the socket for incoming connections.

int main() {
  if (ms_public->bindv(public_addrs) < 0)
    ...
  if (ms_cluster->bindv(cluster_addrs) < 0)
    ...
  ms_public->start();
  ms_cluster->start();
  // start osd
  err = osd->init();

Processor & AsyncConnection

Socket operations such as bind, listen and accept are handled by the Processor class, which is instantiated in AsyncMessenger's constructor. The bindv calls above route into Processor's bind method. After the address and port are bound, it starts listening on the socket and waits for incoming connections.

int Processor::bind(const entity_addrvec_t &bind_addrs, ...) {
  if (listen_addr.get_port()) {
    worker->center.submit_to(worker->center.get_id(),
        [this, k, &listen_addr, &opts, &r]() {
          r = worker->listen(listen_addr, opts, &listen_sockets[k]);
        }, false);
int PosixWorker::listen(entity_addr_t &sa, const SocketOptions &opt,
                        ServerSocket *sock){
  r = ::bind(listen_sd, sa.get_sockaddr(), sa.get_sockaddr_len());
  r = ::listen(listen_sd, cct->conf->ms_tcp_listen_backlog);

After bind and listen, the OSD is started in the osd->init() call (in ceph_osd.cc). This calls into Processor's start method, which adds the listening file descriptor to epoll with a callback to Processor::accept. Now, whenever a new connection arrives, the epoll driver ('EventEpoll') is ready to process it.

Processor::Processor(AsyncMessenger *r, Worker *w, CephContext *c)
   :  msgr(r), net(c), worker(w),
     listen_handler(new C_processor_accept(this)) {}

void Processor::start() {
  // start thread
  worker->center.submit_to(worker->center.get_id(), [this]() {
    for (auto& l : listen_sockets) {
      if (l) {
        worker->center.create_file_event(l.fd(), EVENT_READABLE,
                                         listen_handler);
      }
    }
  }, false);
}
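To make the pattern concrete, here is a minimal sketch in Python of what Processor::start is doing: register the listening fd with the event driver and attach an accept callback. Python's selectors module, much like EventCenter, picks epoll, kqueue or select depending on the platform; the sockets and callback below are invented for illustration.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # picks epoll/kqueue/select per platform
accepted = []

def on_accept(listener):
    # the analogue of Processor::accept, run when the listen fd is readable
    conn, addr = listener.accept()
    accepted.append(addr)
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))     # ephemeral port
listener.listen(8)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, on_accept)

# simulate one incoming connection and one pass of the event loop
client = socket.create_connection(listener.getsockname())
for key, _ in sel.select(timeout=1.0):
    key.data(key.fileobj)           # dispatch to the registered callback
client.close()
listener.close()
print(len(accepted))  # → 1
```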

Network connections between clients and servers are maintained via the AsyncConnection state machine. On accepting a connection, the Processor calls into AsyncConnection::accept, which assigns an initial state of START_ACCEPTING. The AsyncConnection's read_handler, dispatched as an external event, then begins processing the connection:

void AsyncConnection::accept(ConnectedSocket socket, entity_addr_t &addr){

class C_handle_read : public EventCallback {
  AsyncConnectionRef conn;

  explicit C_handle_read(AsyncConnectionRef c): conn(c) {}
  void do_request(uint64_t fd_or_id) override {

void AsyncConnection::process() {
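The accept-side state progression can be caricatured in a few lines of Python; the state names follow AsyncConnection's, but this is only a toy sketch – the real machine has many more states and drives the actual handshake and message reads/writes.

```python
from enum import Enum, auto

class State(Enum):
    NONE = auto()
    START_ACCEPTING = auto()
    ACCEPTING = auto()
    OPEN = auto()

class Conn:
    def __init__(self):
        self.state = State.NONE

    def accept(self):
        # AsyncConnection::accept assigns the initial state
        self.state = State.START_ACCEPTING

    def process(self):
        # the read_handler dispatches here; each pass advances the state
        if self.state is State.START_ACCEPTING:
            self.state = State.ACCEPTING      # handshake/banner exchange
        elif self.state is State.ACCEPTING:
            self.state = State.OPEN           # ready to carry messages

c = Conn()
c.accept()
c.process()
c.process()
print(c.state.name)  # → OPEN
```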

On the client side, RadosClient only creates a messenger. An actual connection to an OSD is established only when an operation is submitted to the OSD in the Objecter module.

int librados::RadosClient::connect(){

void Objecter::_op_submit(Op *op, shunique_lock& sul, ceph_tid_t *ptid){
  // Try to get a session, including a retry if we need to take write lock
  int r = _get_session(op->target.osd, &s, sul);

int Objecter::_get_session(...){
  s->con = messenger->connect_to_osd(osdmap->get_addrs(osd));

This completes the connection establishment between a RADOS client and an OSD server. We did not delve into finer details like policy parameters, heartbeat clusters, the EventCenter code and more – it's worth looking into those and piecing all this information together for a broader understanding. Perhaps in another post!


OpenStack Tokyo Summit

The OpenStack summit is a biannual conference for developers, users and admins of the OpenStack cloud software. This being my second conference, I was better equipped with the knowledge of what to expect and how to plan my schedule around sessions. It is quite easy to get overwhelmed by the scale of the event and the number of sessions and speakers, and it may feel like information overload if not planned well.

There are two main tracks at the conference: 1) the Main Conference and 2) the Design Summit. The main conference is for general attendees and includes keynotes, breakout tracks, hands-on labs and some sponsored sessions. The design summit is for developers and operators who contribute code and feedback for the next cycle. It does not follow the classic track format of speakers and sessions, and its discussions are not recorded at the moment.

My day 1 schedule included keynotes by Jonathan Bryce, who was later joined by Egle Sigler, Lachlan Evenson and Takuya Ito, who talked about the OpenStack use case at Yahoo! Japan. The other sessions on my schedule were largely related to Swift, the OpenStack object storage system. Swift makes an ideal storage solution for web applications that need to store large volumes of data. The 'Building web applications using OpenStack Swift' talk covers an overview of Swift with a focus on its features and how they are useful for web application developers, including good code examples based on the popular web frameworks AngularJS and Django. 'Shingled Magnetic Recording (SMR) Drives and Swift Object Storage' was another great session on how SMR drives have the potential to reduce storage costs, with data from public cloud access patterns observed at SoftLayer. This talk discusses the background of SMR drives, how to improve Swift performance with SMR, and potential Swift changes that would enable better use of SMR drives.

Swift associates user-defined metadata attributes and values with containers and objects. Metadata search is gaining significant interest in the community but is not currently available in Swift. The talk 'Boosting the Power of Swift Using Metadata Search' describes the design and implementation of a metadata search capability integrated with Swift, and discusses enabling some key applications that were previously not possible. Another talk of value was on erasure coding in Swift; EC has moved out of beta and is now supported in Swift. The 'Swift Erasure Code Performance vs Replication: Analysis and Recommendations' session discusses how and where EC would be useful, based on a performance study carried out on multiple clusters with various CPU, network and memory configurations. This helps in analyzing how different parameters affect performance per policy in Swift. At this summit, I had the opportunity to conduct a workshop on Git and Gerrit with Amy Marrich and Tamara Johnston. Git and Gerrit are essential tools for getting started with OpenStack software development. The workshop walks through the process of patch submission in OpenStack using a test repository.

For the rest of the week my schedule was the Design Summit. This part of the summit focuses mainly on software development discussions for the Mitaka cycle: existing barriers, new proposals and feedback from operators. Generally the schedules for each day of the design summit are laid out in an etherpad prior to the summit. The focus for the Swift community was on exciting new features like data-at-rest encryption, container sync, ring placement, symlinks, container sharding, fast-POST and more. My current code contributions to Swift are on the encryption feature. We were able to make significant progress on pending design decisions, discuss blockers and come up with measurable goals for the M cycle. It turned out to be highly productive, and the discussions gave good insights into other upcoming features in Swift.

Going to the summit to meet people in person and discuss ongoing work was highly valuable to me. You finally get to know the people behind the IRC nicks, and it is a sure way to enhance further online communication over IRC, email and the like.

Forgot your vm password?

I run a couple of virtual machines on my work laptop and a couple more on my personal laptop. I hadn't opened one of them for quite some time and naturally forgot the password (reminder to keep easy/handy ones!).

I use Ubuntu VMs created with VirtualBox. Here is what worked for me to reset the VM password:

  •  Right after you boot your VM, hold the Shift key to load the GRUB menu
  •  Select the Ubuntu recovery mode option
  •  Scroll down and choose 'drop to root shell prompt'

At this point you could change your password by doing:

passwd {username}

If it throws the below error:

passwd: Authentication token manipulation error
passwd: password unchanged

It means the filesystem is mounted read-only, which prevents changing the password. Make it read-write by running:

mount -o remount,rw /

Resetting password with this should now work: passwd {username}

Mocking, Python

I have used Python's mock and patch to do some testing of the OPTIONS call and the server-type check in Swift. unittest.mock is a library for testing in Python. It can be used to replace parts of the system under test, control what they return, and then make assertions about how they were used.

class unittest.mock.Mock(side_effect=None, return_value=DEFAULT, **kwargs)

side_effect: A function to be called whenever the Mock is called. It is useful for raising exceptions.
return_value: The value returned when the mock is called.
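As a quick self-contained illustration of both parameters (the mock names here are made up for the example):

```python
from unittest import mock

fetch = mock.Mock(return_value=42)
result = fetch("anything")           # arguments are recorded, not used
fetch.assert_called_once_with("anything")

fail = mock.Mock(side_effect=ValueError("boom"))
try:
    fail()
except ValueError as e:
    error = str(e)

print(result, error)  # → 42 boom
```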

Mock provides a patching utility, mock.patch, that can be used to patch a specific function, class or class-level attribute within the scope of a test. For instance,

def test_foo(self, mock_urlopen):
    def getheader(name):
        d = {'Server': 'server-type'}
        return d.get(name)
    mock_urlopen.return_value.info.return_value.getheader = getheader

In the above function I used a mock patch to mock urlopen and its return value. To test the OPTIONS call in swift-recon, this simulation facilitates returning the 'Server' header value as 'server-type'. Further, I can assert on the actual value returned by the scout_server_type method in recon (which calls OPTIONS), like this:

self.assertEqual(content, 'server-type')

content here is the return value of the OPTIONS call in recon.py.
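Putting it together, here is a self-contained runnable version of the same idea; scout_server_type below is a hypothetical stand-in for the swift-recon method, and the patch target is Python 3's urllib.request.urlopen rather than the Python 2 urllib2 swift-recon used at the time:

```python
from unittest import mock

def scout_server_type(url):
    # stand-in for the recon method: issue the request, read the header
    from urllib.request import urlopen
    return urlopen(url).info().getheader('Server')

def getheader(name):
    return {'Server': 'server-type'}.get(name)

with mock.patch('urllib.request.urlopen') as mock_urlopen:
    # wire up the chained return values, as in the test above
    mock_urlopen.return_value.info.return_value.getheader = getheader
    content = scout_server_type('http://host:6200/')

print(content)  # → server-type
```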

To test the validate-servers method (server_type_check), I could have created a fake class for scout_server_type or mocked the method itself. I did the latter:

def mock_scout_server_type(app, host):
    url = 'http://%s:%s/' % (host[0], host[1])
    response = responses[host[1]]
    status = 200
    return url, response, status

patches = [
    mock.patch('sys.stdout', new=stdout),

This mock allows me to replace the scout_server_type call with mock_scout_server_type, which returns that method's values for the scope of the test.

with nested(*patches):

With the above, when server_type_check is called, the internal call to scout_server_type will be replaced by mock_scout_server_type.
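One caveat: contextlib.nested only exists in Python 2. On Python 3 the same stacking of patches can be done with contextlib.ExitStack; a minimal sketch (the patch list here is illustrative):

```python
import io
from contextlib import ExitStack
from unittest import mock

patches = [
    mock.patch('sys.stdout', new=io.StringIO()),
]

with ExitStack() as stack:
    entered = [stack.enter_context(p) for p in patches]
    print('captured')                 # goes to the StringIO, not the terminal
    output = entered[0].getvalue()

# sys.stdout is restored once the block exits
print(repr(output))  # → 'captured\n'
```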


In continuation of the last post: the recent development is that I have built some tests for the base class, and also added a Server header for OPTIONS. Server is a standard header informing the client about the type of software used by the origin server. [1]

The goal is to add a feature to the swift-recon CLI to validate servers in the ring. swift-recon is a command-line utility to obtain various metrics and telemetry from the servers. To validate servers, I need to get the server type by calling OPTIONS. This support is added in swift-recon to call OPTIONS (only GET was supported until now). The new command to validate the ring will be:

swift-recon <server_type> --validate-servers

The server_type_check method, which I newly created, is called when this command is given. It fetches all the hosts and loops through them, calling scout_server_type (in Scout), which in turn calls OPTIONS on each host:

for url, response in self.pool.imap(recon.scout_server_type, hosts):

imap [2] creates an iterator for looping, in the style of the itertools mapping functions. In the above line, recon.scout_server_type is called for each of the hosts.
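The pool in swift-recon is eventlet's GreenPool, but any pool's imap behaves similarly; here is a minimal sketch using Python's thread-backed pool, with the hosts and return values made up for the example:

```python
from multiprocessing.dummy import Pool  # thread-backed pool, for illustration

def scout_server_type(host):
    # stand-in for the real OPTIONS call against the host
    url = 'http://%s:%s/' % (host[0], host[1])
    return url, 'object-server'

hosts = [('10.0.0.1', 6200), ('10.0.0.2', 6200)]
with Pool(2) as pool:
    # imap lazily maps the function over hosts, yielding results in order
    results = list(pool.imap(scout_server_type, hosts))

for url, server in results:
    print(url, server)
```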

The method takes an argument – object, container or account – and verifies whether the servers in the ring indeed match the given argument. For instance,

swift-recon object --validate-servers

For the above command, it verifies whether all of the servers are of object type (by calling OPTIONS on each host and reading the Server header). For any that are not, it displays what they are – container/account.

This needs tests to verify it. This post will be continued in the next, detailing the tests and/or any obstacles.

[1] http://tools.ietf.org/html/rfc7231#section-7.4.2
[2] https://docs.python.org/2/library/itertools.html

OPTIONS for Swift

Lately I have been working on implementing the OPTIONS (1) verb for the storage nodes in Swift. The proxy server already implements this. OPTIONS is used to find out information about the server or other resources. In Swift's case, OPTIONS returns the publicly available HTTP methods of the server, i.e. the allowed methods, and also the server type. For this implementation, a base class was created (inherited by all the storage servers – account, container, object) and all the public methods of the server were collected using:

inspect.getmembers(self, predicate=method)

For the methods returned from the server, the attribute 'publicly_accessible' is checked; matching methods are added to a list, and the sorted list is returned when OPTIONS is called. But these allowed methods change if the storage node config sets the server to be a replication server. To allow this check, I extended the base class to include the replication-server check. The __call__ (2) method was making this check earlier; I modified __call__ to access the methods from the base class.

I proceeded to add tests for it and everything went fine – almost. One of the existing Swift tests started failing. It was failing because the __call__ method is tested using a MagicMock (3). One of the community members pointed out that a MagicMock is callable but is not a method. __call__ makes the replication-server check by calling the allowed methods from the base class, and the base class was using the method predicate, so it was modified to use the callable predicate:

inspect.getmembers(self, predicate=callable)

That solved the issue for the test and I went on to add further tests to assert the OPTIONS functionality and also the replication server check in __call__ method.
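A small self-contained sketch of the whole pattern – a publicly_accessible marker, getmembers with the callable predicate, and the sorted allowed-methods list. The decorator and server class here are invented for illustration (Swift's real decorator is named public):

```python
import inspect

def public(func):
    # mark a method as publicly accessible, as Swift's @public decorator does
    func.publicly_accessible = True
    return func

class FakeServer:
    @public
    def GET(self):
        pass

    @public
    def HEAD(self):
        pass

    def _replicate(self):       # not public: no marker attribute
        pass

server = FakeServer()
allowed = sorted(
    name for name, m in inspect.getmembers(server, predicate=callable)
    if getattr(m, 'publicly_accessible', False)
)
print(allowed)  # → ['GET', 'HEAD']
```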


(1)OPTIONS method requests information on communication options available for a specific resource, either at the origin or an intermediary. In Swift’s case, it is at origin which is the proxy server.

(2) __call__: if this method is implemented for a class, it makes class instances callable. For example:

class Foo:
    def __init__(self, a, b):
        ...

    def __call__(self, a, b):
        ...

x = Foo(a, b)     # -> calls the __init__ method
x(a, b)           # -> calls the __call__ method

(3) Magic Mock: unittest.mock is Python’s test library. MagicMock is a subclass of Mock with default implementations of most of the magic methods. (This is a very interesting concept and it deserves an entire post about it along with examples.)

OpenStack Swift

From this post and over the next few, I will write about my current contributions to OpenStack Swift as part of the Outreach Program by the GNOME Foundation.

Swift is an object storage engine written in Python. I am working on making sure clusters are set up with the correct configuration. To do so, the OPTIONS * verb is used to get information about the servers, which the Swift client can then access to make sure the configuration is done right.

Before diving into the details of the implementation, this post will brief over swift architecture.

Swift object storage allows you to store and retrieve files. It is a distributed, API-accessible storage platform for static data. Data is stored in a structured three-level hierarchy – account, container, object.

Account server – account storage contains metadata (descriptive information) about itself and the list of containers in the account.
Container server – the container storage area has metadata about itself (the container) and the list of objects in it.
Object server – the object storage location is where the data object and its metadata are stored.

Swift Architecture

A storage node is a machine that runs Swift services. A cluster is a collection of one or more nodes. Clusters can be distributed across different regions. By default, Swift stores three replicas (of partitions) for durability and resilience to failure.

The Auth system:

TempAuth – here, the authentication part can be an external system or a subsystem within Swift. The user passes an auth token to Swift, and Swift validates it with the auth system (external or internal). If valid, the auth system passes back an expiration date, and Swift stores the expiration in its cache.

Keystone Auth – Swift can authenticate against OpenStack Keystone system.

Extending Auth – this can be done by writing a new WSGI middleware, just as the Keystone project does.

Proxy server – the proxy server lets you interact with the rest of the architecture. It takes incoming requests and looks up the location of the account, container or object in the ring.

The ring – a ring defines the mapping between entities stored on disk and their physical locations. There are separate rings for accounts and containers, and one object ring per storage policy (more on this below). To determine the location of any account, container or object, we need to interact with the ring. The ring is also responsible for determining which devices are used for handoff in failure scenarios.
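A simplified sketch of the lookup: Swift hashes the object path with MD5 and keeps the top bits of the hash (the ring's "part power") as the partition number. The real ring also salts the hash with a cluster-wide prefix/suffix and then maps the partition to devices via a lookup table, both omitted here; the path and part power below are example values.

```python
import hashlib

part_power = 10                               # 2**10 = 1024 partitions
path = '/AUTH_test/photos/cat.jpg'            # /account/container/object
digest = hashlib.md5(path.encode()).digest()
# take the top 32 bits of the hash, keep only the top part_power bits
partition = int.from_bytes(digest[:4], 'big') >> (32 - part_power)
print(partition)                              # a number in [0, 1024)
```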

Storage policies – storage policies provide a way to offer different feature levels and services in the way an object is stored. For instance, some containers might have the default 3x replication, while newer containers could use 2x replication. Once a container is created with a storage policy, all the objects in it are also created with the same policy.

Details of these will be explored as I move ahead with my work.

Sources: a lot of my understanding and this briefing come from reading the docs and from making contributions to the code.

HTTP using cURL

Wikipedia says, “HTTP is the foundation of data communication for the World Wide Web.”

And how does it do that? The web communicates by exchanging data between clients and servers. The HTTP (Hypertext Transfer Protocol) standard facilitates this data exchange. HTTP uses the client-server model: an HTTP client opens a connection and sends a request message to an HTTP server; the server then returns a response message, usually containing the resource that was requested. After the response is delivered, the server closes the connection. HTTP uses verbs (or methods) to tell the server what to do with the data identified by the URL.

The most commonly used HTTP verbs are POST, GET, PUT and DELETE. They correspond to the create, read, update and delete (CRUD) operations, respectively. There are other verbs too – HEAD, OPTIONS, TRACE, CONNECT – more information can be found at the official source, the IETF.

I’ll use the cURL command-line tool to demonstrate what these verbs do. cURL is used to transfer data with URL syntax between servers using the supported protocols (HTTP, FTP, LDAP, etc.). cURL comes preinstalled on most operating systems; otherwise, a simple yum install curl (or apt-get install curl) does it.

GET only retrieves information (headers, message body) and does not make any modifications. GET is the default method if no specific method is mentioned. For instance,

curl -v http://www.google.co.in

This uses the default method GET and returns the headers and the body. The -v option makes the request more verbose. Given below is part of the information you see for the command above:

GET / HTTP/1.1
User-Agent: curl/7.32.0
Host: www.google.co.in
Accept: */*

HTTP/1.1 200 OK
Date: Sun, 07 Dec 2014 10:34:06 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html;

The first part represents the request sent. GET is the method. The slash represents the URL path given (we are directly requesting the host, google.co.in, and not anything under it). HTTP/1.1 is the version. The next three lines are headers giving information about the request.

The next part is the response from the server. The format followed is: version, response code, date, and meta-information about the requested resource. This is followed by the document, also called the message body. The request returns response code 200 (OK) when there is no error, and commonly 404 (NOT FOUND) or 400 (BAD REQUEST) in error cases.

HEAD returns the headers (meta-information) of the requested resource. It is identical to GET except that it does not return a message body in the response. GET and HEAD are hence considered safe methods, since they take no action other than retrieval.

PUT is mostly used for updating or modifying a resource, with the request body containing the newly updated representation of the original resource. It can also create a resource, but this is considered confusing (since the URI refers to a non-existent resource) and POST is recommended for creation (more on that below). On a successful update, a PUT request returns response code 200 (or 204 if there is no content in the body).

curl -v -X PUT http://example.com/EmployeeInfo/12345/address -d "modified address"

The above contains the information on the modified resource, while -d is used to pass the data. The -X option is used to specify a method (other than the default GET).
For obvious security reasons you will not see this information being passed in URLs in end-user applications and systems; instead it is wrapped in a form, or some scripting is used. PUT is not a safe operation, as it involves modification, but it is idempotent: repeating the same PUT operation leaves the resource in the same state. There can be scenarios (like incrementing a counter) where PUT is no longer idempotent. For such non-idempotent cases, POST is recommended.

The POST verb is often used to create resources. It requests that the server accept the enclosed entity as a subordinate of the resource identified by the URL. The identifier of the new resource is often assigned by the server, which takes care of associating it with the parent (original) resource.

curl -v -X POST http://example.com/EmployeeInfo/

This indicates the creation of a resource under EmployeeInfo by the server. On successful creation, it returns HTTP status code 201. POST is neither safe nor idempotent: two identical POST requests will result in the creation of two resources with the same information. Hence it is recommended for non-idempotent requests.

DELETE deletes the resource identified.

curl -v -X DELETE http://example.com/EmployeeInfo/12345/

On successful deletion of the resource, a 200 response is returned; 202 if the request is accepted but yet to be enacted; 204 if it is enacted and there is no content in the response body. DELETE is idempotent, since deleting the same resource again leaves the same result (the entity is deleted; note that 404 (NOT FOUND) is the response code once the resource is deleted and can no longer be found). In cases where DELETE decrements a counter, it is no longer idempotent.

OPTIONS represents a request for information about the communication options available for a resource, or the capabilities of a server. OPTIONS with an asterisk (*) is intended to apply to the server rather than to a particular resource. The usual response is 200 (OK) with an Allow header that specifies the HTTP methods that can be used.

curl -i -X OPTIONS http://example.com

An example response would be:

HTTP/1.1 200 OK
Allow: GET, HEAD, POST, OPTIONS
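To see a real Allow header come back, here is a small self-contained demo using Python's http.server, with a hypothetical handler that answers OPTIONS (this stands in for any server; the Allow list is made up):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_OPTIONS(self):
        self.send_response(200)
        self.send_header('Allow', 'GET, HEAD, OPTIONS')
        self.end_headers()

    def log_message(self, *args):   # keep the demo quiet
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)   # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection(*server.server_address)
conn.request('OPTIONS', '*')                     # asterisk form: whole server
resp = conn.getresponse()
print(resp.status, resp.getheader('Allow'))      # → 200 GET, HEAD, OPTIONS
server.shutdown()
```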

Detailed information about curl can be found at the cURL site.
More information about HTTP/1.1 is in RFC 2616.