The Google Summer of Code (GSoC) 2018 program has come to an end, and we followed up with CNCF’s seven interns, previously featured in this blog post, to check in on how their summer projects progressed.
Over the summer, Jiacheng Xu, a graduate student majoring in Computational Science and Engineering at École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland, worked with mentors Miek Gieben and Yong Tang on “Conditional Name Server Identifier for CoreDNS,” a project to identify nodes without domain name collisions in a distributed TensorFlow cluster.
My Google Summer of Code project was to write a CoreDNS plugin named idetcd, which identifies nodes in a cluster without domain name collisions. CoreDNS is a fast and flexible DNS server, and it is now part of Kubernetes.
Motivation
In a distributed system, identifying the nodes in a cluster is a big challenge. In a cluster that contains thousands of nodes, nodes frequently go down, start, or restart, and it would be quite annoying if a reboot were needed after every change in cluster membership. Tackling this problem usually requires complicated protocols or additional DevOps work. What’s more, in most cases every node needs its own customized configuration, which takes a lot of work and adds risk to managing the system.
Distributed TensorFlow encounters similar problems[1]. In older versions of distributed TensorFlow, adding a node to the cluster is not easy: you first need to bring the whole system down, then customize a configuration for the new node, and finally restart the whole system. This requires additional DevOps work, and it is not “friendly” to machine learning enthusiasts who want to set up their own distributed TensorFlow clusters. For people who are not familiar with distributed systems, it would be great to be able to add or delete a node with one or two commands.
In practice, there are several approaches to this problem. For example, as mentioned before, building a protocol on top of the current TensorFlow codebase can certainly achieve this goal, but it is probably not a good way, since it may require changing the structure of TensorFlow and brings unnecessary complexity. Also, people who are not familiar with the infrastructure of TensorFlow still cannot do anything if they run into problems when deploying their own clusters. A more flexible way is to add a separate module, such as a DNS server, and have nodes expose themselves through DNS.
CoreDNS has a plugin-based architecture, and it is a lightweight, flexible, and extensible DNS server that makes it easy to enable a customized plugin. To solve this issue, we can set up CoreDNS plus a customized plugin on every node in the TensorFlow cluster and use the plugin to write and read DNS records in a distributed key-value store, such as ZooKeeper or etcd. This is what idetcd does.
How it works
The figure[2] above shows how it works. The idea is quite simple: set up a CoreDNS server on every node, and let each node expose itself by taking a free domain name.
In detail, before the cluster is started, we set up a CoreDNS server on every node in the cluster. Every node uses the same configuration (see the example below for details), which specifies the domain name pattern of the nodes in this cluster, such as worker{{.ID}}.tf.local., the maximum number of nodes allowed in this cluster, and the etcd endpoints. Then we just start up all the nodes. That’s it!
Note that at startup, none of the nodes has exposed itself to the other nodes yet. Once the CoreDNS server starts on a node, the node tries to find a free slot in etcd to expose itself. For example, a node may first try to take worker1.tf.local. and check whether this domain name already exists in etcd. If it does, the node increases the ID to 2 and looks into etcd again; otherwise, it takes the name and writes it to etcd. In this way, every node can dynamically find a domain name for itself without any collision. We also don’t need to customize the configuration for every node; instead, we use the same configuration and let the nodes expose themselves!
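The slot-finding loop described above can be sketched in Go. This is only a minimal illustration, not the idetcd implementation: the store type is a hypothetical in-memory stand-in for etcd, the names putIfAbsent, claimName, and errClusterFull are made up for this example, and the name format is hard-coded instead of being rendered from the configured pattern.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for etcd (an assumption
// made so the sketch runs without a real etcd cluster).
type store struct {
	mu      sync.Mutex
	records map[string]string // domain name -> IP address
}

// putIfAbsent writes the record only if the name is still free and
// reports whether the write happened. Check and write share one lock,
// so two nodes can never take the same name.
func (s *store) putIfAbsent(name, ip string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, taken := s.records[name]; taken {
		return false
	}
	s.records[name] = ip
	return true
}

// errClusterFull is returned when every ID up to the limit is taken.
var errClusterFull = errors.New("no free slot: node limit reached")

// claimName tries IDs 1..limit in order and takes the first free name.
func claimName(s *store, ip string, limit int) (string, error) {
	for id := 1; id <= limit; id++ {
		name := fmt.Sprintf("worker%d.tf.local.", id)
		if s.putIfAbsent(name, ip) {
			return name, nil
		}
	}
	return "", errClusterFull
}

func main() {
	s := &store{records: map[string]string{}}
	// Three nodes compete for a cluster limited to two slots.
	for _, ip := range []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} {
		name, err := claimName(s, ip, 2)
		fmt.Println(ip, name, err)
	}
}
```

With a limit of 2, the first two nodes take worker1.tf.local. and worker2.tf.local., and the third gets an error. Against a real etcd, the check and the write would presumably have to be a single atomic operation (for instance an etcd transaction) to stay collision-free when nodes start at the same time; how idetcd handles this is not covered in this post.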
Usage
Syntax
CoreDNS uses a configuration file called Corefile to specify its configuration; please see the CoreDNS GitHub repo for more details. Here is a snippet for idetcd syntax:
- endpoint ENDPOINT the etcd endpoints. Defaults to “http://localhost:2379”.
- limit LIMIT the maximum number of nodes in the cluster. If a node tries to expose itself after the number of nodes in the cluster has hit this limit, it will fail.
- pattern PATTERN the domain name pattern that every node in the cluster follows. The pattern is written as a Go template.
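To make the pattern option more concrete, here is a small sketch of how a pattern such as worker{{.ID}}.tf.local. can be rendered with Go’s standard text/template package. The helper name renderName and the anonymous struct carrying the ID field are assumptions for illustration; the plugin’s own code may be organized differently.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderName fills the {{.ID}} placeholder of a pattern with a node ID.
// (Hypothetical helper, not taken from the idetcd source.)
func renderName(pattern string, id int) (string, error) {
	tmpl, err := template.New("name").Parse(pattern)
	if err != nil {
		return "", err // e.g. an unclosed {{ in the pattern
	}
	var buf bytes.Buffer
	// The template sees a value with a single exported field, ID.
	if err := tmpl.Execute(&buf, struct{ ID int }{ID: id}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	name, err := renderName("worker{{.ID}}.tf.local.", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // prints "worker3.tf.local."
}
```

Parsing the pattern up front means that a malformed pattern is reported as an error instead of producing a broken domain name.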
Example
In the following example, we start up a cluster that contains 5 nodes. On every node, we can get this project by:
Before you move to the next step, make sure that you’ve already set up an etcd instance, and don’t forget to write down the endpoints.
Then you need to add a Corefile, which specifies the configuration of the CoreDNS server, in the same directory as main.go. A simple Corefile example is as follows; please see the CoreDNS GitHub repo for more details.
Then you can generate the binary by:
Alternatively, if you have Docker installed, you could also execute the following to build:
Then run it by:
After that, all nodes in the cluster try to find free slots in etcd to expose themselves. Once they succeed, you can look up the domain name of any node from any other node in the same cluster by:
IPv6 is also supported:
Integration with AWS
Configuring a cluster with CoreDNS and the idetcd plugin is a one-time process, unlike the usual configuration process. For example, if you want to set up a cluster of several instances on AWS, you can use the same configuration for every instance and let all the instances expose themselves during the init process. This can be achieved by using cloud-init in user data. Here is an example bash script for AWS instances to execute at launch:
Experience with GSoC and CoreDNS
I really appreciate my mentor Yong Tang. He answered my questions patiently every time. Instead of telling me the correct answer directly, he always gave me some background on the problem, provided several approaches, and then explained the pros and cons of each one, so I learned a lot from talking with Yong. He also shared experiences from his own career development and gave me a lot of super useful advice about my future career. He was really responsible, always replied to my emails within an hour, and provided me with super solid technical support. I think he is the best mentor I’ve ever met, and I don’t know how he could be any better!
CoreDNS is also a really nice project. All the developers there were so warm to me, and every time I posted an issue on the CoreDNS GitHub repo, they gave me a clear answer and technical help within a couple of hours.
Overall, I am really happy to have been a part of GSoC this summer. I enjoyed the project a lot and learned a lot from it, and now I am an open source lover!