Introduction
At the beginning, we clustered our Spark applications using Apache YARN as our main Resource Manager (RM). At that time we treated the RM as little more than a "Spark extension", used essentially to optimize Spark processes and nothing else. Our usage of YARN never went beyond deploying those applications and monitoring them via web browser, by typing something like http://yarn.url:4040/<spark_app>.
Problems with YARN came when we needed to deploy applications using different Spark versions, which were therefore submitted to the cluster as Docker containers. Already at the first deployment attempt, this requirement exposed a set of structural issues and unplanned limitations in YARN. One example: deployed containers were assigned a private IPv4 address (e.g. 10.0.0.20) on the slave, which of course prevented us from reaching the Spark monitoring interface from outside.
After some brainstorming, we worked around this by setting up an IPv6 stack on each YARN slave, thus assigning an IPv6 address to each container and finally reaching the exposed Spark web interface. It worked, but then we moved to Mesos.
Mesos
The day we migrated to Apache Mesos, we quickly got the feeling of dealing with a more general-purpose RM, not only suited for Spark data processing but also capable of clustering other applications. This suddenly opened our minds to all the possibilities offered by an RM.
After a short while, we planned a fairly substantial infrastructure optimization, deciding which of the Data Engineering applications currently running on physical machines (non-Spark applications as well) could effectively be moved into Mesos.
We installed Marathon as a framework on top of Mesos. By migrating to this new cluster, those applications would gain several features, such as instance scalability, automatic respawning in case of failure and so on. Last but not least, we could also shut down some physical machines, saving a respectable amount of money per month. So, why not move other types of services as well, like DevOps management web interfaces and more?
All of this was exciting, but at that point the question was: how do we reach those apps inside the cluster?
A dedicated DNS for the Mesos cluster
Mesos-DNS is a stateless service connected to the Mesos Master. It provides name resolution for the tasks running in the cluster, speaking the common DNS protocol. It works by continuously polling the Mesos Master, which knows the location and status of every application in the cluster. Applications deployed via a framework (Marathon) are recorded under a name composed as "task.framework.domain".
For instance, in our setup app01 will be resolvable as "app01.marathon.plista".
Mesos-DNS translates this name into the IP of the slave where app01 is running and updates its internal record table (records generation).
Mesos-DNS therefore records which slave is currently hosting which applications and directs client requests accordingly.
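As a reference, a minimal Mesos-DNS config.json could look like the sketch below. This is only illustrative: the master address, refresh interval, TTL and resolvers are assumptions, while the domain matches the "plista" suffix used in our names and the resolvers match the external DNS servers shown later.

{
  "masters": ["mesos-master01.plista.com:5050"],
  "refreshSeconds": 60,
  "ttl": 60,
  "domain": "plista",
  "port": 53,
  "resolvers": ["8.8.8.8", "8.8.9.9"],
  "listener": "0.0.0.0",
  "timeout": 5,
  "dnson": true,
  "externalon": true
}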
In our implementation, Mesos-DNS is deployed twice, on two different Mesos slaves, as Docker containers via Marathon, using a feature called "constraints" which prevents the same task from being deployed twice on the same host. This, of course, with the intent of running the "primary" and "secondary" DNS separately.
An example of the "constraints" string used to pin the two Mesos-DNS instances on two specific, different slaves:
...
"instances": 2,
"constraints": [
  ["hostname", "LIKE", "mesos-ns[0,1].plista.com"],
  ["hostname", "UNIQUE"]
],
...
The Docker containers are configured with a bridged networking setup and bind port 53 on the public interface of the slave host.
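Putting the pieces together, the corresponding Marathon app definition could look roughly like the sketch below. The Docker image name, resource values and port mapping are assumptions for illustration, not our exact production definition; the constraints are the ones shown above.

{
  "id": "/mesos-dns",
  "cpus": 0.5,
  "mem": 256,
  "instances": 2,
  "constraints": [
    ["hostname", "LIKE", "mesos-ns[0,1].plista.com"],
    ["hostname", "UNIQUE"]
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "mesosphere/mesos-dns",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 53, "hostPort": 53, "protocol": "udp" }
      ]
    }
  }
}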
Testing Mesos-DNS
In order to use the services in the Mesos cluster, it is sufficient to add the IPs of the Mesos slaves where Mesos-DNS is running at the top of the /etc/resolv.conf file, just like any other DNS server:
#
# This file is automatically generated.
#
nameserver 11.22.33.44   # Mesos-DNS01
nameserver 44.33.22.11   # Mesos-DNS02
nameserver 8.8.8.8       # DNS server
nameserver 8.8.9.9       # DNS server
Mesos-DNS will look up a given name and, in case the request cannot be satisfied, it will forward it down to the external resolvers (8.8.8.8 and 8.8.9.9 in the example).
Using Dig, it’s possible to verify that our requests are correctly served by the Mesos-DNS server (11.22.33.44 in the example):
$ dig app01.marathon.plista
..
;; QUESTION SECTION:
;app01.marathon.plista.        IN  A

;; ANSWER SECTION:
app01.marathon.plista.  60  IN  A  88.88.88.88   # (other Mesos slave)

;; Query time: 0 msec
;; SERVER: 11.22.33.44#53(11.22.33.44)
Marathon and HAproxy
In our heterogeneous Mesos cluster environment, each Mesos slave normally runs more than one application (container), each listening on its own port.
Our original idea was to reproduce a working environment for our applications as close as possible to the original one, providing them with features such as HTTPS and load balancing.
It was clear from the beginning that the tool capable of accomplishing this mission was HAproxy, installed on each slave, together with a mechanism to always get a fresh "application/slave:port" assignment map from the framework (Marathon).
We obviously refer to HAproxy 1.6, which supports HTTPS.
The assignment map I am referring to is retrievable by querying the Marathon API, exposed on port 8080:
curl http://marathon-host.domain.com:8080/v2/apps/app01
...
{
  "appId": "/app01",
  "host": "slaveXX.domain.com",
  "id": "app01.e8a834748-d2521-1225",
  "ports": [ 35552 ],
  ...
}
Easy.
After providing each Mesos slave with an HAproxy installation, we just needed a tool to perform that query against the Marathon API regularly and produce an updated HAproxy configuration. We initially wrote our own cron-haproxy-updater custom script (a simplified sketch follows the list below), which basically:
- queries Marathon periodically via cron (* * * * *) to always retrieve the newest assignment map
- parses the Marathon reply and writes an haproxy.cfg file to a temporary location (/tmp)
- compares the current haproxy.cfg with /tmp/haproxy.cfg and, in case of any change, overwrites the file and reloads HAproxy
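The following is a minimal sketch of such an updater, not our actual production script: it assumes the Marathon /v2/tasks endpoint, a systemd-managed HAproxy, and the hypothetical hostname and paths shown; the rendered configuration is also heavily simplified.

#!/usr/bin/env python
# Simplified sketch of a cron-driven haproxy updater (illustrative only).
import filecmp
import os
import shutil
import subprocess

import requests

MARATHON = "http://marathon-host.domain.com:8080"   # assumed Marathon endpoint
LIVE_CFG = "/etc/haproxy/haproxy.cfg"
TMP_CFG = "/tmp/haproxy.cfg"

# Static head of the configuration; in a real setup this part is much larger.
STATIC_HEAD = "global\n  daemon\n  log /dev/log local0\n\ndefaults\n  mode http\n"


def fetch_tasks():
    # /v2/tasks lists every running task with its slave host and the
    # host ports assigned by Mesos.
    resp = requests.get(MARATHON + "/v2/tasks",
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    return resp.json()["tasks"]


def render_config(tasks):
    # One backend per application, one server line per task.
    lines = [STATIC_HEAD]
    apps = {}
    for task in tasks:
        apps.setdefault(task["appId"], []).append(task)
    for app_id, app_tasks in sorted(apps.items()):
        backend = app_id.strip("/").replace("/", "_")
        lines.append("backend %s\n  mode http\n  balance roundrobin" % backend)
        for t in app_tasks:
            if t.get("ports"):
                lines.append("  server %s %s:%d" % (t["id"], t["host"], t["ports"][0]))
    return "\n".join(lines) + "\n"


def main():
    with open(TMP_CFG, "w") as f:
        f.write(render_config(fetch_tasks()))
    # Overwrite and reload only if the rendered config differs from the live one.
    if not os.path.exists(LIVE_CFG) or not filecmp.cmp(TMP_CFG, LIVE_CFG, shallow=False):
        shutil.copy(TMP_CFG, LIVE_CFG)
        subprocess.call(["systemctl", "reload", "haproxy"])


if __name__ == "__main__":
    main()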
We worked with our custom cron-haproxy-updater for a while, carrying on with the plan to move more applications into Mesos. However, running this cron tool exposed an evident and blocking limitation: relying on the cron scheduler put every newly deployed application in a potential downtime of up to 60 seconds, the time between an application deployment and the next HAproxy refresh.
In the plan to migrate a critical service into the cluster, this 60-second time frame was a serious impediment.
This led us to move to a conceptually similar, but far better implemented solution.
Zero downtime deployments
marathon-lb is what essentially solved our problem. The biggest innovation brought by this Python tool is its capability to stay constantly connected to the Marathon API and wait for events.
To make a long story short, each Mesos slave runs its own marathon-lb instance. Every time an application is deployed or destroyed via Marathon, an event is generated and all the HAproxy instances on the slaves are triggered to refresh at the same time.
The persistent connection to the events endpoint on Marathon is provided by the --sse option, which stands for Server-Sent Events, one of the possible ways to launch a marathon-lb instance. This "zero downtime deployments" behaviour is the best fit for moving critical services into Mesos.
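For reference, launching such an instance looks roughly like the command below. Apart from --sse, the flags, the Marathon URL and the reload command are assumptions based on the marathon-lb documentation, not our literal invocation:

$ ./marathon_lb.py --marathon http://marathon-host.domain.com:8080 \
    --haproxy-config /etc/haproxy/haproxy.cfg \
    --group external \
    --command "systemctl reload haproxy" \
    --sse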
SSE is actually not the only benefit we got from adopting this solution. marathon-lb has an interesting design: you can define static parts of your final haproxy.cfg simply by placing template files into a dedicated templates directory. Each file then corresponds to a specific section in haproxy.cfg:
* HAPROXY_HTTP_FRONTEND_HEAD (with the following content):
frontend apps_http
  bind *:80
  mode http
  redirect scheme https code 301 if { hdr(Host) -i my_app.domain.com } !{ ssl_fc }
* HAPROXY_HTTPS_FRONTEND_HEAD:
frontend apps_https
  bind *:443 ssl crt /etc/ssl/marathon_lb-certs/hp-lb.pem
  mode http
  use_backend my_app
* HAPROXY_HTTP_BACKEND
* HAPROXY_HTTP_FRONTEND_ACL
And so on. We maintain and distribute those template files via provisioning on each slave.
Those sections keep their configuration persistent. The rest of the haproxy.cfg file is dynamically written or removed according to Marathon events (applications deployed/destroyed/suspended).
The following is just an example of an haproxy.cfg file that shows both static and dynamic sections:
# +++++ STATIC (configuration coming from the static template files)
global
  daemon
  log /dev/log local0
<...>
  mode http
  stats enable
  monitor-uri /_haproxy_health_check

frontend apps_http
  bind *:80
  mode http
  redirect scheme https code 301 if { hdr(Host) -i static_app.domain.com } !{ ssl_fc }

frontend static_app_443
  bind *:443 ssl crt /etc/ssl/marathon_lb-certs/hp-lb.pem
  mode http
  use_backend static_app

backend static_app
  mode http
  balance roundrobin
  server mesos_slaveX_com 11.22.33.44:8080
  server mesos_slaveY_com 44.33.22.11:8080

# +++++ DYNAMIC (populated by application events in Marathon "--sse")
frontend app01
  bind *:3000
  mode http
  use_backend app01_3000

frontend app02
  bind *:4000
  mode http
  use_backend app02_4000

backend app01_3000
  mode http
  option forwardfor
  server mesos_slaveX.domain_com_3000 22.33.44.55:3000

backend app02_4000
  mode http
  option forwardfor
  server mesos_slaveY.domain_com_4000 33.44.55.66:4000
What’s next?
There are several prospects.
Now that applications are reachable inside the Mesos cluster, we could also let some DevOps processes benefit from it. We could start playing with our Continuous Integration system (Jenkins), putting it into the cluster, allocating a bunch of resources and scaling instances as much as needed, spreading the builds over the nodes. It seems that eBay has already done a remarkable job in that direction (going much further than our goals, actually).
AUTHOR: SIMONE ROSELLI