Summary of results:
While OpenStack deploys all the necessary components to create a working infrastructure, the available ‘atomic’ images just do not work. You end up with all the VMs and the new private networking in place, but the operating systems installed on them are incorrectly configured. I believe this is an Atomic image issue rather than an OpenStack one.
Additional considerations:
The “Atomic Host” project appears to be end-of-life. Fedora no longer produces Atomic images and now releases only ‘bare’ CoreOS images. CoreOS images require a lot of extra configuration, such as creating an Ignition configuration file in JSON format for each instance just to insert an SSH key so you can log in. Even if they ever come out of ‘preview’, they may not be supported by OpenStack anyway.
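To give a sense of what that involves, here is a minimal sketch of an Ignition config whose only job is to add an SSH key. It assumes Fedora CoreOS, Ignition spec version 3.0.0 and the default ‘core’ user; other CoreOS releases used different spec versions, so check the documentation for the image you actually use. The function name make_ignition is hypothetical.

```shell
# bash: source this file, then call make_ignition.
# Writes an Ignition config (assumed spec 3.0.0) that only adds one SSH
# public key for the assumed default user "core".
make_ignition() {
  local pubkey_file="$1" out_file="$2"
  local pubkey
  # Use the first line of the public key file, e.g. ~/.ssh/id_rsa.pub
  pubkey="$(head -n 1 "$pubkey_file")"
  cat > "$out_file" <<EOF
{
  "ignition": { "version": "3.0.0" },
  "passwd": {
    "users": [
      { "name": "core", "sshAuthorizedKeys": ["${pubkey}"] }
    ]
  }
}
EOF
}
```

For example, run make_ignition ~/.ssh/id_rsa.pub core-ssh.ign and then pass the file as user data: openstack server create --image <fcos-image> --flavor <flavor> --network <network> --user-data core-ssh.ign my-fcos-vm (the angle-bracket values and the server name are placeholders). This has to be repeated for every instance, which is the extra work described above.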
The Good
The OpenStack Magnum Container Infrastructure deployment just works perfectly as far as setting up the environment is concerned:
- it creates a new private network, along with routers between the nodes and to the external network as needed
- it creates the security groups needed for Docker or Kubernetes deployments and attaches them to the new instances
- it builds and starts the master and worker nodes needed (the number of each is defined simply by entering the number required in the Horizon dashboard panel)
- you have a fully running infrastructure
The Irritating: OpenStack tweaks needed
- To use Atomic images in OpenStack, after the images have been loaded you must run ‘openstack image set --property os_distro=fedora-atomic <image-name>’. Unless that is done, the images are not available for clusters and will not be selectable in the Horizon cluster deployment panels for Magnum (and yes, for all OS types the only allowed value is os_distro=fedora-atomic)
- You must also edit /etc/magnum/magnum.conf to set “default_docker_volume_type = ” to a Cinder-supported value (use ‘cinder type-list’ to show the available options); otherwise all attempts to create a cluster using Cinder volumes will fail with a volume type not found or invalid volume type () error, as the default is a blank entry
- If you elect to assign floating IP addresses during the cluster deployment, every instance in the cluster is assigned a floating IP, so you need a lot of them. I would recommend not assigning floating IPs and just adding one to the master(s) after deployment if needed, as all cluster traffic goes across the private network
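The three tweaks above can be scripted roughly as follows. This is a sketch, not the exact commands used in these tests: it assumes default_docker_volume_type already appears (possibly commented out) in magnum.conf, GNU sed, and placeholder image, network and server names; the helper names are hypothetical. Setting DRY_RUN=1 prints the openstack commands instead of running them.

```shell
# bash: source this file, then call the functions.
# DRY_RUN=1 prints each openstack command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Tweak 1: mark an already-uploaded image so Magnum offers it for clusters.
tag_atomic_image() {
  run openstack image set --property os_distro=fedora-atomic "$1"
}

# Tweak 2: give default_docker_volume_type a real Cinder volume type.
# List candidates first with: openstack volume type list (or cinder type-list).
# Rewrites an existing, possibly commented-out, entry in place (GNU sed);
# it does not add the key if the file has no such line.
set_docker_volume_type() {
  local conf="$1" vtype="$2"
  sed -i "s/^#\?[[:space:]]*default_docker_volume_type[[:space:]]*=.*/default_docker_volume_type = ${vtype}/" "$conf"
}

# Tweak 3: deploy without floating IPs, then give only the master one.
attach_master_fip() {
  local external_net="$1" master_server="$2" fip
  if [ -n "${DRY_RUN:-}" ]; then
    echo "openstack floating ip create -f value -c floating_ip_address $external_net"
    fip="<new-floating-ip>"
  else
    fip="$(openstack floating ip create -f value -c floating_ip_address "$external_net")"
  fi
  run openstack server add floating ip "$master_server" "$fip"
}
```

In Stein the option is defined in the [cinder] section of magnum.conf, and the Magnum API and conductor services need a restart to pick up the change (on RDO-based installs these are assumed to be openstack-magnum-api and openstack-magnum-conductor). To avoid per-node floating IPs entirely, the cluster template also accepts --floating-ip-disabled.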
The Bad
- None of the Atomic Host images I used actually worked, with issues ranging from Docker being installed in a configuration that prevents swarm mode (in a Docker swarm test) to kubelet looping endlessly on startup (in a Kubernetes cluster test). These are issues with the Atomic Host images, not with OpenStack
- While OpenStack (Stein) requires images for cluster deployment to be Atomic Host images, the Atomic project is end-of-life and is being merged into, or replaced by, CoreOS (for example, the last Fedora Atomic Host image was F29 and the download page says to use CoreOS going forward). This is an issue because CoreOS (for Fedora, anyway) is still in ‘preview’, so there is no working solution even if OpenStack implements support for CoreOS images
Details of the tests I ran
This post covers attempting to deploy a Kubernetes cluster or Docker swarm on OpenStack using the publicly available Atomic cloud images for CentOS and Fedora. Note that no problem resolution or reconfiguration changes were made to try to get things working, as this was an ‘out-of-the-box’ test.
Environment used:
- OpenStack release: Stein
- CentOS 7 Atomic image: CentOS-Atomic-Host-7.20160130-GenericCloud.qcow2
- Fedora 29 Atomic image: Fedora-AtomicHost-29-20181025.1.x86_64.qcow2
- Cluster deployed for each test: one master and one worker VM
- VM sizes: all cluster VMs had 756 MB of memory allocated; disk sizes were the minimum 6 GB for F29 and 10 GB for C7
Cluster Templates used:
An example of the cluster templates used. Note that C7 needs the devicemapper storage driver and a minimum docker volume size for docker to run, as it needs space in Cinder storage; Fedora can use overlay, so its docker containers do not depend on Cinder storage. You obviously need to update ‘external_network’ with the name of your own OpenStack external network, set your own SSH key name, and create flavours based on the values I listed in the VM sizes above.
# rexray service fails to start for C7 atomic,
# but template will not be created unless using driver rexray
openstack coe cluster template create \
  --coe swarm \
  --image C7-Atomic-Host \
  --keypair marks-keypair-stein \
  --server-type vm \
  --external-network external_network \
  --public \
  --network-driver docker \
  --flavor C7-Atomic-Host-min \
  --master-flavor C7-Atomic-Host-min \
  --volume-driver rexray \
  --docker-storage-driver devicemapper \
  --docker-volume-size 3 \
  C7-swarm-cluster-template

# Requires cinder driver, fails to create using rexray
openstack coe cluster template create \
  --coe kubernetes \
  --image C7-Atomic-Host \
  --keypair marks-keypair-stein \
  --server-type vm \
  --external-network external_network \
  --public \
  --network-driver flannel \
  --flavor C7-Atomic-Host-min \
  --master-flavor C7-Atomic-Host-min \
  --volume-driver cinder \
  --docker-storage-driver devicemapper \
  --docker-volume-size 3 \
  C7-kubernetes-cluster-template

# can use overlay for F29 docker
openstack coe cluster template create \
  --coe swarm \
  --image F29-Atomic-Host \
  --keypair marks-keypair-stein \
  --server-type vm \
  --external-network external_network \
  --public \
  --network-driver docker \
  --flavor F29-Atomic-Host-min \
  --master-flavor F29-Atomic-Host-min \
  --volume-driver rexray \
  --docker-storage-driver overlay \
  F29-swarm-cluster-template

# kubernetes needs the cinder volume driver (see the C7 note above)
openstack coe cluster template create \
  --coe kubernetes \
  --image F29-Atomic-Host \
  --keypair marks-keypair-stein \
  --server-type vm \
  --external-network external_network \
  --public \
  --network-driver flannel \
  --flavor F29-Atomic-Host-min \
  --master-flavor F29-Atomic-Host-min \
  --volume-driver cinder \
  --docker-storage-driver overlay \
  F29-kubernetes-cluster-template
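The templates reference flavours that have to exist first. Here is a sketch of creating them from the VM sizes listed in the environment above. The single vCPU is an assumption, as the vCPU count was not recorded, and create_min_flavors is a hypothetical helper; DRY_RUN=1 prints the commands instead of running them.

```shell
# bash: source this file, then call create_min_flavors.
# DRY_RUN=1 prints each openstack command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# RAM (MB) and disk (GB) follow the VM sizes above; --vcpus 1 is assumed.
create_min_flavors() {
  run openstack flavor create --ram 756 --disk 10 --vcpus 1 C7-Atomic-Host-min
  run openstack flavor create --ram 756 --disk 6 --vcpus 1 F29-Atomic-Host-min
}
```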
Observed during testing
CentOS7 Atomic Image – Docker Swarm
The Atomic Host image seems to be configured incorrectly. The docker service will not start because the rexray service is not running: docker has a dependency on docker-storage-setup.service, but the rexray service depends on the docker service running. The result is simply an endless loop of trying and failing to start services, with everything failing due to unsatisfied dependencies.
As these are Atomic images, the files simply cannot be edited to resolve the issue, so this Atomic image cannot be used for Docker swarm deployments.
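No debugging was done in these tests. If you want to look at the loop yourself, here is a sketch of the standard systemd inspection commands. The unit names are taken from the description above (rexray.service as the exact unit name is an assumption), show_docker_rexray_deps is a hypothetical helper, and DRY_RUN=1 prints the commands instead of running them.

```shell
# bash: source this file, then call show_docker_rexray_deps on the node.
# DRY_RUN=1 prints each command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

show_docker_rexray_deps() {
  # Requires/Wants/After/Before show which unit waits on which
  run systemctl show -p Id,Requires,Wants,After,Before \
    docker.service docker-storage-setup.service rexray.service
  # The last 50 journal lines for the three units in the current boot
  run journalctl -b --no-pager -n 50 \
    -u docker.service -u docker-storage-setup.service -u rexray.service
}
```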
Fedora 29 Atomic Image – Docker Swarm
Using the defaults (including the ‘flannel’ network driver), the cluster builds OK, but docker does not run:
[root@marks-swarm-sstcju3tnss4-master-0 ~]# systemctl start docker
Failed to start docker.service: Unit flanneld.service not found.
Redeploying the cluster with the ‘docker’ network driver, the cluster builds OK and docker runs on both VMs.
However, on both master and worker ‘docker info’ shows ‘swarm inactive’, and the way docker has been installed/configured prevents it from being used in a Docker swarm:
[root@marks-swarm-jv5gugeihibm-master-0 ~]# docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
[root@marks-swarm-jv5gugeihibm-master-0 ~]# docker swarm init --advertise-addr 10.0.1.138
Error response from daemon: --cluster-store and --cluster-advertise daemon configurations are incompatible with swarm mode
As these are Atomic images, the configuration files simply cannot be edited to resolve the issue, so this Atomic image cannot be used for Docker swarm deployments.
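No fix was attempted, but it is easy to confirm which daemon options trigger the error above. In this sketch, has_legacy_cluster_flags (a hypothetical helper) reads a daemon command line or config text on stdin and succeeds if it mentions cluster-store or cluster-advertise. Where Magnum sets those options on the image was not investigated, so the usage below only inspects the running process. The daemon binary may be named dockerd-current on this image; pgrep's pattern match covers that too.

```shell
# bash: source this file, then pipe daemon settings into the helper.
# Succeeds (exit 0) if the input mentions the legacy cluster options
# that the swarm init error says are incompatible with swarm mode.
has_legacy_cluster_flags() {
  grep -Eq 'cluster-(store|advertise)'
}

# Swarm state as Docker itself reports it ("inactive" on the test nodes).
swarm_state() {
  docker info --format '{{.Swarm.LocalNodeState}}'
}
```

For example: pgrep -a dockerd | has_legacy_cluster_flags && echo "legacy cluster options set; swarm init will fail".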
CentOS7 Atomic Image – Kubernetes Cluster
On the one occasion the cluster build completed, the kubelet service failed to start and hit the spawning/failing-too-fast limit. There were no obvious errors, and I did not bother debugging it.
All other attempts to deploy the same cluster timed out; the cluster build timeout was set to 3 hours.
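For reference, the cluster build timeout is set in minutes when the cluster is created (the client's documented default is 60 minutes). Here is a sketch for the one-master, one-worker layout used in these tests; the cluster name is a placeholder, the helper names are hypothetical, and DRY_RUN=1 prints the create command instead of running it. wait_for_cluster simply polls the status until it leaves CREATE_IN_PROGRESS.

```shell
# bash: source this file, then call the functions.
# DRY_RUN=1 prints the create command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# One master, one worker; the timeout argument is in minutes (180 = 3 hours).
create_cluster() {
  local template="$1" name="$2" timeout_min="${3:-180}"
  run openstack coe cluster create \
    --cluster-template "$template" \
    --master-count 1 --node-count 1 \
    --timeout "$timeout_min" \
    "$name"
}

# Print a timestamped status line until the build is no longer in progress.
wait_for_cluster() {
  local name="$1" interval="${2:-60}" status
  while :; do
    status="$(openstack coe cluster show "$name" -f value -c status)"
    echo "$(date '+%H:%M:%S') ${name} ${status}"
    if [ "$status" != "CREATE_IN_PROGRESS" ]; then
      break
    fi
    sleep "$interval"
  done
}
```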
Fedora 29 Atomic Image – Kubernetes Cluster
Timed out after 40 minutes on a ‘wait’ in the install script. The cluster build timeout was set to 120 minutes on that attempt, so this was a script timeout. See the additional notes below.
Additional Notes on Kubernetes in the atomic images
It should be noted that the Atomic Host images do not actually contain Kubernetes; its components have to be downloaded, as seen from activity on the VMs in the new cluster:
[root@marks-k8-2v4y6qj2weu5-master-0 ~]# ps -ef | grep install
root 2075 1397 16 00:41 ? 00:00:23 /usr/bin/python3 -Es /usr/bin/atomic install --storage ostree --system --system-package=no --name=kube-proxy docker.io/openstackmagnum/kubernetes-proxy:v1.11.6
[root@marks-k8-2v4y6qj2weu5-master-0 ~]# ps -ef | grep install
root 2366 2348 30 00:45 ? 00:00:21 /usr/bin/python3 -Es /usr/bin/atomic install --storage ostree --system --system-package no --set REQUESTS_CA_BUNDLE=/etc/pki/tls/certs/ca-bundle.crt --name heat-container-agent docker.io/openstackmagnum/heat-container-agent:stein-dev
The time taken to download these images may be what causes the cluster build to time out.
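One way to check that theory is to log which images are being installed over time, rather than re-running ps by hand. This is a sketch, assuming procps pgrep is available on the node; watch_atomic_installs is a hypothetical helper that prints a timestamp plus the --name value of each running atomic install, once per interval.

```shell
# bash: source this file, then run it on a cluster node during the build.
# Arguments: number of samples, seconds between samples.
watch_atomic_installs() {
  local count="${1:-60}" interval="${2:-30}" i procs names
  for ((i = 1; i <= count; i++)); do
    # The [a] bracket keeps the regex from matching its own literal text
    procs="$(pgrep -af '[a]tomic install' || true)"
    # Keep only the image name option of each install ("=x" or " x" form)
    names="$(printf '%s\n' "$procs" | grep -oE -e '--name[= ][^ ]+' | tr '\n' ' ')"
    echo "$(date '+%H:%M:%S') ${names:-nothing being installed}"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, watch_atomic_installs 120 30 samples every 30 seconds for an hour; the gaps between names changing give a rough per-image download time.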
Alternative: just use your own stack and cloud images
Using an existing private network, I manually created a Heat stack configuration for a Docker swarm of one master and one additional worker (to match the configuration used in the tests above) using a standard cloud image. This took under ten minutes to create and test, so doing things manually is almost as fast as using Magnum, with the advantages that it uses normal cloud images rather than Atomic images, and it works.
The only manual step needed after deployment is that the ‘worker’ node (or nodes, if you deploy more than one) must be joined to the swarm using the token logged on the master, as there is no way in the stack template to pass that data between instances. The easiest way to automate that would be to install SSH keys for root so each worker instance can scp the file containing the token from the master and run it (as workers can depend on the master being built). Alternatively, if your stack build is run from a script on the OpenStack host, that script could retrieve the token, push it to each worker and run the join command. For this test, however, I just cut/pasted and ran the command manually.
It should also be noted that if you had multiple worker nodes, you would create a separate template for the worker and reference it from the cluster YAML file; but this is not a Heat template post, and I wanted everything visible in one file, so I did it this way here.
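The ‘push the token from a script’ option above could look roughly like this. It is a sketch, not what was used in this test. It assumes root SSH access from the machine running the script to every node, that the workers should join via the master's private address, and the default swarm port 2377. The function names and addresses are placeholders.

```shell
# bash: source this file, then call join_workers.

# Build the join command a worker must run (2377 is the default swarm port).
swarm_join_cmd() {
  local token="$1" master_addr="$2"
  printf 'docker swarm join --token %s %s:2377\n' "$token" "$master_addr"
}

# Fetch the worker join token from the master, then run the join on each worker.
join_workers() {
  local master="$1"; shift
  local token worker
  token="$(ssh "root@${master}" docker swarm join-token -q worker)" || return 1
  for worker in "$@"; do
    ssh "root@${worker}" "$(swarm_join_cmd "$token" "$master")"
  done
}
```

For example: join_workers <master-private-ip> <worker1-ip> <worker2-ip>. Because the token is read on demand, the same call works however many workers the stack creates.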