An in-depth analysis of container networking and iptables

Last week, a friend in the group asked about the relationship between Docker and iptables. Let's talk about it in detail here.

Docker gives us powerful and flexible networking, largely thanks to its integration with iptables. In everyday use you may not notice what iptables is doing, because Docker configures the relevant rules for us automatically.

(MoeLove) ➜ ~ dockerd --help | grep iptables
      --iptables   Enable addition of iptables rules (default true)

The docker daemon has an --iptables parameter, which controls whether Docker adds iptables rules automatically. It is on (true) by default, so we usually don't pay much attention to what it does.

In this article, to avoid interference from my local environment, I will use a Docker-in-Docker (dind) environment for the demonstrations. It can be started as follows:

(MoeLove) ➜ ~ docker run --rm -d --privileged docker:dind
f323aef7b532ba6d575ca6f9444a08f1a55f2447afec2e853954694c034e6ae0

Contents

1 iptables basics
2 Docker networking and iptables
3 containerd and iptables
4 Summary

iptables basics

iptables is a tool for configuring the Linux kernel firewall. It can be used to inspect, modify, forward, redirect, and drop IPv4 packets. It relies on the kernel's ip_tables functionality, so it requires Linux kernel 2.4 or later.

To make management easier, iptables is organized into several tables, one per purpose. Each table contains a number of predefined chains, each chain contains rules that are traversed in order, and each rule defines match conditions and a target (the action to take on a match).

For users, what we usually need to interact with are chains and rules.
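To make the table, chain, and rule hierarchy concrete, here is a minimal Python sketch of how a packet might be walked through a chain. It is only a toy model under my own assumptions: the class names, the `MY-CHAIN` chain, and the packet fields are hypothetical, and real iptables has many more targets and match types.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Packet = Dict[str, str]


@dataclass
class Rule:
    match: Callable[[Packet], bool]  # match conditions
    target: str  # "ACCEPT", "DROP", "RETURN", or the name of another chain


@dataclass
class Table:
    name: str
    policies: Dict[str, str]  # default policy of each built-in chain
    chains: Dict[str, List[Rule]] = field(default_factory=dict)

    def traverse(self, chain: str, packet: Packet) -> Optional[str]:
        """Walk the rules in order; None means 'go back to the calling chain'."""
        for rule in self.chains[chain]:
            if not rule.match(packet):
                continue
            if rule.target in ("ACCEPT", "DROP"):
                return rule.target
            if rule.target == "RETURN":
                return None
            # Any other target is treated as a jump to a user-defined chain.
            verdict = self.traverse(rule.target, packet)
            if verdict is not None:
                return verdict
        return None

    def verdict(self, builtin_chain: str, packet: Packet) -> str:
        """Fall back to the chain policy when no rule decides the packet."""
        result = self.traverse(builtin_chain, packet)
        return result if result is not None else self.policies[builtin_chain]


filter_table = Table(
    name="filter",
    policies={"FORWARD": "ACCEPT"},
    chains={
        "FORWARD": [Rule(lambda p: True, "MY-CHAIN")],
        "MY-CHAIN": [
            Rule(lambda p: p["src"] == "10.0.0.5", "DROP"),
            Rule(lambda p: True, "RETURN"),
        ],
    },
)

print(filter_table.verdict("FORWARD", {"src": "10.0.0.5"}))  # DROP
print(filter_table.verdict("FORWARD", {"src": "10.0.0.9"}))  # ACCEPT
```

The second packet is not decided by any rule, so it falls back to the FORWARD policy, which is the same pattern the Docker-created chains rely on later in this article.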

There is a classic diagram to understand the main workflow of iptables:

Image source: https://www.frozentux.net/ipt…

In the diagram, the lowercase names are tables and the uppercase names are chains. Every IP packet arriving on any network interface passes through this diagram from top to bottom.

Quoted from ArchWiki

However, this is not the focus of this article, so I will not expand on it. If you are interested in iptables itself, please leave a comment, and I can write a complete article about it later.

Docker networking and iptables

Next, let's look at how Docker behaves differently with iptables support turned on and off.

Turn off Docker’s iptables support

At the beginning of this article, I mentioned that the docker daemon has an --iptables parameter that controls whether it uses iptables. We use the following command to start a docker daemon with iptables support turned off.

(MoeLove) ➜ ~ docker run --rm -d --privileged docker:dind dockerd --iptables=false
7135a54c913af5e9ce69a45a0819475503ea9e3c5c673d62d9d38f0f0896179d

Enter this container and view all its iptables rules:

(MoeLove) ➜ ~ docker exec -it $(docker ps -ql) sh
/ # iptables-save    
# Generated by iptables-save v1.8.8 on Mon Dec 12 01:46:38 2022
*filter                                                                                              
:INPUT ACCEPT [0:0] 
:FORWARD ACCEPT [0:0] 
:OUTPUT ACCEPT [2:80]                                                                                
COMMIT
# Completed on Mon Dec 12 01:46:38 2022

As you can see, when the docker daemon is started with the --iptables=false parameter, no rules are output by default.

Enable Docker’s iptables support

Use the following command to start a docker daemon. The --iptables option is not passed explicitly here because it defaults to true.

(MoeLove) ➜ ~ docker run --rm -d --privileged docker:dind             
c464c5c08ecdf9129afbf217c6462236089fe0a1d11dfe7700c2985a04d8d216

View its iptables rules:

(MoeLove) ➜ ~ docker exec -it $(docker ps -ql) sh
/ # iptables-save
# Generated by iptables-save v1.8.8 on Mon Dec 12 14:48:16 2022
*nat
:PREROUTING ACCEPT [0:0] 
:INPUT ACCEPT [0:0] 
:OUTPUT ACCEPT [1:40] 
:POSTROUTING ACCEPT [1:40] 
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.18.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Mon Dec 12 14:48:16 2022
# Generated by iptables-save v1.8.8 on Mon Dec 12 14:48:16 2022
*filter
:INPUT ACCEPT [0:0] 
:FORWARD ACCEPT [0:0] 
:OUTPUT ACCEPT [2:80] 
:DOCKER - [0:0] 
:DOCKER-ISOLATION-STAGE-1 - [0:0] 
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Mon Dec 12 14:48:16 2022

As you can see, it has several more chains than when iptables support was turned off just now:

DOCKER
DOCKER-ISOLATION-STAGE-1
DOCKER-ISOLATION-STAGE-2
DOCKER-USER

And some forwarding rules have been added, which will be introduced in detail below.
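The comparison of the two iptables-save outputs can also be done mechanically. The following is a small sketch that parses the chain declarations (the lines starting with `:`) from trimmed copies of the two filter-table dumps above and prints the chains that only exist when iptables support is on. The function name and variable names are my own and purely illustrative.

```python
from typing import Dict, Set


def chains_by_table(dump: str) -> Dict[str, Set[str]]:
    """Collect chain names per table from iptables-save style text."""
    tables: Dict[str, Set[str]] = {}
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("*"):  # table header, e.g. "*filter"
            current = line[1:]
            tables[current] = set()
        elif line.startswith(":") and current is not None:
            # chain declaration, e.g. ":DOCKER-USER - [0:0]"
            tables[current].add(line[1:].split()[0])
    return tables


# Trimmed from the output with --iptables=false
DISABLED = """\
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [2:80]
COMMIT
"""

# Trimmed from the output with the default --iptables=true
ENABLED = """\
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [2:80]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
COMMIT
"""

new_chains = sorted(
    chains_by_table(ENABLED)["filter"] - chains_by_table(DISABLED)["filter"]
)
print(new_chains)
# ['DOCKER', 'DOCKER-ISOLATION-STAGE-1', 'DOCKER-ISOLATION-STAGE-2', 'DOCKER-USER']
```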

DOCKER-USER chain

Among the above-mentioned new chains, let’s first look at DOCKER-USER, which is the first to take effect.

*filter
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
...
-A DOCKER-USER -j RETURN

The rules above take effect in the filter table:

The first one, -A FORWARD -j DOCKER-USER, means that once traffic enters the FORWARD chain, it goes straight into the DOCKER-USER chain;
The last one, -A DOCKER-USER -j RETURN, means that after traffic has been processed by the DOCKER-USER chain (and nothing else has handled it), it RETURNs to the calling chain for the remaining rules to be matched.

This is actually a chain reserved by Docker for users to configure some additional rules.

By default, Docker's rules allow all external clients to access the containers. If your Docker host is on the public network, or you want to prevent other clients on the LAN from reaching the containers, you need to add a rule here.
For example, to allow access only from 100.84.94.62 and deny all other clients:

iptables -I DOCKER-USER -i <net interface> ! -s 100.84.94.62 -j DROP

In addition, Docker will clean and rebuild iptables-related rules during operations such as restarting, but the rules in the DOCKER-USER chain can be persisted and will not be affected.
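Why the example uses -I (insert at the head) rather than -A (append) can be shown with a tiny Python model of the DOCKER-USER chain. This is only a sketch under assumptions: `eth0` stands in for `<net interface>`, the packet fields are hypothetical, and the chain is reduced to a list of (match, target) pairs.

```python
import ipaddress
from typing import Callable, Dict, List, Tuple

Packet = Dict[str, str]
Rule = Tuple[Callable[[Packet], bool], str]

ALLOWED_SRC = ipaddress.ip_address("100.84.94.62")
EXTERNAL_IFACE = "eth0"  # assumption: stands in for <net interface>


def not_from_allowed(pkt: Packet) -> bool:
    """Models: -i eth0 ! -s 100.84.94.62"""
    return (
        pkt["in_iface"] == EXTERNAL_IFACE
        and ipaddress.ip_address(pkt["src"]) != ALLOWED_SRC
    )


def evaluate(chain: List[Rule], pkt: Packet) -> str:
    """Return the target of the first matching rule."""
    for match, target in chain:
        if match(pkt):
            return target
    return "RETURN"


def always(pkt: Packet) -> bool:
    return True


# The chain as Docker creates it: -A DOCKER-USER -j RETURN
inserted: List[Rule] = [(always, "RETURN")]
# iptables -I puts the new rule at the head of the chain
inserted.insert(0, (not_from_allowed, "DROP"))

# For comparison: appending (-A) would place the rule after RETURN
appended: List[Rule] = [(always, "RETURN"), (not_from_allowed, "DROP")]

stranger = {"in_iface": "eth0", "src": "203.0.113.7"}
friend = {"in_iface": "eth0", "src": "100.84.94.62"}

print(evaluate(inserted, stranger))  # DROP
print(evaluate(inserted, friend))    # RETURN
print(evaluate(appended, stranger))  # RETURN
```

In the appended variant the unconditional RETURN rule matches first, so the DROP rule is never reached.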

The implementation lives in docker/libnetwork. The following is the code related to the DOCKER-USER chain:

const userChain = "DOCKER-USER"

// arrangeUserFilterRule sets up the DOCKER-USER chain in the filter table.
// ctrl is a package-level controller variable defined elsewhere in libnetwork.
func arrangeUserFilterRule() {
    if ctrl == nil || !ctrl.iptablesEnabled() {
        return
    }
    iptable := iptables.GetIptable(iptables.IPv4)
    // Create the DOCKER-USER chain in the filter table.
    _, err := iptable.NewChain(userChain, iptables.Filter, false)
    if err != nil {
        logrus.Warnf("Failed to create %s chain: %v", userChain, err)
        return
    }

    // Add the "-j RETURN" rule to the chain.
    if err = iptable.AddReturnRule(userChain); err != nil {
        logrus.Warnf("Failed to add the RETURN rule for %s: %v", userChain, err)
        return
    }

    // Make sure FORWARD jumps to DOCKER-USER.
    err = iptable.EnsureJumpRule("FORWARD", userChain)
    if err != nil {
        logrus.Warnf("Failed to ensure the jump rule for %s: %v", userChain, err)
    }
}

You can see that the chain name is fixed in the code, and the chain and rules are created/ensured.

DOCKER-ISOLATION-STAGE-1/2 chains

The DOCKER-ISOLATION-STAGE-1 and DOCKER-ISOLATION-STAGE-2 chains serve similar purposes, so they are introduced together here.

*filter
...
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
...
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
...

These two chains isolate bridge networks from one another in two stages. A bridge network here usually means a network that goes through a bridge interface created by Docker, such as docker0.

/ # ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:11:31:97:0D
          inet addr:172.18.0.1  Bcast:172.18.255.255  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Here is an example.

First, create a network named moelove, start a container on it, and check the container's IP.

➜ ~ docker network create moelove
0d3d76dcf81fcf4b9d76ab5a7dec22737b115dddd593c73b27d27f0114cec1e2
➜ ~ docker run --rm -it --network moelove alpine
/ # hostname -i
172.22.0.2

Then start one container on the default network and another on the network created above, and ping the IP of the first container from each.

➜ ~ docker run --rm -it alpine ping -c1 -w2 172.22.0.2
PING 172.22.0.2 (172.22.0.2): 56 data bytes

--- 172.22.0.2 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss


➜ ~ docker run --rm -it --network moelove alpine ping -c1 -w2 172.22.0.2
PING 172.22.0.2 (172.22.0.2): 56 data bytes
64 bytes from 172.22.0.2: seq=0 ttl=64 time=0.092 ms

--- 172.22.0.2 ping statistics ---
1 packet transmitted, 1 packet received, 0% packet loss
round-trip min/avg/max = 0.092/0.092/0.092 ms

As you can see, containers on the same network can ping each other, while containers on different networks cannot.

DOCKER-ISOLATION-STAGE-1 first matches traffic that comes in from the bridge of a bridge network and is headed out through a different interface. If it matches, the packet enters DOCKER-ISOLATION-STAGE-2. If it does not match, it returns to the parent chain.

DOCKER-ISOLATION-STAGE-2 matches traffic whose outgoing interface is the bridge of a bridge network. A match means the packet came from the bridge of one bridge network and is destined for the bridge of another, so it is DROPped. If there is no match, it returns to the parent chain.
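The two stages described above can be sketched as a small Python function. This is a simplified model, not Docker's code: the bridge name `br-moelove` is a hypothetical stand-in for the bridge of the moelove network, `eth0` is an assumed external interface, and each stage is reduced to one rule per bridge.

```python
from typing import List


def isolation_verdict(in_iface: str, out_iface: str, bridges: List[str]) -> str:
    """Model DOCKER-ISOLATION-STAGE-1/2 for a list of Docker bridges."""
    # Stage 1: one rule per bridge,
    #   -i <bridge> ! -o <bridge> -j DOCKER-ISOLATION-STAGE-2
    for bridge in bridges:
        if in_iface == bridge and out_iface != bridge:
            # Stage 2: one rule per bridge,
            #   -o <bridge> -j DROP
            if out_iface in bridges:
                return "DROP"
            # Stage 2 ended with RETURN, so stage 1 continues.
    # Stage 1 ended with RETURN: back to FORWARD.
    return "RETURN"


bridges = ["docker0", "br-moelove"]  # br-moelove is a hypothetical name

# Default network -> moelove network: crosses two bridges, dropped
print(isolation_verdict("docker0", "br-moelove", bridges))     # DROP
# Both containers on moelove: stays on one bridge, not isolated
print(isolation_verdict("br-moelove", "br-moelove", bridges))  # RETURN
# Container on the default network going out through eth0
print(isolation_verdict("docker0", "eth0", bridges))           # RETURN
```

The first two calls correspond to the two ping results shown earlier.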

Seeing this, you may ask why isolation needs two stages. Could it be done with a single chain?

The answer is yes, a chain can be isolated. This is what Docker did in its early versions.

But back then, once there were more than 30 networks, Docker would start very slowly. So this optimization was made later, reducing the complexity of this part from O(N^2) to O(2N), and Docker no longer starts slowly.
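To get a feel for the difference, here is a rough count in Python. It assumes that the single-chain approach needs one DROP rule for every ordered pair of distinct bridges, while the two-stage approach needs one rule per bridge in each stage. The trailing RETURN rules are ignored, and the function names are my own.

```python
def single_chain_rules(n: int) -> int:
    """One '-i A -o B -j DROP' rule per ordered pair of distinct bridges."""
    return n * (n - 1)


def two_stage_rules(n: int) -> int:
    """One rule per bridge in stage 1 plus one per bridge in stage 2."""
    return 2 * n


for n in (2, 10, 30):
    print(n, single_chain_rules(n), two_stage_rules(n))
# 2 2 4
# 10 90 20
# 30 870 60
```

With only two networks the single chain is actually smaller, but the gap grows quickly as networks are added.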

DOCKER chain

Finally, let's take a look at the DOCKER chain. This is the most frequently used chain in Docker and the one with the most rules, but it is easy to understand.
If you accidentally delete the contents of this chain, containers may run into network problems, which can be fixed by restoring the rules manually or by restarting Docker.

Here we start a container and perform port mapping to see what changes will occur.

(MoeLove) ➜ ~ docker exec -it $(docker ps -ql) sh
/ # docker run -p 6379:6379 --rm -d redis:alpine
Unable to find image 'redis:alpine' locally
alpine: Pulling from library/redis
c158987b0551: Pull complete
1a990ecc86f0: Pull complete
f2520a938316: Pull complete
ae8c5b65b255: Pull complete
1f2628236ae0: Pull complete
329dd56817a5: Pull complete
Digest: sha256:518c024ec78b3074917bad2d40863e882e5297d65587e6d7c6e0b7281d9b8270
Status: Downloaded newer image for redis:alpine
6bf21bd3de78ce32617bf64a6a730c0fb50e304509a2ec3ef05ceae648334294
/ # docker ps
CONTAINER ID   IMAGE          COMMAND                  CREATED         STATUS         PORTS                    NAMES
6bf21bd3de78   redis:alpine   "docker-entrypoint.s…"   9 seconds ago   Up 8 seconds   0.0.0.0:6379->6379/tcp   friendly_spence

Then run iptables-save again and compare the current result with the previous one:

*filter
+-A DOCKER -d 172.18.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 6379 -j ACCEPT
*nat
+-A POSTROUTING -s 172.18.0.2/32 -d 172.18.0.2/32 -p tcp -m tcp --dport 6379 -j MASQUERADE
+-A DOCKER ! -i docker0 -p tcp -m tcp --dport 6379 -j DNAT --to-destination 172.18.0.2:6379

Docker added rules to both the filter table and the nat table. Their meanings are as follows:

The new rule in the filter table means: in the custom DOCKER chain, TCP traffic whose destination address is 172.18.0.2 and destination port is 6379, and which does not enter through docker0 but leaves through docker0, is accepted.

To put it simply, it allows TCP traffic destined for 172.18.0.2:6379 to flow out through docker0.

The two new rules in the nat table mean:

Perform the MASQUERADE action on TCP traffic that goes from 172.18.0.2 to port 6379 on 172.18.0.2 itself (here it can simply be understood as SNAT);
In the custom DOCKER chain, if the traffic does not enter through docker0 and the destination port is 6379, perform a DNAT action that rewrites the destination address to 172.18.0.2:6379. To put it simply, this rule is what gives us Docker's container port forwarding: traffic accessing port 6379 on the host has its destination address rewritten to 172.18.0.2:6379.
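The DNAT rule and the filter rule work together, which can be illustrated with a short Python sketch. It is a deliberately simplified model: the host address 192.0.2.10 and the interface `eth0` are assumptions for illustration, routing is not modeled, and the `--dst-type LOCAL` condition on the jump into the nat DOCKER chain is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    in_iface: str
    dst_ip: str
    dst_port: int
    proto: str = "tcp"


def nat_docker_chain(pkt: Packet) -> Packet:
    """-A DOCKER ! -i docker0 -p tcp --dport 6379 -j DNAT --to-destination 172.18.0.2:6379"""
    if pkt.in_iface != "docker0" and pkt.proto == "tcp" and pkt.dst_port == 6379:
        return replace(pkt, dst_ip="172.18.0.2", dst_port=6379)
    return pkt


def filter_docker_chain(pkt: Packet, out_iface: str) -> str:
    """-A DOCKER -d 172.18.0.2/32 ! -i docker0 -o docker0 -p tcp --dport 6379 -j ACCEPT"""
    if (
        pkt.dst_ip == "172.18.0.2"
        and pkt.in_iface != "docker0"
        and out_iface == "docker0"
        and pkt.proto == "tcp"
        and pkt.dst_port == 6379
    ):
        return "ACCEPT"
    return "RETURN"


# A client connects to port 6379 on the host (192.0.2.10 is a made-up host IP).
incoming = Packet(in_iface="eth0", dst_ip="192.0.2.10", dst_port=6379)
translated = nat_docker_chain(incoming)

print(translated.dst_ip, translated.dst_port)      # 172.18.0.2 6379
# After DNAT the packet is forwarded out through docker0.
print(filter_docker_chain(translated, "docker0"))  # ACCEPT
```

Without the DNAT step, the destination would still be the host address and the filter rule would not match.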

Of course, to provide complete access capabilities, it also needs to be coordinated with other rules listed above.

In addition, since there are many different network drivers in Docker, there are some differences in other modes that need to be noted.

containerd and iptables

With the complete removal of dockershim from Kubernetes, many people have switched the container runtime to containerd, and some even hope to replace all Docker environments with containerd.
But there are some points that need attention here. For example, the port mapping (port publishing) used in the example above is not something containerd itself provides.

In containerd, you can start the same container through a command similar to the above docker, such as:

$ ctr run docker.io/library/redis:alpine redis-1

But it has no -p or -P parameter. In other words, the ability to publish ports is provided by Docker itself.

If you really want to use this feature, how can you do it?

One way is to manage iptables rules yourself, but it is more cumbersome.

Another way is to use nerdctl, a tool built specifically for containerd and compatible with the Docker CLI. It offers far richer features than the default ctr tool.

For example:

$ nerdctl run -d --name redis-1 -p 6379:6379 redis:alpine

Get its IP, which is 192.168.40.9, and then check the iptables rules:

$ iptables -t nat -L | grep '192.168.40.9'
CNI-66888846605aa0cf860a0834  all  --  192.168.40.9         anywhere
DNAT       tcp  --  anywhere             anywhere             tcp dpt:redis to:192.168.40.9:6379

You can see that similar rules exist, so the port can be accessed normally.

Summary

This article analyzes the relationship between Docker and iptables, including the iptables rules that Docker creates when it starts and what they mean. It also uses examples to explain how Docker port mapping actually works, and how to do port mapping with nerdctl and containerd.

Container networking covers a lot of ground, but the underlying principles are the same, and similar mechanisms appear in Kubernetes as well.

Okay, that’s the content of this article.

By mucktube, February 14, 2023 (Computer)