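Before reading this chapter, you should be familiar with the following:
- Some basic Docker knowledge; ideally you have a working Docker installation on Linux.
- The basics of iptables; see the earlier article "iptables 实战" (iptables in Practice).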
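1. The Docker network model
The Docker network model is shown in the figure below.
Notes:
- The figure shows two containers, container1 and container2, each with its own network interface.
- The two containers communicate through the docker0 bridge. They are on the same LAN, with IPs 172.17.0.2 and 172.17.0.3.
- The docker0 bridge is essentially a virtual switch: packets travel between the containers over the layer-2 network.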
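On Linux, the network device that acts as a virtual switch is the bridge. It works at the data link layer (layer 2): it learns which MAC addresses sit behind which port and uses that table to forward each frame to the right port.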
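The MAC-learning behaviour described above can be sketched in a few lines of Python. This is only a toy model, not how the kernel bridge is implemented. The port names veth-a, veth-b and veth-c are hypothetical, and the MAC addresses are borrowed from the docker inspect output later in this article.

```python
class LearningBridge:
    """Toy model of a layer-2 bridge that learns which port each MAC lives on."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port name

    def receive(self, in_port, src_mac, dst_mac):
        """Handle one frame and return the list of ports it is sent out of."""
        # Learn: the source MAC is reachable through the ingress port.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination sits behind the ingress port; nothing to forward.
            return []
        return [out_port]


# Hypothetical port names; MACs taken from the docker inspect output.
bridge = LearningBridge(["veth-a", "veth-b", "veth-c"])
first = bridge.receive("veth-a", "02:42:ac:11:00:02", "02:42:ac:11:00:03")
reply = bridge.receive("veth-b", "02:42:ac:11:00:03", "02:42:ac:11:00:02")
print(first)  # flooded, because the destination has not been learned yet
print(reply)  # sent only to the port learned from the first frame
```

The first frame is flooded because the bridge has not yet seen the destination MAC. The reply goes to a single port because the source of the first frame was learned on the way in.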
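2. Container-to-container connectivity experiment
We use Docker to install Kafka, a message broker. In this setup Kafka depends on ZooKeeper, so we run two containers on one virtual machine, zookeeper and kafka, with zookeeper serving kafka.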
The installation steps are covered in the linked article "三分钟安装一个kafka" (Install Kafka in Three Minutes).
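2.1 Inspecting the host network
Once the installation is done, leave the containers stopped for now (use docker stop if they are running) and look at the network interfaces on the Linux host.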
[root@localhost ~]# ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:6ff:fe21:5ecb  prefixlen 64  scopeid 0x20<link>
        ether 02:42:06:21:5e:cb  txqueuelen 0  (Ethernet)
        RX packets 68  bytes 3888 (3.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 112  bytes 8883 (8.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::a00:27ff:fe1d:60a9  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:1d:60:a9  txqueuelen 1000  (Ethernet)
        RX packets 114  bytes 16795 (16.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 172  bytes 16485 (16.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.56.201  netmask 255.255.255.0  broadcast 192.168.56.255
        inet6 fe80::db6e:9a5d:7349:6075  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:c3:0a:37  txqueuelen 1000  (Ethernet)
        RX packets 401  bytes 32801 (32.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 294  bytes 34565 (33.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
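The output above shows several network devices:
- docker0: the bridge used by containers
- enp0s3 and enp0s8: the host's two network interfaces (the host here is itself a virtual machine, so these are virtual NICs playing the role of physical ones)
- lo: the loopback interface (127.0.0.1), i.e. the local machine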
2.2 Starting the two containers, zookeeper and kafka
[root@localhost ~]# docker ps -a
CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS                        PORTS     NAMES
0d5cb60e3a06   bitnami/rabbitmq       "/opt/bitnami/script…"   13 days ago    Exited (0) 4 minutes ago                rabbitmq
43a5066a11f5   bitnami/zookeeper      "/opt/bitnami/script…"   13 days ago    Exited (143) 11 days ago                zookeeper
922e61e655f6   bitnami/kafka:latest   "/opt/bitnami/script…"   2 weeks ago    Exited (137) 23 minutes ago             kafka
2290b7d3a4ff   nginx:latest           "/docker-entrypoint.…"   2 months ago   Exited (0) 2 months ago                 mynginx
The listing above shows the containers I have run before. Now start zookeeper and kafka:
[root@localhost ~]# docker start zookeeper
zookeeper
[root@localhost ~]# docker start kafka
kafka
Both containers are now running.
2.3 Looking at the host network again
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:6ff:fe21:5ecb  prefixlen 64  scopeid 0x20<link>
        ether 02:42:06:21:5e:cb  txqueuelen 0  (Ethernet)
        RX packets 336  bytes 43788 (42.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 323  bytes 48881 (47.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::a00:27ff:fe1d:60a9  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:1d:60:a9  txqueuelen 1000  (Ethernet)
        RX packets 134  bytes 18385 (17.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 196  bytes 18435 (18.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.56.201  netmask 255.255.255.0  broadcast 192.168.56.255
        inet6 fe80::db6e:9a5d:7349:6075  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:c3:0a:37  txqueuelen 1000  (Ethernet)
        RX packets 565  bytes 45134 (44.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 394  bytes 45995 (44.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth164e95d: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::1441:abff:feb2:fc36  prefixlen 64  scopeid 0x20<link>
        ether 16:41:ab:b2:fc:36  txqueuelen 0  (Ethernet)
        RX packets 99  bytes 21233 (20.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 124  bytes 16191 (15.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethda42807: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::183c:e8ff:feae:1af2  prefixlen 64  scopeid 0x20<link>
        ether 1a:3c:e8:ae:1a:f2  txqueuelen 0  (Ethernet)
        RX packets 169  bytes 22419 (21.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 122  bytes 28133 (27.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:ae:75:56  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
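Two new network devices have appeared since the containers started: veth164e95d and vethda42807. (virbr0 is libvirt's default bridge and has nothing to do with Docker.)
My virtual machine runs CentOS 8, where bridge link lists the bridge ports (on CentOS 7, brctl show gives the same information). It shows that veth164e95d and vethda42807 are both attached to the docker0 bridge.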
[root@localhost ~]# bridge link
18: veth164e95d@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master docker0 state forwarding priority 32 cost 2
20: vethda42807@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master docker0 state forwarding priority 32 cost 2
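By default, Docker creates a bridge named docker0 on the host. Any container attached to docker0 can communicate with the others through it.
But how do we "connect" containers to the docker0 bridge?
This is where a virtual device called a veth pair comes in.
A veth pair is always created as two virtual network interfaces (veth peers). A packet sent out of one interface appears directly on its peer, even when the two interfaces live in different network namespaces.
veth164e95d and vethda42807 are the host-side ends; the other end of each is the network interface inside a container. Whenever a container's interface sends a packet, it shows up on the corresponding host-side device, veth164e95d or vethda42807.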
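To make the "what goes in one end comes out the other" property concrete, here is a minimal Python sketch of a veth pair. It models only the pairing, not real namespaces or the kernel. The class and function names are made up for illustration, and the container-side interface name eth0 is an assumption that does not appear in the output above.

```python
from collections import deque


class VethEnd:
    """One end of a toy veth pair; 'namespace' is only a label here."""

    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace
        self.peer = None
        self.rx_queue = deque()

    def send(self, frame):
        # Whatever is transmitted on one end is received on the other end.
        self.peer.rx_queue.append(frame)


def make_veth_pair(name_a, ns_a, name_b, ns_b):
    """Create two ends and wire them to each other, like a virtual cable."""
    a, b = VethEnd(name_a, ns_a), VethEnd(name_b, ns_b)
    a.peer, b.peer = b, a
    return a, b


# "eth0" is an assumed name for the container-side interface.
container_end, host_end = make_veth_pair("eth0", "container", "veth164e95d", "host")
container_end.send("ICMP echo request")
print(host_end.rx_queue[0])
```

Because the host-side end is attached to docker0, every frame a container sends arrives at a bridge port, where the bridge can forward it.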
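2.4 Analyzing container-to-container networking
First, check which containers are running.
Port 9092 of the kafka container is mapped to port 9092 on the host, so a Kafka client can reach the broker through port 9092.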
[root@localhost ~]# docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED       STATUS         PORTS                                                                     NAMES
43a5066a11f5   bitnami/zookeeper      "/opt/bitnami/script…"   2 weeks ago   Up 6 minutes   2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp   zookeeper
922e61e655f6   bitnami/kafka:latest   "/opt/bitnami/script…"   2 weeks ago   Up 5 minutes   0.0.0.0:9092->9092/tcp, :::9092->9092/tcp                                 kafka
Next, look at the network settings of kafka and zookeeper.
[root@localhost ~]# docker inspect kafka
....omitted....
"Networks": {
    "bridge": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": null,
        "NetworkID": "6b81b63148c199d79c62758e548a80732b9401231ccd741783c220077a1d7a93",
        "EndpointID": "9824ca7180c438118e70be86d055b02c74f7ea82225db7c9be264e43ee5e6d32",
        "Gateway": "172.17.0.1",
        "IPAddress": "172.17.0.3",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:ac:11:00:03",
        "DriverOpts": null
    }
}
kafka's IP is 172.17.0.3 and its gateway is 172.17.0.1.
Now zookeeper:
[root@localhost ~]# docker inspect zookeeper
....omitted....
"Networks": {
    "bridge": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": null,
        "NetworkID": "6b81b63148c199d79c62758e548a80732b9401231ccd741783c220077a1d7a93",
        "EndpointID": "0b057f5d03cfd775de26a2de03d707e6b5b84fd0321b2d298a5399516cb75acc",
        "Gateway": "172.17.0.1",
        "IPAddress": "172.17.0.2",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:ac:11:00:02",
        "DriverOpts": null
    }
}
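zookeeper's IP is 172.17.0.2 and its gateway is 172.17.0.1. Both containers share the same NetworkID and the same /16 prefix, so they sit on one layer-2 network.
With these addresses in hand, the figure from section 1 should now be much clearer.
Conclusion 1: containers on the same host can reach each other through the docker0 bridge.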
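As a quick sanity check on Conclusion 1, the sketch below uses Python's standard ipaddress module to confirm that the two container addresses fall in the same subnet, so traffic between them needs no gateway hop. The addresses and the prefix length 16 come from the docker inspect output; the helper name on_link is made up for this example.

```python
import ipaddress

# IPAddress and IPPrefixLen as reported by docker inspect.
kafka = ipaddress.ip_interface("172.17.0.3/16")
zookeeper = ipaddress.ip_interface("172.17.0.2/16")
host_nic = ipaddress.ip_address("10.0.2.15")  # enp0s3 on the host


def on_link(src_iface, dst_ip):
    """True when dst_ip is inside the sender's own subnet (no gateway needed)."""
    return dst_ip in src_iface.network


print(kafka.network)                   # the subnet kafka is attached to
print(on_link(kafka, zookeeper.ip))    # same subnet: delivered at layer 2
print(on_link(kafka, host_nic))        # different subnet: goes via the gateway
```

For destinations outside 172.17.0.0/16, a container would instead send the packet to its gateway, 172.17.0.1, which is the docker0 address on the host.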
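3. How the host reaches a container
The analysis above shows that containers talk to each other through the docker0 bridge. So how does the host itself reach a container?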
[root@localhost ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.2.2        0.0.0.0         UG    100    0        0 enp0s3
0.0.0.0         192.168.56.100  0.0.0.0         UG    101    0        0 enp0s8
10.0.2.0        0.0.0.0         255.255.255.0   U     100    0        0 enp0s3
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.56.0    0.0.0.0         255.255.255.0   U     101    0        0 enp0s8
192.168.122.0   0.0.0.0         255.255.255.0   U     0      0        0 virbr0
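route -n prints the host's routing table. One entry says that packets destined for the 172.17.0.0/16 network are sent out through docker0, with no gateway in between.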
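The lookup the kernel performs against this table can be approximated as "longest matching prefix first, then lowest metric". The Python sketch below is a simplified model of that rule, not the kernel's actual algorithm. The rows are copied from the route -n output, with each Genmask rewritten as a prefix length.

```python
import ipaddress

# (destination network, gateway, interface, metric), copied from route -n.
ROUTES = [
    ("0.0.0.0/0", "10.0.2.2", "enp0s3", 100),
    ("0.0.0.0/0", "192.168.56.100", "enp0s8", 101),
    ("10.0.2.0/24", None, "enp0s3", 100),
    ("172.17.0.0/16", None, "docker0", 0),
    ("192.168.56.0/24", None, "enp0s8", 101),
    ("192.168.122.0/24", None, "virbr0", 0),
]


def lookup(dst):
    """Return (interface, gateway) for dst: longest prefix, then lowest metric."""
    addr = ipaddress.ip_address(dst)
    candidates = []
    for net, gateway, iface, metric in ROUTES:
        network = ipaddress.ip_network(net)
        if addr in network:
            candidates.append((-network.prefixlen, metric, iface, gateway))
    # Compare only on (prefix length, metric); the rest is payload.
    best = min(candidates, key=lambda c: c[:2])
    return best[2], best[3]


print(lookup("172.17.0.2"))  # container address: leaves via docker0, no gateway
print(lookup("8.8.8.8"))     # anything else: falls through to a default route
```

A gateway of None corresponds to the 0.0.0.0 gateway column in the table, meaning the destination is directly reachable on that interface.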
Let's ping 172.17.0.2 and, in a second terminal, capture the traffic on docker0 with tcpdump.
[root@localhost ~]# ping 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.176 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.120 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=64 time=0.134 ms
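The ping succeeds. The capture below confirms that the packets pass through the docker0 bridge on the host and reach the container directly.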
[root@localhost ~]# tcpdump -i docker0 -nn icmp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:54:22.019423 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 9341, seq 1, length 64
00:54:22.019492 IP 172.17.0.2 > 172.17.0.1: ICMP echo reply, id 9341, seq 1, length 64
00:54:23.033807 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 9341, seq 2, length 64
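Conclusion 2: the host reaches containers by their 172.17.0.0/16 addresses. A routing rule sends packets for that network to the docker0 bridge, and from there they enter the container.
4. How a container communicates with the outside network
For this demonstration we start an nginx container.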
[root@localhost ~]# docker run -d -p 8080:80 --name mynginx nginx:latest
[root@localhost ~]# docker ps
CONTAINER ID   IMAGE          COMMAND                  CREATED        STATUS         PORTS                                   NAMES
2290b7d3a4ff   nginx:latest   "/docker-entrypoint.…"   2 months ago   Up 6 seconds   0.0.0.0:8080->80/tcp, :::8080->80/tcp   mynginx
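Port 80 inside the container is mapped to port 8080 on the host. Accessing it through the host's IP works, as shown in the figure below.
How do packets from outside get into the container? A reasonable guess: when a packet arrives at the machine, destination address translation rewrites the destination from the host's address to the container's, and the packet then goes through the docker0 bridge into the container.
Address translation means NAT, so let's look at the iptables nat table.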
[root@localhost ~]# iptables -t nat -nvL
Chain PREROUTING (policy ACCEPT 211 packets, 19122 bytes)
 pkts bytes target       prot opt in        out       source               destination
   84  5992 DOCKER       all  --  *         *         0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT 74 packets, 4424 bytes)
 pkts bytes target       prot opt in        out       source               destination

Chain POSTROUTING (policy ACCEPT 691 packets, 54705 bytes)
 pkts bytes target       prot opt in        out       source               destination
    0     0 MASQUERADE   all  --  *         !docker0  172.17.0.0/16        0.0.0.0/0
  669 52735 LIBVIRT_PRT  all  --  *         *         0.0.0.0/0            0.0.0.0/0
    0     0 MASQUERADE   tcp  --  *         *         172.17.0.2           172.17.0.2           tcp dpt:80

Chain OUTPUT (policy ACCEPT 688 packets, 54549 bytes)
 pkts bytes target       prot opt in        out       source               destination
    0     0 DOCKER       all  --  *         *         0.0.0.0/0            !127.0.0.0/8         ADDRTYPE match dst-type LOCAL

Chain LIBVIRT_PRT (1 references)
 pkts bytes target       prot opt in        out       source               destination
   10   695 RETURN       all  --  *         *         192.168.122.0/24     224.0.0.0/24
    0     0 RETURN       all  --  *         *         192.168.122.0/24     255.255.255.255
    0     0 MASQUERADE   tcp  --  *         *         192.168.122.0/24     !192.168.122.0/24    masq ports: 1024-65535
    0     0 MASQUERADE   udp  --  *         *         192.168.122.0/24     !192.168.122.0/24    masq ports: 1024-65535
    0     0 MASQUERADE   all  --  *         *         192.168.122.0/24     !192.168.122.0/24

Chain DOCKER (2 references)
 pkts bytes target       prot opt in        out       source               destination
   72  4320 RETURN       all  --  docker0   *         0.0.0.0/0            0.0.0.0/0
    3   156 DNAT         tcp  --  !docker0  *         0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.2:80
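iptables rule analysis
Inbound traffic
- The PREROUTING chain jumps to a user-defined chain, DOCKER, for every packet whose destination is one of the host's own addresses (ADDRTYPE match dst-type LOCAL).
- The DOCKER chain contains a DNAT (destination address translation) rule: a TCP packet that arrives on any interface other than docker0 with destination port 8080 has its destination rewritten to 172.17.0.2:80.
- After the rewrite, the destination is a 172.17.0.0/16 address, so Conclusion 2 applies: the route for that network hands the packet to the docker0 bridge, and the external traffic enters the container.
Conclusion 3: external traffic addressed to the host's IP and published port undergoes destination address translation (DNAT) in the PREROUTING chain, which is what lets it enter the container.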
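The inbound path can be modelled in a few lines of Python. This is an illustrative sketch of the two DOCKER chain rules shown above, not a reimplementation of netfilter. The Packet class and function names are made up for the example, and the client address 192.168.56.1 and source port 51000 are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    in_iface: str
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


# Host addresses taken from the ifconfig output.
LOCAL_ADDRS = {"127.0.0.1", "10.0.2.15", "192.168.56.201", "172.17.0.1", "192.168.122.1"}


def docker_chain(pkt):
    """The two rules of the DOCKER chain in the nat table."""
    if pkt.in_iface == "docker0":
        return pkt  # rule 1: RETURN, packet left untouched
    if pkt.proto == "tcp" and pkt.dport == 8080:
        # rule 2: DNAT tcp dpt:8080 to:172.17.0.2:80
        return replace(pkt, dst="172.17.0.2", dport=80)
    return pkt


def prerouting(pkt, local_addrs):
    """PREROUTING jumps to DOCKER only when the destination is a local address."""
    if pkt.dst in local_addrs:
        return docker_chain(pkt)
    return pkt


# Hypothetical external client hitting the published port on enp0s8.
from_client = Packet("enp0s8", "tcp", "192.168.56.1", 51000, "192.168.56.201", 8080)
translated = prerouting(from_client, LOCAL_ADDRS)
print(translated.dst, translated.dport)
```

Only the destination changes; the source address and port of the client are left as they were, which is why nginx sees the real client address on this path.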
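Outbound traffic
- Traffic leaving the host must go through SNAT (source address translation) so that its source becomes the host's address.
- The rules below show dynamic SNAT, i.e. MASQUERADE: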
Chain POSTROUTING (policy ACCEPT 691 packets, 54705 bytes)
 pkts bytes target       prot opt in        out       source               destination
    0     0 MASQUERADE   all  --  *         !docker0  172.17.0.0/16        0.0.0.0/0
  669 52735 LIBVIRT_PRT  all  --  *         *         0.0.0.0/0            0.0.0.0/0
    0     0 MASQUERADE   tcp  --  *         *         172.17.0.2           172.17.0.2           tcp dpt:80
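Look at the first rule: packets whose source is in 172.17.0.0/16 and that leave through any interface other than docker0 get their source address translated. The outgoing packet then carries the host's IP and a host port as its source, not a container address from 172.17.0.0/16.
Conclusion 4: traffic that a container sends out is source-NATed (MASQUERADE) in the POSTROUTING chain, so the remote peer sees the host's address. The replies nginx sends to an external client belong to an already DNATed connection, and connection tracking reverses that DNAT on the way out. Either way the client is "fooled" into believing the packets came from the host itself.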
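The first POSTROUTING rule can be sketched the same way. The Python model below is a deliberate simplification: it keeps the original source port, whereas real MASQUERADE may pick a different one, and the conntrack dictionary only stands in for the kernel's connection tracking. The host addresses come from the ifconfig output; the container port 40000 is hypothetical.

```python
import ipaddress

DOCKER_NET = ipaddress.ip_network("172.17.0.0/16")
# Address of each outgoing host interface, from the ifconfig output.
HOST_ADDR = {"enp0s3": "10.0.2.15", "enp0s8": "192.168.56.201"}

# (host_ip, host_port) -> (container_ip, container_port); a stand-in for conntrack.
conntrack = {}


def postrouting(src_ip, src_port, out_iface):
    """MASQUERADE for sources in 172.17.0.0/16 leaving via any interface but docker0."""
    if out_iface == "docker0" or ipaddress.ip_address(src_ip) not in DOCKER_NET:
        return src_ip, src_port
    translated = (HOST_ADDR[out_iface], src_port)  # simplification: port kept as-is
    conntrack[translated] = (src_ip, src_port)
    return translated


def reverse(dst_ip, dst_port):
    """Map a reply addressed to the host back to the container that sent the request."""
    return conntrack.get((dst_ip, dst_port), (dst_ip, dst_port))


outgoing = postrouting("172.17.0.2", 40000, "enp0s3")  # hypothetical container port
print(outgoing)            # source now looks like the host
print(reverse(*outgoing))  # the reply is mapped back to the container
```

Traffic between containers leaves through docker0 and is therefore not translated, which matches the !docker0 condition in the rule.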