Friday, June 30, 2017

KVM vhost performance tuning to enhance ADC VE throughput

As global telecom companies start adopting ADCs (Application Delivery Controllers, i.e. load balancers) in their OpenStack environments, it becomes important to achieve high throughput for ADC VE instances. Unlike an ADC hardware appliance, an ADC VE runs on the customer's commodity server hardware, typically with Red Hat or Ubuntu as the host OS and KVM as the hypervisor, so knowing the underlying technologies matters when tuning the hypervisor environment for best performance. Here is some hands-on experience tuning KVM vhost to achieve ideal throughput.

Lab equipment

Dell PowerEdge R710 (16 cores) + Intel 82599 10G NIC + 72G RAM
Dell PowerEdge R210 (8 cores) + Intel 82599 10G NIC + 32G RAM

Network setup:

 /external vlan|<------------->|eth1 <--->iperf client \
| Dell R710(ADC VE)           Dell R210               |
 \Internal vlan|<------------->|eth2 <--->iperf server /


Note: since I only have two physical servers, and the Dell R710 hosts the ADC VE, I have to use the Dell R210 as both the iperf client and the iperf server. I used Linux network namespaces to isolate the IP and routing spaces, so that an iperf client packet egresses physical NIC eth1, gets forwarded by the BIG-IP VE, and comes back in on physical NIC eth2 to be processed by the iperf server. Here is a simple bash script to set up the Linux network namespaces:



#!/usr/bin/env bash

set -x

NS1="ns1"
NS2="ns2"
DEV1="em1"
DEV2="em2"
IP1="10.1.72.62"
IP2="10.2.72.62"
NET1="10.1.0.0/16"
NET2="10.2.0.0/16"
GW1="10.1.72.1"
GW2="10.2.72.1"

if [[ $EUID -ne 0 ]]; then
    echo "You must be root to run this script"
    exit 1
fi

# Remove namespaces if they already exist.
ip netns del $NS1 &>/dev/null
ip netns del $NS2 &>/dev/null

# Create namespace
ip netns add $NS1
ip netns add $NS2

# Move the physical interfaces into their namespaces.
ip link set dev $DEV1 netns $NS1
ip link set dev $DEV2 netns $NS2

# Set up namespace IPs and routes.
ip netns exec $NS1 ip addr add $IP1/16 dev $DEV1
ip netns exec $NS1 ip link set $DEV1 up
ip netns exec $NS1 ip link set lo up
ip netns exec $NS1 ip route add $NET2 via $GW1 dev $DEV1

ip netns exec $NS2 ip addr add $IP2/16 dev $DEV2
ip netns exec $NS2 ip link set $DEV2 up
ip netns exec $NS2 ip link set lo up
ip netns exec $NS2 ip route add $NET1 via $GW2 dev $DEV2

# Enable IP-forwarding.
echo 1 > /proc/sys/net/ipv4/ip_forward

# To get an interactive shell inside a namespace:
#ip netns exec ${NS} /bin/bash --rcfile <(echo "PS1=\"${NS}> \"")
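
With the namespaces in place, the iperf server and client can each run inside its own namespace. A minimal sketch using the addresses from the script above (adjust the iperf path to your system):

# Start the iperf server inside ns2.
ip netns exec ns2 iperf -s &

# Run the iperf client inside ns1 against the server address (IP2).
ip netns exec ns1 iperf -c 10.2.72.62 -l 1024 -P 64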

On the ADC VE I set up a simple forwarding virtual server that just forwards the packets. This is the default throughput without any performance tuning:

ns1> /home/dpdk/iperf -c 10.2.72.62 -l 1024 -P 64
...............
................
[ 25]  0.0-10.2 sec  46.0 MBytes  37.9 Mbits/sec
[SUM]  0.0-10.2 sec  3.22 GBytes  2.72 Gbits/sec <======= 2.72 Gbits/sec

Here is what the top output of the vhost dataplane kernel threads for the ADC VE looks like while passing traffic:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                 P
23329 libvirt+  20   0 35.366g 0.030t  23396 S 262.5 43.4 153:31.10 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime m+  1
23332 root      20   0       0      0      0 R  17.9  0.0   1:35.98 [vhost-23329]                                                                                                           1
23336 root      20   0       0      0      0 R  17.9  0.0   1:18.20 [vhost-23329]


As you can see, only two vhost kernel threads show up, at 17.9% CPU each, which indicates vhost is not getting fully scheduled to pass data traffic for the guest. I have defined 4 tx/rx queue pairs for each macvtap on the physical 10G interfaces, with two macvtaps assigned to the ADC VE for the external and internal VLANs; ideally, 8 vhost kernel threads should show up in top, fully scheduled to pass traffic.
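
To quickly count the vhost threads and see which core each one last ran on, ps works without any top configuration:

# vhost kernel threads are named vhost-<qemu pid>; psr is the last-used CPU.
ps -eo pid,psr,pcpu,comm | grep vhost-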

The interface XML dump from the domain definition is below:

<interface type='bridge'>
      <mac address='52:54:00:55:47:05'/>
      <source bridge='br0'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='direct'>
      <mac address='52:54:00:f9:98:e9'/>
      <source dev='enp4s0f0' mode='vepa'/>
      <target dev='macvtap2'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <interface type='direct'>
      <mac address='52:54:00:4b:06:c4'/>
      <source dev='enp4s0f1' mode='vepa'/>
      <target dev='macvtap3'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <alias name='net2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </interface>
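
One thing to watch: queues='4' only creates the queue pairs on the host side. Inside the guest, virtio-net typically needs the extra queues enabled before it will use them. A sketch, assuming the two data interfaces show up as eth1 and eth2 in the guest (names will vary, and BIG-IP VE may manage this itself):

# Inside the guest: show available vs. currently enabled queues.
ethtool -l eth1

# Enable all 4 combined queue pairs on both data interfaces.
ethtool -L eth1 combined 4
ethtool -L eth2 combined 4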

vCPU pinning:

root@Dell710:~# virsh vcpupin bigip-virtio
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 2
   2: 4
   3: 6
   4: 8
   5: 10
   6: 12
   7: 14
   8: 2
   9: 4

vhost CPU pinning (via emulatorpin):

~#  virsh emulatorpin bigip-virtio
emulator: CPU Affinity
----------------------------------
       *: 0,2,4,6,8,10,12,14
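
Note that emulatorpin sets the affinity of the QEMU emulator threads, and in this setup the vhost kernel threads follow it. If you ever need to place an individual vhost thread by hand, taskset also works on the kernel thread PIDs shown in top, for example:

# Pin vhost thread 23332 (PID from the top output above) to CPU 1.
taskset -cp 1 23332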

NUMA topology:

# lscpu --parse=node,core,cpu
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# Node,Core,CPU
0,0,0
1,1,1
0,2,2
1,3,3
0,4,4
1,5,5
0,6,6
1,7,7
0,0,8
1,1,9
0,2,10
1,3,11
0,4,12
1,5,13
0,6,14
1,7,15

So the odd-numbered CPUs are on NUMA node 1 and the even-numbered CPUs are on NUMA node 0. The guest vCPUs are pinned to NUMA node 0, and vhost is pinned to NUMA node 0 as well, which should be good. So why the low throughput?
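
A quick cross-check of the node-to-CPU mapping, and of where the guest's memory actually lives (both commands from the numactl package):

# Show which CPUs and how much memory belong to each NUMA node.
numactl --hardware

# Show the per-node memory allocation of the qemu process (PID from top).
numastat -p 23329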

Let's try assigning vhost to the NUMA node 1 CPUs:

# virsh emulatorpin bigip-virtio 1,3,5,7,9,11,13,15


#  virsh emulatorpin bigip-virtio
emulator: CPU Affinity
----------------------------------
       *: 1,3,5,7,9,11,13,15
Now run the test again:
[SUM]  0.0-10.1 sec  10.1 GBytes  8.58 Gbits/sec <=========8.58G, big difference!!!


  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                  P
23344 libvirt+  20   0 35.350g 0.030t  23396 R 99.9 43.4  15:40.95 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+  6
23341 libvirt+  20   0 35.350g 0.030t  23396 R 99.9 43.4  17:39.58 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+  0
23346 libvirt+  20   0 35.350g 0.030t  23396 R 99.9 43.4  15:23.76 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+ 10
23347 libvirt+  20   0 35.350g 0.030t  23396 R 99.9 43.4  15:29.99 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+ 12
23345 libvirt+  20   0 35.350g 0.030t  23396 R 99.7 43.4  15:29.29 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+  8
23348 libvirt+  20   0 35.350g 0.030t  23396 R 99.7 43.4  15:42.95 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+ 14
23342 libvirt+  20   0 35.350g 0.030t  23396 R 98.7 43.4  14:58.66 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+  2
23343 libvirt+  20   0 35.350g 0.030t  23396 R 96.0 43.4  14:58.54 qemu-system-x86_64 -enable-kvm -name bigip-virtio -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 31357 -realtime ml+  4
23332 root      20   0       0      0      0 R 40.2  0.0   1:12.12 [vhost-23329]                                                                                                           15
23333 root      20   0       0      0      0 R 40.2  0.0   1:05.58 [vhost-23329]                                                                                                           13
23335 root      20   0       0      0      0 R 40.2  0.0   1:04.98 [vhost-23329]                                                                                                            3
23334 root      20   0       0      0      0 R 39.2  0.0   1:04.52 [vhost-23329]                                                                                                            1
23337 root      20   0       0      0      0 R 32.2  0.0   0:47.66 [vhost-23329]                                                                                                           11
23339 root      20   0       0      0      0 R 31.6  0.0   0:50.47 [vhost-23329]                                                                                                           15
23336 root      20   0       0      0      0 S 31.2  0.0   0:56.08 [vhost-23329]                                                                                                            5
23338 root      20   0       0      0      0 R 30.2  0.0   0:49.52 [vhost-23329] 


This tells us that something in the host kernel is keeping the NUMA node 0 CPUs busy, so the 8 vhost threads could not get scheduled enough to process the data traffic. My theory is that the physical NIC IRQs are spread across the even-numbered cores on NUMA node 0, and softirq load runs high on those cores, so the vhost kernel threads didn't get enough time to run there; assigning vhost to the idle cores on NUMA node 1 gives it enough CPU cycles to process the data packets.
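
One way to check this theory (the interface name is from this lab, and the IRQ number below is just a placeholder for whatever /proc/interrupts shows on your box):

# See which cores are servicing the 10G NIC's interrupts.
grep enp4s0f /proc/interrupts

# Inspect the affinity of one of those IRQs (replace 98 with a real number).
cat /proc/irq/98/smp_affinity_list

# Watch per-core softirq load; high %soft on the even cores would confirm it.
mpstat -P ALL 1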
      
 



