scaling: stability: 1000 containers show issues after approx 3 hours #807
Description of problem
Using the fast footprint test, I launched 1000 containers, and left them running overnight. By the morning, things had crashed.
Expected result
These are pretty benign containers (a busybox doing 'nothing'), and the system is pretty large and not resource constrained afaict (88 cores, 384GB of RAM). I'd expect the containers to stay up pretty much forever.
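For a concrete sense of what 'benign' means here, each container is essentially something like the following - the exact invocation lives in fast_footprint.sh, so treat this as a sketch:

# Roughly what each of the 1000 containers looks like: a busybox shell
# idling under the kata runtime. The exact command is in fast_footprint.sh.
docker run --runtime kata-runtime -dt busybox sh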
Actual result
Something has 'died', and it looks like the kata runtime has become non-functional.
The first time I ran this test, IIRC I ended up with 847 'live' containers in the morning. This time things crashed out completely. Logs below.
What did I run
For reference, I used this script to run the test and try to capture details upon death:
#!/bin/bash
# Set our basic premise - the exported variables are picked up by fast_footprint.sh
export NUM_CONTAINERS=1000
export MAX_MEMORY_CONSUMED="300*1024*1024*1024"
NAP_TIME=30
JOURNAL_ENTRIES=100
DMESG_ENTRIES=20

# Grab a system status snapshot - to stdout.
# $1: the name of the log file the snapshot is being redirected to.
snapshot() {
	local fname="$1"
	local journal="$(sudo journalctl -n ${JOURNAL_ENTRIES} --no-pager)"
	local dm="$(dmesg -H | tail -${DMESG_ENTRIES})"
	local dkr="$(docker ps -a)"
	local kata="$(kata-runtime kata-env)"

	echo "Sanity log file [${fname}]"
	echo "------------------------"
	echo "---- kata env"
	echo "${kata}"
	echo "---- dmesg tail"
	echo "${dm}"
	echo "---- docker ps -a"
	echo "${dkr}"
	echo "---- journal tail"
	echo "${journal}"
	echo "------------------------"
}

echo "fast footprint sanity test"
echo " NUM_CONTAINERS=${NUM_CONTAINERS}"
echo " MAX_MEMORY_CONSUMED=${MAX_MEMORY_CONSUMED}"
echo " NAP_TIME=${NAP_TIME}"
echo " JOURNAL_ENTRIES=${JOURNAL_ENTRIES}"
echo " DMESG_ENTRIES=${DMESG_ENTRIES}"

echo "Launching the containers"
# Launch the containers
bash ./fast_footprint.sh

# Take a status snapshot once they are all up
snapshot "sanity_launched.log" > sanity_launched.log

echo "entering sanity check loop"
# Poll every NAP_TIME seconds, and bail out (grabbing a snapshot) as soon
# as the number of live containers drops below what we launched.
while :; do
	sleep "${NAP_TIME}"
	ts=$(date -Iseconds)
	cnt=$(docker ps -q | wc -l)
	echo -n "[${ts}] of ${NUM_CONTAINERS} have ${cnt}"
	if [ "${cnt}" -eq "${NUM_CONTAINERS}" ]; then
		echo " OK"
	else
		echo " FAIL"
		snapshot "sanity_fail.log" > sanity_fail.log
		exit 1
	fi
done
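I ran that from the metrics/density directory of the tests repo (you can see the path in the shell prompt further down); roughly like this - the script name here is just whatever I saved the above as locally, so treat it as illustrative:

cd ${GOPATH}/src/github.com/kata-containers/tests/metrics/density
bash ./sanity_loop.sh | tee sanity_run.log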
What did I see
From the logs then...
Waiting for KSM to settle...
............................................................................................................................................................................................................................................................................................................Timed out after 300s waiting for KSM to settle
and dodge death (>> I have no idea where this comes from btw - I'll check some time ;-) <<)
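For context, the KSM 'settle' wait is, as far as I understand it, just polling the KSM counters in sysfs until they stop moving - something along these lines, sketched from memory rather than lifted from the test code:

# Wait (up to 300s) for /sys/kernel/mm/ksm/pages_shared to stop changing.
# Purely illustrative - the real loop lives in the footprint test code.
timeout=300
last=-1
while [ "${timeout}" -gt 0 ]; do
	now=$(cat /sys/kernel/mm/ksm/pages_shared)
	if [ "${now}" -eq "${last}" ]; then
		break
	fi
	last="${now}"
	echo -n "."
	sleep 5
	timeout=$((timeout - 5))
done
[ "${timeout}" -le 0 ] && echo "Timed out after 300s waiting for KSM to settle"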
entering sanity check loop
[2018-10-04T08:51:50-07:00] of 1000 have 1000 OK
... about 3 hours later ...
[2018-10-04T12:04:23-07:00] of 1000 have 1000 OK
runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7fe6f3fdb428 m=19 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908: 00007fe6f49be168 00007fe6d67fba68
00007fe6d67fb918: 00007fe6f47a1b1f 0000000000000003
00007fe6d67fb928: 00007fe6f49ae5f8 0000000000000005
00007fe6d67fb938: 000000000249b040 00007fe6bc0008c0
00007fe6d67fb948: 00000000000000f1 0000000000000011
00007fe6d67fb958: 0000000000000000 00000000019c71ff
00007fe6d67fb968: 00007fe6f47a6ac6 0000000000000005
00007fe6d67fb978: 0000000000000000 0000000100000000
00007fe6d67fb988: 00007fe6f3facde0 00007fe6d67fbb20
00007fe6d67fb998: 00007fe6f47ae923 00000000000000ff
00007fe6d67fb9a8: 0000000000000000 0000000000000000
00007fe6d67fb9b8: 0000000000000000 2525252525252525
00007fe6d67fb9c8: 2525252525252525 0000000000000000
00007fe6d67fb9d8: 00007fe6f436b700 00000000019c71ff
00007fe6d67fb9e8: 00007fe6bc0008c0 00000000000000f1
00007fe6d67fb9f8: 0000000000000011 0000000000000000
00007fe6d67fba08: <00007fe6f3fdd02a 0000000000000020
00007fe6d67fba18: 0000000000000000 0000000000000000
00007fe6d67fba28: 0000000000000000 0000000000000000
00007fe6d67fba38: 0000000000000000 0000000000000000
00007fe6d67fba48: 0000000000000000 0000000000000000
00007fe6d67fba58: 0000000000000000 0000000000000000
00007fe6d67fba68: 0000000000000000 0000000000000000
00007fe6d67fba78: 0000000000000000 0000000000000000
00007fe6d67fba88: 0000000000000000 0000000000000000
00007fe6d67fba98: 0000000000000000 0000000000000000
00007fe6d67fbaa8: 00007fe6f401ebff 00007fe6f436b540
00007fe6d67fbab8: 0000000000000001 00007fe6f436b5c3
00007fe6d67fbac8: 00000000000000f1 0000000000000011
00007fe6d67fbad8: 00007fe6f4020409 000000000000000a
00007fe6d67fbae8: 00007fe6f409d2dd 000000000000000a
00007fe6d67fbaf8: 00007fe6f436c770 0000000000000000
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908: 00007fe6f49be168 00007fe6d67fba68
... and more stack dumps....
I've attached the full log as 'death.log'.
Also, if I try to use the runtime to list how many containers are still running:
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ sudo kata-runtime list
runtime/cgo: pthread_create failed: No space left on device
Aborted (core dumped)
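That 'No space left on device' from pthread_create() is almost certainly not about disk - thread creation doesn't touch the filesystem - so my guess is some kernel task or mapping limit has been exhausted. When I'm next at the machine I'll check the usual suspects, e.g.:

ps -eLf --no-headers | wc -l     # total threads currently on the system
cat /proc/sys/kernel/threads-max # system-wide thread limit
cat /proc/sys/kernel/pid_max     # largest allowed pid/tid value
cat /proc/sys/vm/max_map_count   # per-process mmap limit (each thread stack needs a mapping)
ulimit -u                        # per-user task limit for this shell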
I've uploaded the output from the kata collect script as an attachment: collect.log