scaling: stability: 1000 containers show issues after approx 3 hours #807
Description of problem
Using the fast footprint test, I launched 1000 containers, and left them running overnight. By the morning, things had crashed.
Expected result
These are pretty benign containers (a busybox doing 'nothing'), and the system is pretty large and not resource constrained afaict (88 cores, 384GB of RAM). I'd expect the containers to stay up pretty much forever.
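For a concrete sense of what 'benign' means here, each container is essentially something like the following - the exact invocation lives in fast_footprint.sh, so treat this as a sketch:

# Roughly what each of the 1000 containers looks like: a busybox shell
# idling under the kata runtime. The exact command is in fast_footprint.sh.
docker run --runtime kata-runtime -dt busybox sh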
Actual result
Something has 'died', and it looks like the kata runtime has become non-functional.
The first time I ran this test, IIRC I ended up with 847 'live' containers in the morning. This time things crashed out completely. Logs below.
What did I run
For reference, I used this script to run the test and try to capture details upon death:
#!/bin/bash
# Set our basic premise - the exported variables are picked up by fast_footprint.sh
export NUM_CONTAINERS=1000
export MAX_MEMORY_CONSUMED="300*1024*1024*1024"
NAP_TIME=30
JOURNAL_ENTRIES=100
DMESG_ENTRIES=20

# Grab a system status snapshot - to stdout.
# $1: the name of the log file the snapshot is being redirected to.
snapshot() {
	local fname="$1"
	local journal="$(sudo journalctl -n ${JOURNAL_ENTRIES} --no-pager)"
	local dm="$(dmesg -H | tail -${DMESG_ENTRIES})"
	local dkr="$(docker ps -a)"
	local kata="$(kata-runtime kata-env)"

	echo "Sanity log file [${fname}]"
	echo "------------------------"
	echo "---- kata env"
	echo "${kata}"
	echo "---- dmesg tail"
	echo "${dm}"
	echo "---- docker ps -a"
	echo "${dkr}"
	echo "---- journal tail"
	echo "${journal}"
	echo "------------------------"
}

echo "fast footprint sanity test"
echo " NUM_CONTAINERS=${NUM_CONTAINERS}"
echo " MAX_MEMORY_CONSUMED=${MAX_MEMORY_CONSUMED}"
echo " NAP_TIME=${NAP_TIME}"
echo " JOURNAL_ENTRIES=${JOURNAL_ENTRIES}"
echo " DMESG_ENTRIES=${DMESG_ENTRIES}"

echo "Launching the containers"
# Launch the containers
bash ./fast_footprint.sh

# Take a status snapshot once they are all up
snapshot "sanity_launched.log" > sanity_launched.log

echo "entering sanity check loop"
# Poll every NAP_TIME seconds, and bail out (grabbing a snapshot) as soon
# as the number of live containers drops below what we launched.
while :; do
	sleep "${NAP_TIME}"
	ts=$(date -Iseconds)
	cnt=$(docker ps -q | wc -l)
	echo -n "[${ts}] of ${NUM_CONTAINERS} have ${cnt}"
	if [ "${cnt}" -eq "${NUM_CONTAINERS}" ]; then
		echo " OK"
	else
		echo " FAIL"
		snapshot "sanity_fail.log" > sanity_fail.log
		exit 1
	fi
done
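I ran that from the metrics/density directory of the tests repo (you can see the path in the shell prompt further down); roughly like this - the script name here is just whatever I saved the above as locally, so treat it as illustrative:

cd ${GOPATH}/src/github.com/kata-containers/tests/metrics/density
bash ./sanity_loop.sh | tee sanity_run.log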
What did I see
From the logs then...
Waiting for KSM to settle...
............................................................................................................................................................................................................................................................................................................Timed out after 300s waiting for KSM to settle
and dodge death (>> I have no idea where this comes from btw - I'll check some time ;-) <<)
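For context, the KSM 'settle' wait is, as far as I understand it, just polling the KSM counters in sysfs until they stop moving - something along these lines, sketched from memory rather than lifted from the test code:

# Wait (up to 300s) for /sys/kernel/mm/ksm/pages_shared to stop changing.
# Purely illustrative - the real loop lives in the footprint test code.
timeout=300
last=-1
while [ "${timeout}" -gt 0 ]; do
	now=$(cat /sys/kernel/mm/ksm/pages_shared)
	if [ "${now}" -eq "${last}" ]; then
		break
	fi
	last="${now}"
	echo -n "."
	sleep 5
	timeout=$((timeout - 5))
done
[ "${timeout}" -le 0 ] && echo "Timed out after 300s waiting for KSM to settle"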
entering sanity check loop
[2018-10-04T08:51:50-07:00] of 1000 have 1000 OK
... about 3 hours later ...
[2018-10-04T12:04:23-07:00] of 1000 have 1000 OK
runtime/cgo: pthread_create failed: No space left on device
SIGABRT: abort
PC=0x7fe6f3fdb428 m=19 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908: 00007fe6f49be168 00007fe6d67fba68
00007fe6d67fb918: 00007fe6f47a1b1f 0000000000000003
00007fe6d67fb928: 00007fe6f49ae5f8 0000000000000005
00007fe6d67fb938: 000000000249b040 00007fe6bc0008c0
00007fe6d67fb948: 00000000000000f1 0000000000000011
00007fe6d67fb958: 0000000000000000 00000000019c71ff
00007fe6d67fb968: 00007fe6f47a6ac6 0000000000000005
00007fe6d67fb978: 0000000000000000 0000000100000000
00007fe6d67fb988: 00007fe6f3facde0 00007fe6d67fbb20
00007fe6d67fb998: 00007fe6f47ae923 00000000000000ff
00007fe6d67fb9a8: 0000000000000000 0000000000000000
00007fe6d67fb9b8: 0000000000000000 2525252525252525
00007fe6d67fb9c8: 2525252525252525 0000000000000000
00007fe6d67fb9d8: 00007fe6f436b700 00000000019c71ff
00007fe6d67fb9e8: 00007fe6bc0008c0 00000000000000f1
00007fe6d67fb9f8: 0000000000000011 0000000000000000
00007fe6d67fba08: <00007fe6f3fdd02a 0000000000000020
00007fe6d67fba18: 0000000000000000 0000000000000000
00007fe6d67fba28: 0000000000000000 0000000000000000
00007fe6d67fba38: 0000000000000000 0000000000000000
00007fe6d67fba48: 0000000000000000 0000000000000000
00007fe6d67fba58: 0000000000000000 0000000000000000
00007fe6d67fba68: 0000000000000000 0000000000000000
00007fe6d67fba78: 0000000000000000 0000000000000000
00007fe6d67fba88: 0000000000000000 0000000000000000
00007fe6d67fba98: 0000000000000000 0000000000000000
00007fe6d67fbaa8: 00007fe6f401ebff 00007fe6f436b540
00007fe6d67fbab8: 0000000000000001 00007fe6f436b5c3
00007fe6d67fbac8: 00000000000000f1 0000000000000011
00007fe6d67fbad8: 00007fe6f4020409 000000000000000a
00007fe6d67fbae8: 00007fe6f409d2dd 000000000000000a
00007fe6d67fbaf8: 00007fe6f436c770 0000000000000000
runtime: unknown pc 0x7fe6f3fdb428
stack: frame={sp:0x7fe6d67fba08, fp:0x0} stack=[0x7fe6d5ffc2f0,0x7fe6d67fbef0)
00007fe6d67fb908: 00007fe6f49be168 00007fe6d67fba68
... and more stack dumps....
I've attached the full log as 'death.log'.
Also, if I try to use the runtime to list how many containers are still running:
gwhaley@clrw02:~/go/src/github.com/kata-containers/tests/metrics/density$ sudo kata-runtime list
runtime/cgo: pthread_create failed: No space left on device
Aborted (core dumped)
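That 'No space left on device' from pthread_create() is almost certainly not about disk - thread creation doesn't touch the filesystem - so my guess is some kernel task or mapping limit has been exhausted. When I'm next at the machine I'll check the usual suspects, e.g.:

ps -eLf --no-headers | wc -l     # total threads currently on the system
cat /proc/sys/kernel/threads-max # system-wide thread limit
cat /proc/sys/kernel/pid_max     # largest allowed pid/tid value
cat /proc/sys/vm/max_map_count   # per-process mmap limit (each thread stack needs a mapping)
ulimit -u                        # per-user task limit for this shell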
I've uploaded the output from the kata collect script as an attachment: collect.log