build: Harden flaky Aeron tests in CI#32242
Conversation
    -Dakka.cluster.assert=on \
    -Daeron.dir=/opt/volumes/media-driver \
    -Daeron.term.buffer.length=33554432 \
    clean ${{ matrix.command }}
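The flags above point Aeron's media-driver directory at a plain volume. A minimal shell sketch of picking the driver directory at job start, preferring /dev/shm when it has room (the 1 GB threshold and the /dev/shm/media-driver path are assumptions, not values from this PR):

```shell
# Pick an Aeron media-driver directory, preferring /dev/shm when it has
# enough free space. MIN_KB (1 GB) is an assumed threshold.
MIN_KB=$((1024 * 1024))

choose_aeron_dir() {
  # $1: available KB on /dev/shm, $2: fallback directory
  if [ "${1:-0}" -ge "$MIN_KB" ]; then
    echo "/dev/shm/media-driver"
  else
    echo "$2"
  fi
}

# Fourth column of `df -k` output is available space in KB
avail_kb=$(df -k /dev/shm 2>/dev/null | awk 'NR==2 {print $4}')
AERON_DIR=$(choose_aeron_dir "$avail_kb" /opt/volumes/media-driver)
echo "using -Daeron.dir=$AERON_DIR"
```

A memory-backed directory avoids disk I/O in the media driver's log buffers, which is why /dev/shm is preferred when it is big enough.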
This job is not in Kubernetes. It might have the same problem with a too-small /dev/shm. Let me try...
Plenty of space, no problem.
Filesystem Size Used Avail Use% Mounted on
/dev/root 84G 62G 22G 74% /
tmpfs 7.9G 172K 7.9G 1% /dev/shm
tmpfs 3.2G 1.1M 3.2G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sdb15 105M 6.1M 99M 6% /boot/efi
/dev/sda1 63G 4.1G 56G 7% /mnt
tmpfs 1.6G 12K 1.6G 1% /run/user/1001
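Given df output like the above, a fail-fast guard could verify /dev/shm before the tests start rather than failing mid-run; a sketch, with an assumed 512 MB threshold:

```shell
# Fail fast if /dev/shm is too small for Aeron log buffers, instead of
# failing mid-test with "insufficient usable storage". The 512 MB
# threshold is an assumption, not a value from this PR.
required_kb=$((512 * 1024))

shm_ok() {
  # $1: available KB reported by df for /dev/shm
  [ "${1:-0}" -ge "$required_kb" ]
}

avail_kb=$(df -k /dev/shm 2>/dev/null | awk 'NR==2 {print $4}')
if shm_ok "$avail_kb"; then
  echo "/dev/shm ok: ${avail_kb} KB available"
else
  echo "warning: /dev/shm too small: ${avail_kb:-0} KB available" >&2
fi
```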
.github/workflows/multi-node.yml
    gcloud config set compute/zone us-central1-c
    ./kubernetes/create-cluster-gke.sh "akka-artery-aeron-cluster-${GITHUB_RUN_ID}"
    gcloud container clusters get-credentials akka-artery-aeron-cluster-test --zone us-central1-c --project akka-team
    # ./kubernetes/create-cluster-gke.sh "akka-artery-aeron-cluster-test"
is this intentional? Not calling the script to create the cluster?
leftover from my testing, thanks
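The mismatch in the diff above (create uses ${GITHUB_RUN_ID}, get-credentials uses the hard-coded -test name) is easy to avoid by keeping the name in one variable; a sketch, guarded so it is a no-op where gcloud is absent:

```shell
# Keep the GKE cluster name in one place so the create and
# get-credentials steps cannot drift apart. Zone and project are taken
# from the snippet above; the "local" fallback is an assumption.
CLUSTER_NAME="akka-artery-aeron-cluster-${GITHUB_RUN_ID:-local}"

if command -v gcloud >/dev/null 2>&1; then
  gcloud config set compute/zone us-central1-c
  ./kubernetes/create-cluster-gke.sh "$CLUSTER_NAME"
  gcloud container clusters get-credentials "$CLUSTER_NAME" \
    --zone us-central1-c --project akka-team
fi
echo "cluster: $CLUSTER_NAME"
```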
Force-pushed 77334b7 to a50c9e5:

* increase /dev/shm and use that (by default)
* use default term buffer size
* increase cpu requests, shouldn't matter but corresponds to what we want to use, 2 pods per node
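Increasing /dev/shm for a pod is typically done with a memory-backed emptyDir mounted at /dev/shm, and explicit requests let the scheduler fit two pods per node. A sketch of such a manifest (image, names, and exact sizes are illustrative, not the actual manifest from this PR):

```shell
# Write an illustrative pod spec: a memory-backed emptyDir mounted at
# /dev/shm enlarges the shared-memory area, and explicit cpu/memory
# requests let two pods fit per node. Image and sizes are assumptions.
cat <<'EOF' > pod-shm.yaml
apiVersion: v1
kind: Pod
metadata:
  name: aeron-test-node
spec:
  containers:
    - name: node
      image: eclipse-temurin:17   # assumed image
      resources:
        requests:
          cpu: "2"                # two pods per 4-cpu node
          memory: 4Gi
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory            # tmpfs-backed shared memory
        sizeLimit: 1Gi            # matches the ~1G later verified with df -h
EOF
echo "wrote pod-shm.yaml"
```

Note that a `medium: Memory` emptyDir counts against the container's memory accounting, which is one reason to raise the memory request alongside it.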
Force-pushed a50c9e5 to bf1d4f0:

* more memory request
* separate Aeron run in another workflow to make such test failures more clear
There was an error: "insufficient usable storage for new log of ". I have increased it. I don't know if it accumulates when running all tests? It's supposed to delete the files on shutdown.
I separated the Aeron run into a separate workflow. I hope that shows up so I can trigger a manual run if I merge this?
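For the manual run to show up in the Actions tab, the separated workflow needs a workflow_dispatch trigger; a minimal sketch (filename, job name, and the sbt command are assumptions, not the actual workflow from this PR):

```shell
# A workflow with workflow_dispatch gets the manual "Run workflow"
# button in the Actions tab once it exists on the default branch.
mkdir -p .github/workflows
cat <<'EOF' > .github/workflows/aeron-tests.yml
name: Aeron multi-node tests
on:
  workflow_dispatch: {}   # enables manual triggering
jobs:
  aeron:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: sbt "akka-remote-tests/multi-jvm:test"   # assumed command
EOF
echo "wrote .github/workflows/aeron-tests.yml"
```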
A successful run: https://github.com/akka/akka/actions/runs/7015560435
This looks very promising. I have tried it in a GKE cluster and verified with df -h: /dev/shm was 64 MB and is now 1 GB. No more "Scheduled sending of heartbeat was delayed". This wasn't possible when we tried last time, see #30601.