Skip to content

Commit 10301be

Browse files
authored
packrat ereport storage and snitch implementation (#2126)
Things fail. Not finding out about it sucks. This branch implements the Hubris side of the ereport ingestion system, as described [RFD 545]. Work on this was started by @cbiffle in #2002, which implemented the core ring-buffer data structure used to store ereports. Meanwhile, oxidecomputer/management-gateway-service#370, oxidecomputer/omicron#7803, and oxidecomputer/omicron#8296 added the MGS and Omicron components of this system. This branch picks up where Cliff left off, and "draws the rest of the owl" by implementing the aggregation of ereports in the `packrat` task using this data structure, and adding a new `snitch` task, which acts as a proxy to allow ereports stored by `packrat` to be read over the management network. ## Architecture Ereports are stored by `packrat` because we would like as many tasks as possible to be able to report errors by making IPC call to the task responsible for ereport storage. This means that the task aggregating ereports must be a high-priority task, so that as many other tasks as possible may be its clients. Additionally, we would like to include the system's VPD identity as metadata for ereports, and this data is already stored by packrat. Finally, we would like to minimize the likelihood of the task that stores ereports crashing, as this would result in data loss, and packrat already is expected not to crash. On the other hand, the task that actually evacuates these ereports over the management network must run at a priority lower than that of the `net` task, of which it is a client. Thus the separation of responsibilities between `packrat` and the `snitch`. The snitch task is fairly simple. It receives packets sent to the ereport socket, interprets the request message, and forwards the request to packrat. Any ereports sent back by packrat are sent in response to the request. The snitch ends up being a pretty dumb, stateless proxy: as the response packet is encoded by packrat; all we end up doing is taking the bytes received from packrat and stuffing them into the socket's send queue. The real purpose of this thing is just to serve as a trampoline between the high priority level of packrat and a priority level lower than that of the net task. ## `snitch-core` Fixes While testing behavior when the ereport buffer is full, I found a potential panic in the existing `snitch-core` code. Previously, every time ereports are read from the buffer while it is in the `Losing` state (i.e., ereports have been discarded because the buffer was full), `snitch-core` attempts to insert a new loss record at the end of the buffer (calling `recover_if_needed()`). This ensures that the data loss is reported to the reader ASAP. The problem is that this code assumed that there would always be space for an additional loss record, and panicked if it didn't fit. I added a test reproducing this panic in ff93754, and fixed it in 22044d1 by changing the calculation of whether recovery is possible. When `recover_if_needed` is called while in the `Losing` state, we call the `free_space()` method to determine whether we can recover. In the `Losing` state, [this method would calculate the free space by subtracting the space required for the loss record][1] that must be encoded to transition out of the `Losing` state. However, in the case where `recover_if_required()` is called with `required_space: None` (which indicates that we're not trying to recover because we want to insert a new record, but just because we want to report ongoing data loss to the caller), [we check that the free space is greater than or equal to 0][2]. This means that we would still try to insert a loss record even if the free space was 0, resulting in a panic. I've fixed this by moving the check that there's space for a loss record out of the calculation of `free_space()` and into the _required_ space, in addition to the requested value (which is 0 in the "we are inserting the loss record to report loss" case). This way, we only insert the loss record if it fits, which is the correct behavior. I've also changed the assignment of ENAs in `snitch-core` to start at 1, rather than 0, since ENA 0 is reserved in the wire protocol to indicate "no ENA". In the "committed ENA" request field this means "don't flush any ereports", and in the "start ENA" response field, ENA 0 means "no ereports in this packet". Thus, the ereport store must start assigning ENAs at ENA 1 for the initial loss record. ## Testing Currently, no tasks actually produce ereports. To test that everything works correctly, it was necessary to add a source of ereports, so I've added [a little task][3] that just generates test ereports when asked via `hiffy`. I've included some of that in [this comment][4]. This was also used for testing the data-loss behavior discussed above. [RFD 545]: https://rfd.shared.oxide.computer/rfd/0545 [1]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L110-L121 [2]: https://github.com/oxidecomputer/hubris/blob/e846b9d2481b13cf2b18a2a073bb49eef5f654de/lib/snitch-core/src/lib.rs#L297-L300 [3]: https://github.com/oxidecomputer/hubris/blob/864fa57a7c34a6225deddcffa0c7d54c3063eab6/task/ereportulator/src/main.rs
1 parent b35abba commit 10301be

File tree

31 files changed

+1843
-339
lines changed

31 files changed

+1843
-339
lines changed

Cargo.lock

Lines changed: 324 additions & 266 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,7 @@ leb128 = { version = "0.2.5", default-features = false }
9292
lpc55-pac = { version = "0.4", default-features = false }
9393
memchr = { version = "2.4", default-features = false }
9494
memoffset = { version = "0.6.5", default-features = false }
95+
minicbor = { version = "0.26.4", default-features = false }
9596
multimap = { version = "0.8.3", default-features = false }
9697
nb = { version = "1", default-features = false }
9798
num = { version = "0.4", default-features = false }
@@ -144,6 +145,7 @@ zip = { version = "0.6", default-features = false, features = ["bzip2", "deflate
144145
attest-data = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.4.0" }
145146
dice-mfg-msgs = { git = "https://github.com/oxidecomputer/dice-util", default-features = false, version = "0.2.1" }
146147
gateway-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false, features = ["smoltcp"] }
148+
gateway-ereport-messages = { git = "https://github.com/oxidecomputer/management-gateway-service", default-features = false }
147149
gimlet-inspector-protocol = { git = "https://github.com/oxidecomputer/gimlet-inspector-protocol", version = "0.1.0" }
148150
hif = { git = "https://github.com/oxidecomputer/hif", default-features = false }
149151
humpty = { git = "https://github.com/oxidecomputer/humpty", default-features = false, version = "0.1.3" }

app/cosmo/base.toml

Lines changed: 31 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,10 +122,20 @@ notifications = ["i2c1-irq", "i2c2-irq", "i2c3-irq", "i2c4-irq"]
122122
[tasks.packrat]
123123
name = "task-packrat"
124124
priority = 1
125+
stacksize = 1040
125126
start = true
126127
# task-slots is explicitly empty: packrat should not send IPCs!
127128
task-slots = []
128-
features = ["cosmo"]
129+
features = ["cosmo", "ereport"]
130+
131+
[tasks.rng_driver]
132+
features = ["h753", "ereport"]
133+
name = "drv-stm32h7-rng"
134+
priority = 6
135+
uses = ["rng"]
136+
start = true
137+
stacksize = 512
138+
task-slots = ["sys", "packrat"]
129139

130140
[tasks.thermal]
131141
name = "task-thermal"
@@ -347,6 +357,17 @@ extern-regions = ["sram1", "sram2", "sram3", "sram4"]
347357
notifications = ["socket"]
348358
features = ["net", "vlan"]
349359

360+
[tasks.snitch]
361+
name = "task-snitch"
362+
# The snitch should have a priority immediately below that of the net task,
363+
# to minimize the number of components that can starve it from resources.
364+
priority = 6
365+
stacksize = 1200
366+
start = true
367+
task-slots = ["net", "packrat"]
368+
features = ["vlan"]
369+
notifications = ["socket"]
370+
350371
[tasks.spd]
351372
name = "task-cosmo-spd"
352373
priority = 7
@@ -1456,6 +1477,15 @@ port = 11113
14561477
tx = { packets = 3, bytes = 1024 }
14571478
rx = { packets = 3, bytes = 1024 }
14581479

1480+
[config.net.sockets.ereport]
1481+
kind = "udp"
1482+
owner = {name = "snitch", notification = "socket"}
1483+
port = 57005
1484+
tx = { packets = 3, bytes = 1024 }
1485+
# v0 ereport requests are always 35B, so just make the buffer exactly
1486+
# that size...
1487+
rx = { packets = 3, bytes = 35 }
1488+
14591489
[config.auxflash]
14601490
memory-size = 33_554_432 # 256 Mib / 32 MiB
14611491
slot-count = 16 # 2 MiB slots

app/demo-stm32h7-nucleo/app-h753.toml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,6 @@ task-slots = ["sys"]
7171
[tasks.packrat]
7272
name = "task-packrat"
7373
priority = 2
74-
max-sizes = {flash = 8192, ram = 2048}
7574
start = true
7675
# task-slots is explicitly empty: packrat should not send IPCs!
7776
task-slots = []

app/gimlet/base.toml

Lines changed: 30 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -123,10 +123,11 @@ notifications = ["i2c1-irq", "jefe-state-change"]
123123
[tasks.packrat]
124124
name = "task-packrat"
125125
priority = 1
126+
stacksize = 1040
126127
start = true
127128
# task-slots is explicitly empty: packrat should not send IPCs!
128129
task-slots = []
129-
features = ["gimlet"]
130+
features = ["gimlet", "ereport"]
130131

131132
[tasks.thermal]
132133
name = "task-thermal"
@@ -194,6 +195,15 @@ interrupts = {"hash.irq" = "hash-irq"}
194195
task-slots = ["sys"]
195196
notifications = ["hash-irq"]
196197

198+
[tasks.rng_driver]
199+
features = ["h753", "ereport"]
200+
name = "drv-stm32h7-rng"
201+
priority = 6
202+
uses = ["rng"]
203+
start = true
204+
stacksize = 512
205+
task-slots = ["sys", "packrat"]
206+
197207
[tasks.hf]
198208
name = "drv-gimlet-hf-server"
199209
features = ["h753"]
@@ -336,6 +346,17 @@ extern-regions = ["sram1", "sram2", "sram3", "sram4"]
336346
notifications = ["socket"]
337347
features = ["net", "vlan"]
338348

349+
[tasks.snitch]
350+
name = "task-snitch"
351+
# The snitch should have a priority immediately below that of the net task,
352+
# to minimize the number of components that can starve it from resources.
353+
priority = 6
354+
stacksize = 1200
355+
start = true
356+
task-slots = ["net", "packrat"]
357+
features = ["vlan"]
358+
notifications = ["socket"]
359+
339360
[tasks.sbrmi]
340361
name = "drv-sbrmi"
341362
priority = 4
@@ -1315,3 +1336,11 @@ port = 23547
13151336
tx = { packets = 3, bytes = 1024 }
13161337
rx = { packets = 3, bytes = 512 }
13171338

1339+
[config.net.sockets.ereport]
1340+
kind = "udp"
1341+
owner = {name = "snitch", notification = "socket"}
1342+
port = 57005
1343+
tx = { packets = 3, bytes = 1024 }
1344+
# v0 ereport requests are always 35B, so just make the buffer exactly
1345+
# that size...
1346+
rx = { packets = 3, bytes = 35 }

app/gimletlet/app-ereportlet.toml

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# Gimletlet Ereport test application
2+
#
3+
# This image includes the `ereportulator` task, which may be used to generate
4+
# fake error reports to test the ereport aggregation and evacuation subsystem.
5+
#
6+
# Ereports may be generated using `humility hiffy` to call the
7+
# `Ereportulator.fake_ereport` IPC operation. This takes one argument, `n`,
8+
# which is an arbitrary `u32` value included in the ereport data payload. This
9+
# is intended to be used to differentiate between multiple ereports during
10+
# testing.
11+
#
12+
# For example:
13+
#
14+
# $ humility hiffy -t gimletlet hiffy -c Ereportulator.fake_ereport -a n=42
15+
#
16+
name = "gimletlet-ereportlet"
17+
inherit = "app.toml"
18+
19+
[tasks.jefe.config.allowed-callers]
20+
request_reset = ["hiffy"]
21+
22+
[tasks.hiffy]
23+
features = ["h753", "stm32h7", "i2c", "gpio", "spi"]
24+
task-slots = ["sys", "i2c_driver", "user_leds", "ereportulator"]
25+
26+
[tasks.ereportulator]
27+
name = "task-ereportulator"
28+
priority = 5
29+
start = true
30+
task-slots = ["packrat"]
31+
notifications = []

app/gimletlet/app.toml

Lines changed: 27 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -42,11 +42,12 @@ owner = {name = "sprot", notification = "rot_irq"}
4242

4343
[tasks.packrat]
4444
name = "task-packrat"
45-
priority = 3
46-
max-sizes = {flash = 8192, ram = 2048}
45+
priority = 1
4746
start = true
4847
# task-slots is explicitly empty: packrat should not send IPCs!
4948
task-slots = []
49+
stacksize = 1040
50+
features = ["ereport"]
5051

5152
[tasks.control_plane_agent]
5253
name = "task-control-plane-agent"
@@ -185,6 +186,21 @@ task-slots = ["net", "packrat"]
185186
features = ["vlan"]
186187
notifications = ["socket"]
187188

189+
[tasks.snitch]
190+
name = "task-snitch"
191+
# The snitch should have a priority immediately below that of the net task,
192+
# to minimize the number of components that can starve it from resources.
193+
priority = 4
194+
stacksize = 1200
195+
start = true
196+
task-slots = ["net", "packrat"]
197+
features = ["vlan"]
198+
notifications = ["socket"]
199+
200+
[tasks.rng_driver]
201+
features = ["h753", "ereport"]
202+
task-slots = ["sys", "user_leds", "packrat"]
203+
188204
# VLAN configuration
189205
[config.net.vlans.sidecar1]
190206
vid = 0x301
@@ -233,6 +249,15 @@ port = 11113
233249
tx = { packets = 3, bytes = 1024 }
234250
rx = { packets = 3, bytes = 1024 }
235251

252+
[config.net.sockets.ereport]
253+
kind = "udp"
254+
owner = {name = "snitch", notification = "socket"}
255+
port = 57005
256+
tx = { packets = 3, bytes = 1024 }
257+
# v0 ereport requests are always 35B, so just make the buffer exactly
258+
# that size...
259+
rx = { packets = 3, bytes = 35 }
260+
236261
[tasks.sprot]
237262
name = "drv-stm32h7-sprot-server"
238263
priority = 5

app/grapefruit/app-dev.toml

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,43 @@ inherit = "base.toml"
66
features = ["uart8"]
77
uses = ["uart8"]
88
interrupts = {"uart8.irq" = "usart-irq"}
9+
10+
# Ereport stuff
11+
[tasks.packrat]
12+
stacksize = 1040
13+
features = ["ereport"]
14+
15+
[tasks.snitch]
16+
name = "task-snitch"
17+
priority = 4
18+
stacksize = 1200
19+
start = true
20+
task-slots = ["net", "packrat"]
21+
features = ["vlan"]
22+
notifications = ["socket"]
23+
24+
[tasks.rng_driver]
25+
features = ["h753", "ereport"]
26+
name = "drv-stm32h7-rng"
27+
priority = 6
28+
uses = ["rng"]
29+
start = true
30+
stacksize = 512
31+
task-slots = ["sys", "packrat"]
32+
33+
# Demo/test task for ereports
34+
[tasks.ereportulator]
35+
name = "task-ereportulator"
36+
priority = 6
37+
start = true
38+
task-slots = ["packrat"]
39+
notifications = []
40+
41+
[config.net.sockets.ereport]
42+
kind = "udp"
43+
owner = {name = "snitch", notification = "socket"}
44+
port = 57005
45+
tx = { packets = 3, bytes = 1024 }
46+
# v0 ereport requests are always 35B, so just make the buffer exactly
47+
# that size...
48+
rx = { packets = 3, bytes = 35 }

app/grapefruit/base.toml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -108,8 +108,7 @@ notifications = ["timer"]
108108

109109
[tasks.packrat]
110110
name = "task-packrat"
111-
priority = 3
112-
max-sizes = {flash = 8192, ram = 2048}
111+
priority = 1
113112
start = true
114113
# task-slots is explicitly empty: packrat should not send IPCs!
115114
task-slots = []

app/psc/base.toml

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -90,6 +90,14 @@ owner = {name = "sequencer", notification = "psu_pwr_ok_6"}
9090

9191

9292

93+
[tasks.rng_driver]
94+
features = ["h753", "ereport"]
95+
name = "drv-stm32h7-rng"
96+
priority = 6
97+
uses = ["rng"]
98+
start = true
99+
stacksize = 512
100+
task-slots = ["sys", "packrat"]
93101

94102
[tasks.i2c_driver]
95103
name = "drv-stm32xx-i2c-server"
@@ -109,11 +117,12 @@ notifications = ["i2c2-irq", "i2c3-irq"]
109117

110118
[tasks.packrat]
111119
name = "task-packrat"
112-
priority = 3
113-
max-sizes = {flash = 8192, ram = 2048}
120+
priority = 1
121+
stacksize = 1040
114122
start = true
115123
# task-slots is explicitly empty: packrat should not send IPCs!
116124
task-slots = []
125+
features = ["ereport"]
117126

118127
[tasks.sequencer]
119128
name = "drv-psc-seq-server"
@@ -313,6 +322,17 @@ extern-regions = [ "sram1", "sram2", "sram3", "sram4" ]
313322
notifications = ["socket"]
314323
features = ["net", "vlan"]
315324

325+
[tasks.snitch]
326+
name = "task-snitch"
327+
# The snitch should have a priority immediately below that of the net task,
328+
# to minimize the number of components that can starve it from resources.
329+
priority = 5
330+
stacksize = 1200
331+
start = true
332+
task-slots = ["net", "packrat"]
333+
features = ["vlan"]
334+
notifications = ["socket"]
335+
316336
[tasks.idle]
317337
name = "task-idle"
318338
priority = 7
@@ -535,3 +555,12 @@ owner = {name = "dump_agent", notification = "socket"}
535555
port = 11113
536556
tx = { packets = 3, bytes = 1024 }
537557
rx = { packets = 3, bytes = 1024 }
558+
559+
[config.net.sockets.ereport]
560+
kind = "udp"
561+
owner = {name = "snitch", notification = "socket"}
562+
port = 57005
563+
tx = { packets = 3, bytes = 1024 }
564+
# v0 ereport requests are always 35B, so just make the buffer exactly
565+
# that size...
566+
rx = { packets = 3, bytes = 35 }

0 commit comments

Comments
 (0)