Skip to content

Conversation

@heywji
Copy link
Contributor

@heywji heywji commented Jul 8, 2025

This test case wants to add some checkpoints to verify the long initialization time of the NETKVM device driver (NDIS miniport).

ID: 2267

Signed-off-by: wji [email protected]

Summary by CodeRabbit

  • New Features

    • Introduced NetKVM FastInit validation for Windows VMs, including enable/disable FastInit per NIC, boot-time measurements, and threshold checks. Supports Driver Verifier toggling and WMI-based reporting.
  • Tests

    • Added comprehensive test coverage with variants for NIC counts, multiqueue settings, and queue sizes. Automates cold/hot boot flows, per-NIC FastInit synchronization, and metrics parsing. Provides detailed diagnostics and failure reporting when thresholds are exceeded.

@heywji heywji force-pushed the add_fastinit_case2 branch 3 times, most recently from 98e220a to 7001b6b Compare July 8, 2025 07:20
@heywji
Copy link
Contributor Author

heywji commented Jul 8, 2025

Related to: #4175

@heywji
Copy link
Contributor Author

heywji commented Jul 8, 2025

Depends on: avocado-framework/avocado-vt#4160

@heywji heywji force-pushed the add_fastinit_case2 branch from 7001b6b to 5f2aed4 Compare July 14, 2025 02:32
@heywji heywji force-pushed the add_fastinit_case2 branch from 5f2aed4 to 6b065a6 Compare July 24, 2025 08:19
This test case wants to add some checkpoints to verify long
initialization time of NETKVM device driver (NDIS miniport).

Signed-off-by: Wenkang Ji <[email protected]>
@coderabbitai
Copy link

coderabbitai bot commented Sep 29, 2025

Walkthrough

Adds a new Windows-focused NetKVM FastInit test: introduces a qemu test config (netkvm_fastinit.cfg) defining variants, NIC counts/queues, thresholds, and verifier/WMI settings; and a Python test (netkvm_fastinit.py) executing cold boot, optional Driver Verifier toggle, per-NIC FastInit configuration, WMI log parsing, and timing/threshold validation.

Changes

Cohort / File(s) Summary of Changes
Test Configuration
qemu/tests/cfg/netkvm_fastinit.cfg
New test config for Windows NetKVM FastInit. Sets host/type metadata, cdroms, backup/restore, timeouts, WMI path, nics=27, restricts to virtio_net, counts NICs via command, defines FastInit value wiring, and provides variant families for queue sizes (1x256, 4x1024) with per-variant thresholds and fastinit on/off sub-variants.
Functional Test
qemu/tests/netkvm_fastinit.py
New test implementing VM cold boot, optional Driver Verifier disable/enable with reboot, per-NIC FastInit check/set (primarily first NIC with propagation if mismatched), WMI dump and log parsing to validate InitTimeMs/LazyAllocTimeMs against thresholds, NIC Active count checks, and extensive logging/error handling. Adds run(test, params, env).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Tester as Avocado Test
  participant VM as Guest VM
  participant GuestOS as Windows (Guest)
  participant WMI as NetKVM WMI
  participant DV as Driver Verifier

  Tester->>VM: Cold boot (create/start)
  VM-->>Tester: Boot complete

  alt Driver Verifier toggle configured
    Tester->>GuestOS: Query DV state
    GuestOS->>DV: Enable/Disable as needed
    DV-->>GuestOS: Applied
    Tester->>VM: Reboot
    VM-->>Tester: Up with DV state verified
  end

  Tester->>GuestOS: Inspect NIC FastInit on NIC0
  alt Mismatch to desired FastInit
    GuestOS->>GuestOS: Set FastInit on NIC0 (and propagate to others if needed)
  end

  Tester->>WMI: Dump NetKVM log/config
  WMI-->>Tester: InitTimeMs, LazyAllocTimeMs, Active NICs

  Tester->>Tester: Validate thresholds and counts
  alt Violations found
    Tester-->>Tester: Fail test with error_context
  else Success
    Tester-->>Tester: Log metrics and pass
  end

  Tester->>VM: Optional hot reboot and re-validate (same steps)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I toggled FastInit with a twitch of my ear,
Counted 27 NICs—oh dear, oh dear!
WMI whispers times in milliseconds neat,
Queues align, thresholds meet.
I thump the log—no flakes, all right!
Burrowed through bytes, the metrics are tight. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title “netkvm: add fastinit regarding lazy memory allocation” concisely indicates the primary change of adding FastInit support for NetKVM’s lazy memory allocation and aligns with the modifications to the qemu test configurations and Python test logic described in the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
qemu/tests/cfg/netkvm_fastinit.cfg (1)

21-46: Ensure queue size parameters align with expected RX/TX symmetry.

For queue4_size1024 you set nics_param to rx_queue_size=1024,tx_queue_size=256, while nic_extra_params_nic1 mirrors the same asymmetry. If the intention is a symmetric 1024×1024 queue, adjust the TX size; otherwise, document why TX stays at 256 so future updates do not “fix” it inadvertently.

qemu/tests/netkvm_fastinit.py (3)

92-135: Track multiple lazy allocation breaches deterministically.

lazy_failures triggers test.error only after the second breach, but the loop keeps running, resulting in redundant errors for the same NIC on subsequent iterations. Collect offending NICs and emit a single consolidated failure once the loop completes to keep logs actionable.


197-201: Stop duplicating NIC extra parameters for nic1.

params["nic_extra_params_" + str(nics)] = nics_param overwrites the value for the first NIC each time through the loop. Use the loop index (nic%s already includes the number) to assign per-NIC parameters and leave nic_extra_params_nic1 untouched.


223-237: Log boot timing deltas with explicit units.

The three time.strftime calls show absolute timestamps, but the derived duration prints as HH:MM:SS without context. Append “(cold-boot duration)” or similar so automation parsing the log can differentiate between timestamps and elapsed time.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4fface and 6426d5a.

📒 Files selected for processing (2)
  • qemu/tests/cfg/netkvm_fastinit.cfg (1 hunks)
  • qemu/tests/netkvm_fastinit.py (1 hunks)

Comment on lines +14 to +17
nics_num = 27
start_vm = no
nics_num_checking_cmd = 'ipconfig | find /c "Ethernet adapter"'
fastinit_name = "FastInit"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the NIC count check resilient to localized guests.

Hard-coding 'ipconfig | find /c "Ethernet adapter"' will undercount adapters on non-English Windows builds and on systems where the interface name differs (e.g., “Carte Ethernet”). If this count gates the test, consider querying WMI (wmic nic where "Name like 'Red Hat VirtIO%'" get Name) or PowerShell (Get-NetAdapter) instead so the validation works across locales.

🤖 Prompt for AI Agents
In qemu/tests/cfg/netkvm_fastinit.cfg around lines 14 to 17, the
nics_num_checking_cmd currently uses a hard-coded, English-only ipconfig string
which will undercount adapters on localized Windows guests; replace this with a
locale-independent query (e.g., use WMI to filter by adapter Name matching the
VirtIO driver or use PowerShell Get-NetAdapter to enumerate adapters and count
them), update nics_num_checking_cmd to run that command and return the adapter
count, and optionally add a fallback path so the test still works if
PowerShell/WMI isn't available.

Comment on lines +39 to +76
enable_verifier = bool(params.get_numeric("enable_verifier", 1))
query_cmd = params.get("driver_verifier_query", "verifier /querysettings")
enable_cmd = params.get(
"driver_verifier_enable", "verifier /standard /flags netkvm.sys ndis.sys"
)
reset_cmd = params.get("driver_verifier_reset", "verifier /reset")

def verifier_is_on(session, query_cmd=query_cmd):
"""
Return True if Driver Verifier is currently enabled

Driver Verifier is considered *OFF* when the mask is
**0x00000000**; otherwise it is *ON*.
"""
output = session.cmd_output(query_cmd)
return "0x00000000" not in output

session = vm.wait_for_serial_login(timeout)
active = verifier_is_on(session)
if enable_verifier is not active:
if enable_verifier is True:
cmd = enable_cmd
else:
cmd = reset_cmd
session.cmd_status_output(cmd)
vm.reboot(method="shell", serial=True, timeout=timeout, session=session)
else:
return

session = vm.wait_for_serial_login(timeout)
test.log.info("Current Driver Verifier: %s", session.cmd_output(query_cmd))
active = verifier_is_on(session)
if enable_verifier is not active:
test.error("Driver Verifier state MISmatch after reboot")
else:
test.log.info("Driver Verifier state MATCH after reboot")
session.close()

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Refresh the verifier status helper to avoid stale sessions.

verifier_is_on() calls session.cmd_output() without retry/backoff, but you close the session at the end of disable_driver_verifier. If the guest reboots slower than expected, you risk using a session tied to the pre-reboot shell. Reacquire a fresh session (or reopen via vm.wait_for_serial_login) before checking the status post-reboot to avoid false mismatches.

Comment on lines +85 to +141
lines = log_text.strip().splitlines()
active_count = 0
lazy_failures = 0
expected_nics_count = params.get_numeric("nics_num", 27)
lazy_alloc_threshold = params.get_numeric("lazy_alloc_threshold", 0)
init_alloc_threshold = params.get_numeric("init_alloc_threshold", 0)

for i, ln in enumerate(lines):
if "Active=TRUE" in ln:
active_count += 1
start = i
end = min(len(lines), i + 11)
block = "\n".join(lines[start:end])

name_match = re.search(r"InstanceName=([^\n]+)", block)
name = (name_match or [None, "UNKNOWN"])[1].strip()
init = int((re.search(r"InitTimeMs=(-?\d+)", block) or [None, "0"])[1])
lazy = int(
(re.search(r"LazyAllocTimeMs=(-?\d+)", block) or [None, "0"])[1]
)
total = init + lazy

if lazy == -1:
test.error(
"test.fail: %s | LazyAllocTimeMs = -1 "
"(should not happen after wait)" % name
)
elif init > init_alloc_threshold:
test.error(
"test.fail: %s | InitTimeMs(%s) > %d"
% (name, init, init_alloc_threshold)
)
elif lazy > lazy_alloc_threshold:
lazy_failures += 1
if lazy_failures > 1:
test.error(
"test.fail: %s | LazyAllocTimeMs(%s) > %d"
% (name, lazy, lazy_alloc_threshold)
)
else:
test.log.warning(
"Warning (allowed): %s | LazyAllocTimeMs(%s) > %d",
name,
lazy,
lazy_alloc_threshold,
)
else:
test.log.info(
"[OK] %s | Init=%s Lazy=%s Sum=%s", name, init, lazy, total
)

# Verify the count of Active=TRUE NICs matches expected
if active_count != expected_nics_count:
test.error(
"Expected %s NICs with Active=TRUE, but found %s"
% (expected_nics_count, active_count)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Handle non-numeric timing fields defensively.

int(re.search(...)[1]) will raise when the WMI output emits blank strings or “N/A” (observed on older virtio builds). Wrap the parse in a try/except and surface a clear TestError pointing to the offending NIC so the test fails gracefully instead of crashing.

🤖 Prompt for AI Agents
In qemu/tests/netkvm_fastinit.py around lines 85 to 141, the direct int(...)
conversions for InitTimeMs and LazyAllocTimeMs will raise on blank/"N/A" WMI
output; wrap parsing in a small try/except that validates the matched string is
numeric before converting, and if conversion fails raise a clear
test.error/TestError that names the offending NIC and includes the raw value.
Compute total only after successful parses, and keep the existing threshold
logic; the except should show which field (InitTimeMs/LazyAllocTimeMs), the raw
value, and the NIC name so the test fails gracefully instead of crashing.

Comment on lines +162 to +192
def fastinit_nics_operations(session, vm, params, test, timeout):
"""
Check first NIC FastInit status only.

:param session: VM session info
:param vm: QEMU test object
:param params: Dictionary with the test parameters
:param test: QEMU test object
:param timeout: VM login time value
"""
error_context.context("Checking first NIC FastInit status...", test.log.info)
fastinit_value = params.get_numeric("fastinit_value", 1)
first_nic_num = 0 # Only check first NIC to save time

# Only verify the first NIC's fastinit status to save time
current_value = utils_net.get_netkvm_param_value(
vm, fastinit_name, nic_index=first_nic_num
)
test.log.info("NIC %d current FastInit value: %s", first_nic_num, current_value)

if str(current_value).strip() != str(fastinit_value):
for nic_num in range(0, nics_num - 1):
utils_net.set_netkvm_param_value(
vm, fastinit_name, fastinit_value, nic_index=nic_num
)
output = utils_net.get_netkvm_param_value(
vm, fastinit_name, nic_index=nic_num
)
test.log.info("NIC %d new FastInit value is %s", nic_num, output)

login_timeout = params.get_numeric("login_timeout", 3600)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid reusing the same session after toggling FastInit.

utils_net.set_netkvm_param_value() can reboot the guest when Driver Verifier is active. Because you do not re-login after each call, later commands may act on a dead session. Acquire a fresh session inside the loop (or exit early if a reboot is detected) to keep operations robust.

🤖 Prompt for AI Agents
In qemu/tests/netkvm_fastinit.py around lines 162 to 192, the loop toggles
FastInit with utils_net.set_netkvm_param_value() which can trigger a guest
reboot (e.g. when Driver Verifier is active) but the code continues to reuse the
original session; update the code to obtain a fresh session after each set
operation (or detect a reboot and re-login) before calling any further guest
commands: after calling set_netkvm_param_value() call the appropriate VM
login/wait method to re-establish session (or break/return if reboot is
detected) and replace the session variable used for subsequent
get_netkvm_param_value() and logging so you never operate on a dead session.

Comment on lines +197 to +201
for i in range(2, nics_num + 1):
nics = "nic%s" % i
params["nics"] = " ".join([params["nics"], nics])
params["nic_extra_params_" + str(nics)] = nics_param
params["start_vm"] = "yes"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against missing nics baseline before concatenation.

If params["nics"] is empty or undefined, the " ".join(...) call raises a KeyError. Initialize params.setdefault("nics", "nic1") (or similar) before the loop so the test config doesn’t have to predefine it.

🤖 Prompt for AI Agents
In qemu/tests/netkvm_fastinit.py around lines 197 to 201, the loop concatenates
onto params["nics"] without ensuring it exists; this can raise a KeyError if
params lacks a baseline. Before the loop, ensure a baseline by calling
params.setdefault("nics", "nic1") (or another sensible default) so subsequent "
".join operations are safe, and keep adding per-NIC extra params as currently
implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant