[ACUWT] Add indexes to article_course_user_wiki_timeslices#6888
Merged
gabina merged 4 commits intoJun 5, 2026
Conversation
`db/migrate/20260602000001_add_indexes_to_article_course_user_wiki_timeslices.rb`: Adds three indexes to the `article_course_user_wiki_timeslices` table (ACUWT): - **Unique index** on `(course_id, article_id, user_id, wiki_id, start, end)`: Enforces data integrity — mirrors the pattern of sibling tables (`course_user_wiki_timeslices` has a unique index on the equivalent 5-column key). Also directly speeds up the `find_or_create_by` call in `update_article_course_user_wiki_timeslices`, which fires for every revision processed. - **`(course_id, wiki_id)`**: Covers deletion queries in `TimesliceCleaner` (`delete_existing_article_course_user_wiki_timeslices`) and serves as a fast prefix for the many queries that filter by course and wiki before applying further conditions. The unique index's leading prefix already covers this access pattern partially, but a dedicated 2-column index avoids walking the wider B-tree for these lightweight lookups. - **`(course_id, user_id)`**: Covers deletion and cleanup queries in `UpdateTimeslicesCourseUser` and `TimesliceCleaner` that filter only by course and user, without wiki or time constraints. The choice of these three indexes (rather than the originally proposed `(course_id, article_id)`, `(course_id, wiki_id)`, `(course_id, user_id)`) was informed by analysis of actual query patterns across ACT, CUWT, and CWT population paths. The user asked for the migration after confirming through benchmarks that the ACUWT path is faster than the legacy path. The initial proposal from the user was three 2-column composite indexes. Before writing the migration, the actual query patterns were analyzed across all callers of ACUWT — models, services, and lib — to evaluate whether wider indexes would be more selective. That analysis suggested extending each index to include `wiki_id` and `start`, but the user chose to go with the simpler 2-column non-unique indexes plus the unique 6-column index for data integrity instead. The debugger commit at HEAD was reset (mixed) before committing to keep it as local working state without including it in the branch. Session: ~10 user messages, mostly short questions and confirmations. No test runs for this commit (it's a migration-only change). The disk-space tradeoff analysis was done analytically from column widths and estimated row counts rather than from production data. (Commit message written by Claude Code.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9c5879c to
66fa0f2
Compare
## Changes `app/models/article_course_user_wiki_timeslice.rb`: Adds `bulk_upsert_from_revisions` class method as a faster replacement for the per-(article, user) `find_or_create_by + save` loop. For a timeslice with N unique (article, user) pairs, the old path issued ~3N DB round-trips; the new path issues 1 query (one `upsert_all`). On duplicate key, only the stats columns are updated; the unique key columns and `tracked`/`created_at` are left untouched. Two private helpers split the computation to stay within RuboCop ABC and method-length limits: - `acuwt_records_from_revisions` — groups revisions, builds attribute hashes - `acuwt_revision_stats` — computes per-group stats (character_sum, references, new_article, first_revision, wikidata stats) - `acuwt_wikidata_stats` — wikidata-only wrapper around `UpdateWikidataStatsTimeslice#build_stats_from_revisions` `upsert_all` is called without `unique_by:` because MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` resolves conflicts via all unique constraints automatically; specifying `unique_by:` raises `Mysql2Adapter does not support :unique_by`. `app/services/update_course_wiki_timeslices.rb`: `update_article_course_user_wiki_timeslices_for_wiki` now does one `acuwt_timeslice_for` lookup (already used by the downstream ACT/CUWT/CWT methods) and delegates to `bulk_upsert_from_revisions`. The per-group loop and the repeated CWT timeslice lookup (previously N calls, one per group) are gone. `app/services/update_timeslices_course_user.rb`: `create_acuwt_records_for_timeslice` collapsed from a per-group loop to a single `bulk_upsert_from_revisions` call — the timeslice boundaries are already known from the `cwt` argument, so no lookup is needed. ## Process The user asked to improve the `acuwt_updated` step, which accounted for 4.2 minutes (16% of total) in a full course update benchmark. Analysis of the write path identified the per-row `find_or_create_by + save` loop as the bottleneck. Claude Code proposed `upsert_all` and handled the RuboCop ABC split. First attempt included `unique_by:` which raised `Mysql2Adapter does not support :unique_by` on MySQL; removing it fixed the issue. Benchmarks after the fix confirmed `acuwt_updated` dropped from 4.2m to 0.3m (16× speedup), reducing the full update from 42.7m to 35.6m. Session: extended back-and-forth (~30 user messages) covering index design, disk-space analysis, benchmark interpretation, and the upsert_all implementation. User provided benchmark logs and error messages; direction was terse (a few words to a sentence per message). One failure before green: the `unique_by:` MySQL error caught during a live test run. (Commit message written by Claude Code.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`app/models/article_course_timeslice.rb`: Adds `bulk_update_from_acuwt` class method as a faster replacement for the per-article `update_from_acuwt` loop. The old path issued ~9 DB round-trips per article (find_or_create_by + 6 aggregate queries + save); the new path loads all ACUWT records for the timeslice in one SELECT, aggregates stats in Ruby, and writes all ACT rows in one `upsert_all`. Two private helpers split the computation to stay within RuboCop limits: - `act_records_from_acuwt` — fetches and groups ACUWT by article_id, builds attribute hashes - `act_stats_from_acuwt` — computes per-article stats (revision_count, character_sum, references_count, user_ids, new_article, first_revision) `user_ids` is `serialize :user_ids, type: Array`; Rails 8.1's `upsert_all` handles the YAML serialization automatically through the attribute type system. `unique_by:` is omitted (MySQL does not support it — existing unique index on `(article_id, course_id, start, end)` handles conflict resolution via `ON DUPLICATE KEY UPDATE`). `app/services/update_course_wiki_timeslices.rb`: - `update_article_course_timeslices_from_acuwt_for_wiki`: replaces the per-article loop (N calls to `update_from_acuwt`) with one `bulk_update_from_acuwt` call. - `reaggregate_timeslice_from_acuwt`: same replacement for the ACT portion. The pre-fetched `acuwt` relation is removed; CUWT still uses a direct `pluck(:user_id)` query since that path has not been bulk-optimized yet. The user asked whether `act_updated` could be made faster after observing it at 5.6m in the ACUWT path vs 3.9m in the default path. Claude Code identified that the same per-record pattern fixed for `acuwt_updated` (via upsert_all) applied here too. Implementation was straightforward — one RuboCop offense (`> 0` → `.positive?`) was caught and auto-corrected. Benchmarks confirmed `act_updated` dropped from 5.6m to 0.1m (56×), bringing the ACUWT full update below the default path total (29.8m vs 33.9m). The reaggregation path also benefited: 20-CWT reaggregation dropped from 3.3m to 0.1m for add-user, and from 2.2m to 0.1m for remove-user, because each CWT's ACT step now does a single bulk upsert instead of N×9 queries. Session: extended (~50 user messages), following directly from the upsert_all-for-ACUWT work. Terse user direction (a few words per message). One RuboCop failure; green on first spec run. Benchmarks were run externally by the user and shared as log JSON. (Commit message written by Claude Code.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Changes `app/models/article_course_timeslice.rb`: Removes `update_from_acuwt` class method and `update_cache_from_acuwt` instance method, both superseded by `bulk_update_from_acuwt`. Neither is called anywhere in the hot path after the previous commit. `spec/models/article_course_timeslice_spec.rb`: Replaces the two dead describe blocks (`.update_from_acuwt` and `#update_cache_from_acuwt`) with a single `.bulk_update_from_acuwt` block. The fixture data is consolidated — user3 (revision_count: 0) is included in the shared before block so the user_ids exclusion behavior is tested alongside the aggregate field checks. `spec/services/update_course_wiki_timeslices_spec.rb`: Updates the allow/expect stub for `ArticleCourseTimeslice` from `update_from_acuwt(course, article_id, wiki, ...)` to `bulk_update_from_acuwt(course, wiki, ...)` to match the new call signature. ## Process Straightforward cleanup following the bulk_update_from_acuwt commit. Spec changes were immediate — one spec run, 30 examples, 0 failures. (Commit message written by Claude Code.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d97c5be
into
WikiEducationFoundation:article-course-user-wiki-timeslices
1 check passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Adds three indexes to
article_course_user_wiki_timeslicesand implementsbulk
upsert_allwrites for ACUWT and ACT rows.(course_id, article_id, user_id, wiki_id, start, end)for data integrity, plus
(course_id, wiki_id)and(course_id, user_id)fordeletion and lookup queries.
bulk_upsert_from_revisions): replaces the per-(article, user)find_or_create_by + saveloop with a singleupsert_all, reducing write overheadfrom N×3 queries to 2 per timeslice window.
bulk_update_from_acuwt): replaces the per-articlefind_or_create_by + N aggregate queries + saveloop with a single SELECT andupsert_all, reducing N×9 queries to 2 per timeslice window.Benchmarks
Small article scoped program on wikidata: Latinoamérica en Wikidata 2024 - Bolivia.
AI usage
Claude Code (Sonnet 4.6) was used to analyze query patterns across all callers
of the table to inform index choices, draft the bulk write implementations, and
write commit messages. The human directed all decisions on what to build and ran
all benchmarks.
This PR description was drafted using the
/prepare-prClaude Code skill.Screenshots
No UI changes.
Open questions and concerns
The
use_acuwt?flag must currently be set per-course viaflags[:use_acuwt]— there is no admin UI to toggle it. Enabling it for production courses requires
a console or direct DB update.
The CUWT reaggregation step still uses a per-user loop rather than
upsert_all.A bulk write optimization there would be a natural follow-up.