[ACUWT] Add indexes to article_course_user_wiki_timeslices by gabina · Pull Request #6888 · WikiEducationFoundation/WikiEduDashboard

gabina · 2026-06-04T13:57:47Z

What this PR does

Adds three indexes to article_course_user_wiki_timeslices and implements
bulk upsert_all writes for ACUWT and ACT rows.

- Indexes: unique index on (course_id, article_id, user_id, wiki_id, start, end)
  for data integrity, plus (course_id, wiki_id) and (course_id, user_id) for
  deletion and lookup queries.
- ACUWT bulk write (bulk_upsert_from_revisions): replaces the per-(article, user)
  find_or_create_by + save loop with a single upsert_all, reducing write overhead
  from N×3 queries to 2 per timeslice window.
- ACT bulk write (bulk_update_from_acuwt): replaces the per-article
  find_or_create_by + N aggregate queries + save loop with a single SELECT and
  upsert_all, reducing N×9 queries to 2 per timeslice window.

Benchmarks

Small article-scoped program on Wikidata: Latinoamérica en Wikidata 2024 - Bolivia.

Step	ACUWT path (s)	default path (s)	difference (s)
wikidata_stats_fetched	1708.7	1738.8
revisions_fetched	31.9	32.9
acuwt_updated	14.5	0	14.5
scores_fetched	10.8	11
cuwt_updated	4.3	4.2
uploads_imported	4	3.9
act_updated	3.7	234.6	-230.9
articles_courses_updated	3.6	3.6
cwt_updated	2.3	3.6
timeslices_recreated	1.3	1.4
courses_users_updated	0.2	0.2
course_cache_updated	0.1	0.1
article_namespaces_updated	0.1	0
wikidata_stats_updated	0	0
average_pageviews_updated	0	0
categories_updated	0	0
reaggregation	0	0
wiki_namespace_stats_updated	0	0
timeslices_processed_3	0	0
Total	1785.5	2034.3	-248.8

Step	ACUWT path (s)	default path (s)	difference (s)
wikidata_stats_fetched	45.5	949.2	-903.7
revisions_fetched	0.8	19	-18.2
acuwt_updated	0.3	0	0.3
scores_fetched	0	6.7	-6.7
cuwt_updated	0.1	2.2	-2.1
uploads_imported	2.1	2.1
act_updated	0.1	42.8	-42.7
articles_courses_updated	2.7	0	2.7
cwt_updated	0	2.2	-2.2
timeslices_recreated	0	0
courses_users_updated	0.1	0.1
course_cache_updated	0.1	0.1
article_namespaces_updated	0	0
wikidata_stats_updated	0	0
average_pageviews_updated	26.5	27.1
categories_updated	0	0
reaggregation	5	0	5
wiki_namespace_stats_updated	0	0
timeslices_processed_3	0	0
removed_user_cwt_marked	0.1	0.1
Total	83.4	1051.6	-968.2

Step	ACUWT path (s)	default path (s)	difference (s)
new_user_revisions_fetched	394		394
wikidata_stats_fetched	70.9	1337	-1266.1
revisions_fetched	1	22.6	-21.6
acuwt_updated	0.4		0.4
scores_fetched	0	6.8	-6.8
cuwt_updated	0.1	1.9	-1.8
uploads_imported	3.8	3.9
act_updated	0.1	84.9	-84.8
articles_courses_updated	3	0.1	2.9
cwt_updated	0.1	2.5	-2.4
timeslices_recreated	0
courses_users_updated	0.2	0.2
course_cache_updated	0.1	0.1
article_namespaces_updated	0
wikidata_stats_updated	0
average_pageviews_updated	12.7	22.2
categories_updated	0
reaggregation	6.1		6.1
wiki_namespace_stats_updated	0
timeslices_processed_3	0
new_user_acuwt_written	3
timeslices_course_user_updated	0	1.7	-1.7
Total	101.5	1483.9	-1382.4
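
As a cross-check on the first table, its `act_updated` and `Total` rows can be converted to minutes and differenced. The figures below are copied from that table; the constant and helper names are illustrative only and not part of the PR.

```ruby
# Figures copied from the first benchmark table (seconds).
# FIRST_TABLE, minutes and difference are illustrative names.
FIRST_TABLE = {
  'act_updated' => { acuwt: 3.7, default: 234.6 },
  'Total'       => { acuwt: 1785.5, default: 2034.3 }
}.freeze

# Converts seconds to minutes, rounded to one decimal place.
def minutes(seconds)
  (seconds / 60.0).round(1)
end

# ACUWT path minus default path, in seconds (negative means ACUWT is faster).
def difference(row)
  (row[:acuwt] - row[:default]).round(1)
end

FIRST_TABLE.each do |step, row|
  puts format('%-12s %6.1f min vs %6.1f min (difference %.1f s)',
              step, minutes(row[:acuwt]), minutes(row[:default]), difference(row))
end
```

The converted totals line up with the 29.8m vs 33.9m figures quoted in the commit message for the ACT bulk write.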

AI usage

Claude Code (Sonnet 4.6) was used to analyze query patterns across all callers
of the table to inform index choices, draft the bulk write implementations, and
write commit messages. The human directed all decisions on what to build and ran
all benchmarks.

This PR description was drafted using the /prepare-pr Claude Code skill.

Screenshots

No UI changes.

Open questions and concerns

The use_acuwt? flag must currently be set per-course via flags[:use_acuwt]
— there is no admin UI to toggle it. Enabling it for production courses requires
a console or direct DB update.

The CUWT reaggregation step still uses a per-user loop rather than upsert_all.
A bulk write optimization there would be a natural follow-up.

`db/migrate/20260602000001_add_indexes_to_article_course_user_wiki_timeslices.rb`: Adds three indexes to the `article_course_user_wiki_timeslices` table (ACUWT):

- **Unique index** on `(course_id, article_id, user_id, wiki_id, start, end)`: Enforces data integrity and mirrors the pattern of sibling tables (`course_user_wiki_timeslices` has a unique index on the equivalent 5-column key). Also directly speeds up the `find_or_create_by` call in `update_article_course_user_wiki_timeslices`, which fires for every revision processed.
- **`(course_id, wiki_id)`**: Covers deletion queries in `TimesliceCleaner` (`delete_existing_article_course_user_wiki_timeslices`) and serves as a fast prefix for the many queries that filter by course and wiki before applying further conditions. The unique index covers this access pattern only partially, through its leading `course_id` column, so a dedicated 2-column index avoids walking the wider B-tree for these lightweight lookups.
- **`(course_id, user_id)`**: Covers deletion and cleanup queries in `UpdateTimeslicesCourseUser` and `TimesliceCleaner` that filter only by course and user, without wiki or time constraints.

The choice of these three indexes (rather than the originally proposed `(course_id, article_id)`, `(course_id, wiki_id)`, `(course_id, user_id)`) was informed by analysis of actual query patterns across ACT, CUWT, and CWT population paths.

The user asked for the migration after confirming through benchmarks that the ACUWT path is faster than the legacy path. The initial proposal from the user was three 2-column composite indexes. Before writing the migration, the actual query patterns were analyzed across all callers of ACUWT (models, services, and lib) to evaluate whether wider indexes would be more selective. That analysis suggested extending each index to include `wiki_id` and `start`, but the user chose to go with the simpler 2-column non-unique indexes plus the unique 6-column index for data integrity instead.

The debugger commit at HEAD was reset (mixed) before committing to keep it as local working state without including it in the branch.

Session: ~10 user messages, mostly short questions and confirmations. No test runs for this commit (it's a migration-only change). The disk-space tradeoff analysis was done analytically from column widths and estimated row counts rather than from production data.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
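
The leftmost-prefix reasoning behind the three indexes can be sketched in plain Ruby. This is a deliberately simplified model (equality filters only, no optimizer cost estimates), and `ACUWT_INDEXES`, `usable_prefix`, and `best_index` are hypothetical names rather than code from the migration; only the column lists come from the commit message.

```ruby
# Simplified model of B-tree index usability: an index helps an equality
# filter only through its unbroken leading run of filtered columns.
# Names are hypothetical; column lists are taken from the commit message.
ACUWT_INDEXES = {
  unique_key:  %i[course_id article_id user_id wiki_id start end],
  course_wiki: %i[course_id wiki_id],
  course_user: %i[course_id user_id]
}.freeze

# Returns the leading index columns that the filter can make use of.
def usable_prefix(index_columns, filter_columns)
  index_columns.take_while { |column| filter_columns.include?(column) }
end

# Picks the index whose usable prefix is longest for the given filter.
def best_index(filter_columns)
  ACUWT_INDEXES.max_by { |_name, columns| usable_prefix(columns, filter_columns).size }.first
end

filter = %i[course_id wiki_id]
ACUWT_INDEXES.each do |name, columns|
  puts "#{name}: #{usable_prefix(columns, filter).inspect}"
end
puts "best for #{filter.inspect}: #{best_index(filter)}"
```

Under this model, a filter on `course_id` and `wiki_id` can use only the first column of the 6-column unique index, because `article_id` sits between them, while the dedicated 2-column index matches both columns.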

## Changes

`app/models/article_course_user_wiki_timeslice.rb`: Adds `bulk_upsert_from_revisions` class method as a faster replacement for the per-(article, user) `find_or_create_by + save` loop. For a timeslice with N unique (article, user) pairs, the old path issued ~3N DB round-trips; the new path issues 1 query (one `upsert_all`). On duplicate key, only the stats columns are updated; the unique key columns and `tracked`/`created_at` are left untouched.

Three private helpers split the computation to stay within RuboCop ABC and method-length limits:

- `acuwt_records_from_revisions`: groups revisions, builds attribute hashes
- `acuwt_revision_stats`: computes per-group stats (character_sum, references, new_article, first_revision, wikidata stats)
- `acuwt_wikidata_stats`: wikidata-only wrapper around `UpdateWikidataStatsTimeslice#build_stats_from_revisions`

`upsert_all` is called without `unique_by:` because MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` resolves conflicts via all unique constraints automatically; specifying `unique_by:` raises `Mysql2Adapter does not support :unique_by`.

`app/services/update_course_wiki_timeslices.rb`: `update_article_course_user_wiki_timeslices_for_wiki` now does one `acuwt_timeslice_for` lookup (already used by the downstream ACT/CUWT/CWT methods) and delegates to `bulk_upsert_from_revisions`. The per-group loop and the repeated CWT timeslice lookup (previously N calls, one per group) are gone.

`app/services/update_timeslices_course_user.rb`: `create_acuwt_records_for_timeslice` collapsed from a per-group loop to a single `bulk_upsert_from_revisions` call. The timeslice boundaries are already known from the `cwt` argument, so no lookup is needed.

## Process

The user asked to improve the `acuwt_updated` step, which accounted for 4.2 minutes (16% of total) in a full course update benchmark. Analysis of the write path identified the per-row `find_or_create_by + save` loop as the bottleneck.

Claude Code proposed `upsert_all` and handled the RuboCop ABC split. The first attempt included `unique_by:`, which raised `Mysql2Adapter does not support :unique_by` on MySQL; removing it fixed the issue. Benchmarks after the fix confirmed `acuwt_updated` dropped from 4.2m to 0.3m (16× speedup), reducing the full update from 42.7m to 35.6m.

Session: extended back-and-forth (~30 user messages) covering index design, disk-space analysis, benchmark interpretation, and the upsert_all implementation. User provided benchmark logs and error messages; direction was terse (a few words to a sentence per message). One failure before green: the `unique_by:` MySQL error caught during a live test run.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
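
A minimal in-memory sketch of the write pattern described in this commit, assuming revisions arrive as plain hashes. It groups revisions per (article, user), builds one attribute hash per group, and mimics the `ON DUPLICATE KEY UPDATE` behavior by overwriting only the stats columns when the unique key already exists. The revision fields, `STATS_COLUMNS`, `records_from_revisions`, and `simulate_upsert` are hypothetical names; the PR's method computes more stats and hands the records to `upsert_all` instead of a Ruby hash.

```ruby
# In-memory sketch, not the PR's implementation. Field and helper names
# are assumptions made for illustration.
KEY_COLUMNS   = %i[course_id article_id user_id wiki_id start end].freeze
STATS_COLUMNS = %i[revision_count character_sum].freeze

# Builds one attribute hash per (article, user) group of revisions.
def records_from_revisions(revisions, course_id:, wiki_id:, start:, finish:)
  grouped = revisions.group_by { |rev| [rev[:article_id], rev[:user_id]] }
  grouped.map do |(article_id, user_id), group|
    {
      course_id: course_id, article_id: article_id, user_id: user_id,
      wiki_id: wiki_id, start: start, end: finish,
      revision_count: group.size,
      character_sum: group.sum { |rev| rev[:characters] },
      tracked: true
    }
  end
end

# Mimics INSERT ... ON DUPLICATE KEY UPDATE against a hash keyed by the
# unique key: new keys are inserted whole, existing keys only get new stats.
def simulate_upsert(table, records)
  records.each do |record|
    key = record.values_at(*KEY_COLUMNS)
    if table.key?(key)
      table[key].merge!(record.slice(*STATS_COLUMNS))
    else
      table[key] = record.dup
    end
  end
  table
end

window = { course_id: 7, wiki_id: 3, start: '2024-01-01', finish: '2024-01-02' }
revisions = [
  { article_id: 1, user_id: 10, characters: 120 },
  { article_id: 1, user_id: 10, characters: 30 },
  { article_id: 2, user_id: 11, characters: 50 }
]
table = simulate_upsert({}, records_from_revisions(revisions, **window))
puts "rows after first write: #{table.size}"
```

Writing the same window a second time updates `revision_count` and `character_sum` on the existing rows and leaves every other column, such as `tracked`, as it was.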

`app/models/article_course_timeslice.rb`: Adds `bulk_update_from_acuwt` class method as a faster replacement for the per-article `update_from_acuwt` loop. The old path issued ~9 DB round-trips per article (find_or_create_by + 6 aggregate queries + save); the new path loads all ACUWT records for the timeslice in one SELECT, aggregates stats in Ruby, and writes all ACT rows in one `upsert_all`.

Two private helpers split the computation to stay within RuboCop limits:

- `act_records_from_acuwt`: fetches and groups ACUWT by article_id, builds attribute hashes
- `act_stats_from_acuwt`: computes per-article stats (revision_count, character_sum, references_count, user_ids, new_article, first_revision)

`user_ids` is declared as `serialize :user_ids, type: Array`; Rails 8.1's `upsert_all` handles the YAML serialization automatically through the attribute type system. `unique_by:` is omitted (MySQL does not support it; the existing unique index on `(article_id, course_id, start, end)` handles conflict resolution via `ON DUPLICATE KEY UPDATE`).

`app/services/update_course_wiki_timeslices.rb`:

- `update_article_course_timeslices_from_acuwt_for_wiki`: replaces the per-article loop (N calls to `update_from_acuwt`) with one `bulk_update_from_acuwt` call.
- `reaggregate_timeslice_from_acuwt`: same replacement for the ACT portion. The pre-fetched `acuwt` relation is removed; CUWT still uses a direct `pluck(:user_id)` query since that path has not been bulk-optimized yet.

The user asked whether `act_updated` could be made faster after observing it at 5.6m in the ACUWT path vs 3.9m in the default path. Claude Code identified that the same per-record pattern fixed for `acuwt_updated` (via upsert_all) applied here too. Implementation was straightforward: one RuboCop offense (`> 0` → `.positive?`) was caught and auto-corrected. Benchmarks confirmed `act_updated` dropped from 5.6m to 0.1m (56×), bringing the ACUWT full update below the default path total (29.8m vs 33.9m).

The reaggregation path also benefited: 20-CWT reaggregation dropped from 3.3m to 0.1m for add-user, and from 2.2m to 0.1m for remove-user, because each CWT's ACT step now does a single bulk upsert instead of N×9 queries.

Session: extended (~50 user messages), following directly from the upsert_all-for-ACUWT work. Terse user direction (a few words per message). One RuboCop failure; green on first spec run. Benchmarks were run externally by the user and shared as log JSON.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
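
The in-Ruby aggregation step can be sketched as follows, assuming the ACUWT rows for one timeslice are already loaded as plain hashes. The field names follow the stats listed in this commit message, but the aggregation rules (sums for the counts, any-true for `new_article`, earliest value for `first_revision`, and leaving users with zero revisions out of `user_ids`) are assumptions inferred from the PR text, and `act_stats` and `act_records` are hypothetical names.

```ruby
# Sketch only: collapses per-(article, user) rows into one row per article.
# Aggregation rules are assumptions, not copied from the PR.
def act_stats(rows)
  active = rows.select { |row| row[:revision_count].positive? }
  {
    revision_count: rows.sum { |row| row[:revision_count] },
    character_sum: rows.sum { |row| row[:character_sum] },
    references_count: rows.sum { |row| row[:references_count] },
    user_ids: active.map { |row| row[:user_id] }.uniq.sort,
    new_article: rows.any? { |row| row[:new_article] },
    # ISO 8601 strings sort chronologically, so min picks the earliest.
    first_revision: rows.map { |row| row[:first_revision] }.compact.min
  }
end

# Groups rows by article and builds one attribute hash per article.
def act_records(acuwt_rows)
  acuwt_rows.group_by { |row| row[:article_id] }.map do |article_id, rows|
    { article_id: article_id }.merge(act_stats(rows))
  end
end

acuwt_rows = [
  { article_id: 1, user_id: 10, revision_count: 2, character_sum: 150,
    references_count: 1, new_article: true, first_revision: '2024-01-01T10:00:00Z' },
  { article_id: 1, user_id: 11, revision_count: 1, character_sum: 40,
    references_count: 0, new_article: false, first_revision: '2024-01-01T12:00:00Z' },
  { article_id: 1, user_id: 12, revision_count: 0, character_sum: 0,
    references_count: 0, new_article: false, first_revision: nil }
]
puts act_records(acuwt_rows).inspect
```

The resulting array has one hash per article, which is the shape a single bulk write needs; in this sample the user with `revision_count: 0` contributes nothing to `user_ids`.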

## Changes

`app/models/article_course_timeslice.rb`: Removes `update_from_acuwt` class method and `update_cache_from_acuwt` instance method, both superseded by `bulk_update_from_acuwt`. Neither is called anywhere in the hot path after the previous commit.

`spec/models/article_course_timeslice_spec.rb`: Replaces the two dead describe blocks (`.update_from_acuwt` and `#update_cache_from_acuwt`) with a single `.bulk_update_from_acuwt` block. The fixture data is consolidated: user3 (revision_count: 0) is included in the shared before block so the user_ids exclusion behavior is tested alongside the aggregate field checks.

`spec/services/update_course_wiki_timeslices_spec.rb`: Updates the allow/expect stub for `ArticleCourseTimeslice` from `update_from_acuwt(course, article_id, wiki, ...)` to `bulk_update_from_acuwt(course, wiki, ...)` to match the new call signature.

## Process

Straightforward cleanup following the bulk_update_from_acuwt commit. Spec changes were immediate: one spec run, 30 examples, 0 failures.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gabina marked this pull request as draft June 4, 2026 13:57

gabina force-pushed the improve-performance-on-course-updates branch from 9c5879c to 66fa0f2 Compare June 4, 2026 14:01

gabina and others added 3 commits June 4, 2026 13:58

gabina changed the title ~~[WIP] Add indexes to article_course_user_wiki_timeslices~~ Add indexes to article_course_user_wiki_timeslices Jun 5, 2026

gabina marked this pull request as ready for review June 5, 2026 18:35

gabina changed the title ~~Add indexes to article_course_user_wiki_timeslices~~ [ACUWT] Add indexes to article_course_user_wiki_timeslices Jun 5, 2026

gabina merged commit d97c5be into WikiEducationFoundation:article-course-user-wiki-timeslices Jun 5, 2026
1 check passed

gabina mentioned this pull request Jun 15, 2026

[ACUWT] Consider using upsert_all for CUWT and CWT #6906

Open

gabina merged 4 commits into WikiEducationFoundation:article-course-user-wiki-timeslices from gabina:improve-performance-on-course-updates
