[ACUWT] Add indexes to article_course_user_wiki_timeslices #6888

Merged
gabina merged 4 commits into WikiEducationFoundation:article-course-user-wiki-timeslices from gabina:improve-performance-on-course-updates on Jun 5, 2026

@gabina gabina commented Jun 4, 2026

Member

What this PR does

Adds three indexes to article_course_user_wiki_timeslices and implements
bulk upsert_all writes for ACUWT and ACT rows.

  • Indexes: unique index on (course_id, article_id, user_id, wiki_id, start, end)
    for data integrity, plus (course_id, wiki_id) and (course_id, user_id) for
    deletion and lookup queries.
  • ACUWT bulk write (bulk_upsert_from_revisions): replaces the per-(article, user)
    find_or_create_by + save loop with a single upsert_all, reducing write overhead
    from N×3 queries to 2 per timeslice window.
  • ACT bulk write (bulk_update_from_acuwt): replaces the per-article
    find_or_create_by + N aggregate queries + save loop with a single SELECT and
    upsert_all, reducing N×9 queries to 2 per timeslice window.

Benchmarks

Small article-scoped program on Wikidata: Latinoamérica en Wikidata 2024 - Bolivia.

Full update

| Step | ACUWT path (s) | default path (s) | difference (s) |
| --- | --- | --- | --- |
| wikidata_stats_fetched | 1708.7 | 1738.8 | |
| revisions_fetched | 31.9 | 32.9 | |
| acuwt_updated | 14.5 | 0 | 14.5 |
| scores_fetched | 10.8 | 11 | |
| cuwt_updated | 4.3 | 4.2 | |
| uploads_imported | 4 | 3.9 | |
| act_updated | 3.7 | 234.6 | -230.9 |
| articles_courses_updated | 3.6 | 3.6 | |
| cwt_updated | 2.3 | 3.6 | |
| timeslices_recreated | 1.3 | 1.4 | |
| courses_users_updated | 0.2 | 0.2 | |
| course_cache_updated | 0.1 | 0.1 | |
| article_namespaces_updated | 0.1 | 0 | |
| wikidata_stats_updated | 0 | 0 | |
| average_pageviews_updated | 0 | 0 | |
| categories_updated | 0 | 0 | |
| reaggregation | 0 | 0 | |
| wiki_namespace_stats_updated | 0 | 0 | |
| timeslices_processed_3 | 0 | 0 | |
| Total | 1785.5 | 2034.3 | -248.8 |

Remove user

| Step | ACUWT path (s) | default path (s) | difference (s) |
| --- | --- | --- | --- |
| wikidata_stats_fetched | 45.5 | 949.2 | -903.7 |
| revisions_fetched | 0.8 | 19 | -18.2 |
| acuwt_updated | 0.3 | 0 | 0.3 |
| scores_fetched | 0 | 6.7 | -6.7 |
| cuwt_updated | 0.1 | 2.2 | -2.1 |
| uploads_imported | 2.1 | 2.1 | |
| act_updated | 0.1 | 42.8 | -42.7 |
| articles_courses_updated | 2.7 | 0 | 2.7 |
| cwt_updated | 0 | 2.2 | -2.2 |
| timeslices_recreated | 0 | 0 | |
| courses_users_updated | 0.1 | 0.1 | |
| course_cache_updated | 0.1 | 0.1 | |
| article_namespaces_updated | 0 | 0 | |
| wikidata_stats_updated | 0 | 0 | |
| average_pageviews_updated | 26.5 | 27.1 | |
| categories_updated | 0 | 0 | |
| reaggregation | 5 | 0 | 5 |
| wiki_namespace_stats_updated | 0 | 0 | |
| timeslices_processed_3 | 0 | 0 | |
| removed_user_cwt_marked | 0.1 | 0.1 | |
| Total | 83.4 | 1051.6 | -968.2 |

Add new user

| Step | ACUWT path (s) | default path (s) | difference (s) |
| --- | --- | --- | --- |
| new_user_revisions_fetched | 394 | | 394 |
| wikidata_stats_fetched | 70.9 | 1337 | -1266.1 |
| revisions_fetched | 1 | 22.6 | -21.6 |
| acuwt_updated | 0.4 | | 0.4 |
| scores_fetched | 0 | 6.8 | -6.8 |
| cuwt_updated | 0.1 | 1.9 | -1.8 |
| uploads_imported | 3.8 | 3.9 | |
| act_updated | 0.1 | 84.9 | -84.8 |
| articles_courses_updated | 3 | 0.1 | 2.9 |
| cwt_updated | 0.1 | 2.5 | -2.4 |
| timeslices_recreated | 0 | | |
| courses_users_updated | 0.2 | 0.2 | |
| course_cache_updated | 0.1 | 0.1 | |
| article_namespaces_updated | 0 | | |
| wikidata_stats_updated | 0 | | |
| average_pageviews_updated | 12.7 | 22.2 | |
| categories_updated | 0 | | |
| reaggregation | 6.1 | | 6.1 |
| wiki_namespace_stats_updated | 0 | | |
| timeslices_processed_3 | 0 | | |
| new_user_acuwt_written | 3 | | |
| timeslices_course_user_updated | 0 | 1.7 | -1.7 |
| Total | 101.5 | 1483.9 | -1382.4 |

AI usage

Claude Code (Sonnet 4.6) was used to analyze query patterns across all callers
of the table to inform index choices, draft the bulk write implementations, and
write commit messages. The human directed all decisions on what to build and ran
all benchmarks.

This PR description was drafted using the /prepare-pr Claude Code skill.

Screenshots

No UI changes.

Open questions and concerns

The use_acuwt? flag must currently be set per-course via flags[:use_acuwt]
— there is no admin UI to toggle it. Enabling it for production courses requires
a console or direct DB update.

The CUWT reaggregation step still uses a per-user loop rather than upsert_all.
A bulk write optimization there would be a natural follow-up.

@gabina gabina marked this pull request as draft June 4, 2026 13:57
`db/migrate/20260602000001_add_indexes_to_article_course_user_wiki_timeslices.rb`:
Adds three indexes to the `article_course_user_wiki_timeslices` table (ACUWT):

- **Unique index** on `(course_id, article_id, user_id, wiki_id, start, end)`:
  Enforces data integrity — mirrors the pattern of sibling tables
  (`course_user_wiki_timeslices` has a unique index on the equivalent 5-column
  key). Also directly speeds up the `find_or_create_by` call in
  `update_article_course_user_wiki_timeslices`, which fires for every revision
  processed.

- **`(course_id, wiki_id)`**: Covers deletion queries in `TimesliceCleaner`
  (`delete_existing_article_course_user_wiki_timeslices`) and serves as a fast
  prefix for the many queries that filter by course and wiki before applying
  further conditions. The unique index's leading prefix already covers this
  access pattern partially, but a dedicated 2-column index avoids walking the
  wider B-tree for these lightweight lookups.

- **`(course_id, user_id)`**: Covers deletion and cleanup queries in
  `UpdateTimeslicesCourseUser` and `TimesliceCleaner` that filter only by
  course and user, without wiki or time constraints.

The choice of these three indexes (rather than the originally proposed
`(course_id, article_id)`, `(course_id, wiki_id)`, `(course_id, user_id)`)
was informed by analysis of actual query patterns across ACT, CUWT, and CWT
population paths.
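
For orientation, a migration adding the three indexes described above might look roughly like the sketch below. This is not the migration file from this PR: the index names are hypothetical (chosen short for readability), and the `ActiveRecord::Migration[8.1]` version tag is an assumption based on the Rails 8.1 reference elsewhere in these commits.

```ruby
require 'active_record'

# Sketch only. Index names below are hypothetical, not taken from the PR.
class AddIndexesToArticleCourseUserWikiTimeslices < ActiveRecord::Migration[8.1]
  TABLE = :article_course_user_wiki_timeslices

  def change
    # Unique 6-column key: one row per (course, article, user, wiki, window).
    add_index TABLE, %i[course_id article_id user_id wiki_id start end],
              unique: true, name: 'index_acuwt_on_unique_key'

    # Narrow index for queries that filter by course and wiki.
    add_index TABLE, %i[course_id wiki_id], name: 'index_acuwt_on_course_and_wiki'

    # Narrow index for queries that filter by course and user.
    add_index TABLE, %i[course_id user_id], name: 'index_acuwt_on_course_and_user'
  end
end
```

Because `change` only calls `add_index`, Rails can reverse the migration automatically on rollback.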

The user asked for the migration after confirming through benchmarks that the
ACUWT path is faster than the legacy path. The initial proposal from the user
was three 2-column composite indexes. Before writing the migration, the actual
query patterns were analyzed across all callers of ACUWT — models, services,
and lib — to evaluate whether wider indexes would be more selective. That
analysis suggested extending each index to include `wiki_id` and `start`, but
the user chose to go with the simpler 2-column non-unique indexes plus the
unique 6-column index for data integrity instead. The debugger commit at HEAD
was reset (mixed) before committing to keep it as local working state without
including it in the branch.

Session: ~10 user messages, mostly short questions and confirmations. No test
runs for this commit (it's a migration-only change). The disk-space tradeoff
analysis was done analytically from column widths and estimated row counts
rather than from production data.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gabina gabina force-pushed the improve-performance-on-course-updates branch from 9c5879c to 66fa0f2 Compare June 4, 2026 14:01
gabina and others added 3 commits June 4, 2026 13:58
## Changes

`app/models/article_course_user_wiki_timeslice.rb`:
Adds `bulk_upsert_from_revisions` class method as a faster replacement for
the per-(article, user) `find_or_create_by + save` loop. For a timeslice with
N unique (article, user) pairs, the old path issued ~3N DB round-trips; the
new path issues 1 query (one `upsert_all`). On duplicate key, only the stats
columns are updated; the unique key columns and `tracked`/`created_at` are
left untouched. Three private helpers split the computation to stay within
RuboCop ABC and method-length limits:
- `acuwt_records_from_revisions` — groups revisions, builds attribute hashes
- `acuwt_revision_stats` — computes per-group stats (character_sum, references,
  new_article, first_revision, wikidata stats)
- `acuwt_wikidata_stats` — wikidata-only wrapper around
  `UpdateWikidataStatsTimeslice#build_stats_from_revisions`

`upsert_all` is called without `unique_by:` because MySQL's
`INSERT ... ON DUPLICATE KEY UPDATE` resolves conflicts via all unique
constraints automatically; specifying `unique_by:` raises
`Mysql2Adapter does not support :unique_by`.
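
As an illustration of the grouping step described above, the following plain-Ruby sketch builds one attribute hash per (article, user) pair. The `Revision` struct, its field names, and the simplified stats are hypothetical stand-ins rather than the model's real attributes, and the `upsert_all` call in the trailing comment is an assumption about how the result could be written, not the code from this commit.

```ruby
# Hypothetical stand-in for a revision record; field names are illustrative.
Revision = Struct.new(:article_id, :user_id, :characters, :references_added,
                      keyword_init: true)

# Groups revisions by (article_id, user_id) and builds one attribute hash per
# group, including the key columns that identify the timeslice window.
def acuwt_attribute_hashes(revisions, course_id:, wiki_id:, window_start:, window_end:)
  revisions.group_by { |rev| [rev.article_id, rev.user_id] }
           .map do |(article_id, user_id), group|
    {
      course_id: course_id, article_id: article_id,
      user_id: user_id, wiki_id: wiki_id,
      start: window_start, end: window_end,
      # Simplified stats; the real helper computes more fields.
      revision_count: group.size,
      character_sum: group.sum(&:characters),
      references_count: group.sum(&:references_added)
    }
  end
end

revisions = [
  Revision.new(article_id: 1, user_id: 10, characters: 120, references_added: 1),
  Revision.new(article_id: 1, user_id: 10, characters: 30, references_added: 0),
  Revision.new(article_id: 2, user_id: 11, characters: 55, references_added: 2)
]

records = acuwt_attribute_hashes(
  revisions,
  course_id: 7, wiki_id: 3,
  window_start: Time.utc(2024, 1, 1), window_end: Time.utc(2024, 1, 2)
)

# Inside the Rails app, a single write along these lines would replace the
# per-pair loop (assumption, shown as a comment because it needs the model):
#   ArticleCourseUserWikiTimeslice.upsert_all(
#     records, update_only: %i[revision_count character_sum references_count]
#   )
```

Three revisions collapse into two hashes here, so the number of write statements no longer grows with the number of (article, user) pairs.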

`app/services/update_course_wiki_timeslices.rb`:
`update_article_course_user_wiki_timeslices_for_wiki` now does one
`acuwt_timeslice_for` lookup (already used by the downstream ACT/CUWT/CWT
methods) and delegates to `bulk_upsert_from_revisions`. The per-group loop
and the repeated CWT timeslice lookup (previously N calls, one per group)
are gone.

`app/services/update_timeslices_course_user.rb`:
`create_acuwt_records_for_timeslice` collapsed from a per-group loop to a
single `bulk_upsert_from_revisions` call — the timeslice boundaries are
already known from the `cwt` argument, so no lookup is needed.

## Process

The user asked to improve the `acuwt_updated` step, which accounted for 4.2
minutes (16% of total) in a full course update benchmark. Analysis of the
write path identified the per-row `find_or_create_by + save` loop as the
bottleneck. Claude Code proposed `upsert_all` and handled the RuboCop ABC
split. First attempt included `unique_by:` which raised
`Mysql2Adapter does not support :unique_by` on MySQL; removing it fixed the
issue. Benchmarks after the fix confirmed `acuwt_updated` dropped from 4.2m
to 0.3m (16× speedup), reducing the full update from 42.7m to 35.6m.

Session: extended back-and-forth (~30 user messages) covering index design,
disk-space analysis, benchmark interpretation, and the upsert_all
implementation. User provided benchmark logs and error messages; direction
was terse (a few words to a sentence per message). One failure before green:
the `unique_by:` MySQL error caught during a live test run.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
`app/models/article_course_timeslice.rb`:
Adds `bulk_update_from_acuwt` class method as a faster replacement for the
per-article `update_from_acuwt` loop. The old path issued ~9 DB round-trips
per article (find_or_create_by + 6 aggregate queries + save); the new path
loads all ACUWT records for the timeslice in one SELECT, aggregates stats in
Ruby, and writes all ACT rows in one `upsert_all`. Two private helpers split
the computation to stay within RuboCop limits:
- `act_records_from_acuwt` — fetches and groups ACUWT by article_id, builds
  attribute hashes
- `act_stats_from_acuwt` — computes per-article stats (revision_count,
  character_sum, references_count, user_ids, new_article, first_revision)

`user_ids` is `serialize :user_ids, type: Array`; Rails 8.1's `upsert_all`
handles the YAML serialization automatically through the attribute type system.
`unique_by:` is omitted (MySQL does not support it — existing unique index on
`(article_id, course_id, start, end)` handles conflict resolution via
`ON DUPLICATE KEY UPDATE`).
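
The in-Ruby aggregation described above can be sketched as follows. `AcuwtRow` is a hypothetical struct standing in for the ActiveRecord rows, and the aggregation rules (plain sums, earliest `first_revision`, and dropping users with zero revisions from `user_ids`) are assumptions inferred from this PR's description and specs, not a copy of `act_stats_from_acuwt`. The key columns (`course_id`, `start`, `end`) are omitted for brevity.

```ruby
# Hypothetical stand-in for an ACUWT row; not the real ActiveRecord model.
AcuwtRow = Struct.new(:article_id, :user_id, :revision_count, :character_sum,
                      :references_count, :new_article, :first_revision,
                      keyword_init: true)

# Groups already-loaded ACUWT rows by article and aggregates them in memory,
# so no per-article aggregate queries are needed.
def act_attribute_hashes(acuwt_rows)
  acuwt_rows.group_by(&:article_id).map do |article_id, rows|
    # Assumption: only users with at least one revision count as editors.
    active = rows.select { |row| row.revision_count.positive? }
    {
      article_id: article_id,
      revision_count: rows.sum(&:revision_count),
      character_sum: rows.sum(&:character_sum),
      references_count: rows.sum(&:references_count),
      user_ids: active.map(&:user_id).uniq,
      new_article: rows.any?(&:new_article),
      first_revision: rows.map(&:first_revision).compact.min
    }
  end
end

rows = [
  AcuwtRow.new(article_id: 5, user_id: 1, revision_count: 3, character_sum: 400,
               references_count: 2, new_article: true,
               first_revision: Time.utc(2024, 3, 1)),
  AcuwtRow.new(article_id: 5, user_id: 2, revision_count: 2, character_sum: 100,
               references_count: 0, new_article: false,
               first_revision: Time.utc(2024, 3, 4)),
  AcuwtRow.new(article_id: 5, user_id: 3, revision_count: 0, character_sum: 0,
               references_count: 0, new_article: false, first_revision: nil)
]

act_records = act_attribute_hashes(rows)
```

The resulting array holds one hash per article, which is the shape a single `upsert_all` call expects.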

`app/services/update_course_wiki_timeslices.rb`:
- `update_article_course_timeslices_from_acuwt_for_wiki`: replaces the
  per-article loop (N calls to `update_from_acuwt`) with one
  `bulk_update_from_acuwt` call.
- `reaggregate_timeslice_from_acuwt`: same replacement for the ACT portion.
  The pre-fetched `acuwt` relation is removed; CUWT still uses a direct
  `pluck(:user_id)` query since that path has not been bulk-optimized yet.

The user asked whether `act_updated` could be made faster after observing it
at 5.6m in the ACUWT path vs 3.9m in the default path. Claude Code identified
that the same per-record pattern fixed for `acuwt_updated` (via upsert_all)
applied here too. Implementation was straightforward — one RuboCop offense
(`> 0` → `.positive?`) was caught and auto-corrected. Benchmarks confirmed
`act_updated` dropped from 5.6m to 0.1m (56×), bringing the ACUWT full
update below the default path total (29.8m vs 33.9m).

The reaggregation path also benefited: 20-CWT reaggregation dropped from
3.3m to 0.1m for add-user, and from 2.2m to 0.1m for remove-user, because
each CWT's ACT step now does a single bulk upsert instead of N×9 queries.

Session: extended (~50 user messages), following directly from the
upsert_all-for-ACUWT work. Terse user direction (a few words per message).
One RuboCop failure; green on first spec run. Benchmarks were run externally
by the user and shared as log JSON.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Changes

`app/models/article_course_timeslice.rb`:
Removes `update_from_acuwt` class method and `update_cache_from_acuwt`
instance method, both superseded by `bulk_update_from_acuwt`. Neither is
called anywhere in the hot path after the previous commit.

`spec/models/article_course_timeslice_spec.rb`:
Replaces the two dead describe blocks (`.update_from_acuwt` and
`#update_cache_from_acuwt`) with a single `.bulk_update_from_acuwt` block.
The fixture data is consolidated — user3 (revision_count: 0) is included
in the shared before block so the user_ids exclusion behavior is tested
alongside the aggregate field checks.

`spec/services/update_course_wiki_timeslices_spec.rb`:
Updates the allow/expect stub for `ArticleCourseTimeslice` from
`update_from_acuwt(course, article_id, wiki, ...)` to
`bulk_update_from_acuwt(course, wiki, ...)` to match the new call signature.

## Process

Straightforward cleanup following the bulk_update_from_acuwt commit.
Spec changes were immediate — one spec run, 30 examples, 0 failures.

(Commit message written by Claude Code.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gabina gabina changed the title [WIP] Add indexes to article_course_user_wiki_timeslices Add indexes to article_course_user_wiki_timeslices Jun 5, 2026
@gabina gabina marked this pull request as ready for review June 5, 2026 18:35
@gabina gabina changed the title Add indexes to article_course_user_wiki_timeslices [ACUWT] Add indexes to article_course_user_wiki_timeslices Jun 5, 2026
@gabina gabina merged commit d97c5be into WikiEducationFoundation:article-course-user-wiki-timeslices Jun 5, 2026
1 check passed