feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming (re-added)#216

Merged
kaylawilding merged 1 commit into develop from revert-215-revert-208-feat/legacy-school-classifier
Mar 11, 2026

@chapmanhk
Contributor

feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming

changes

  • Institutions API

    • Add legacy_id to institution model and create/update flows. Enforce mutual exclusivity: at most one of pdp_id, edvise_id, or legacy_id per institution via has_at_most_one_school_type().
    • Auto-assign edvise_id / legacy_id (e.g. edvise_N, legacy_N) on create when is_edvise / is_legacy is set and no id provided. Reject create when more than one school type is requested.
  • Validation (data upload)

    • Legacy path: For institutions with legacy_id, use schema_namespace = "legacy": any CSV format (encoding + read only, no schema validation), then PII column check before moving to raw/validated. student_id is excluded from PII (treated as non-PII).
    • Legacy + arbitrary filenames: Fetch the institution before filename inference. When the filename does not imply student/course/semester, legacy schools get allowed_schemas = ["UNKNOWN"] instead of a 422; non-legacy schools still receive a 422 for non-descriptive filenames.
    • validation_helper refactor: Split into helpers under 50 lines; add full docstrings (Args/Returns/Raises), early validation (empty file name 422, invalid inst_id 404 with logging), and module-level constants for cache TTLs.
  • Naming

    • Use “Edvise Schema (ES)” in user-facing messages, docstrings, and comments where the schema type (not the product) is meant.
  • Tests

    • Institutions: explicit legacy_id on create, PATCH to add legacy_id, bucket/Databricks failures, reject both edvise + legacy.
    • Validation/data: legacy header-only CSV, legacy PII → 400, legacy arbitrary filename → 200 with file_types: ["UNKNOWN"], empty file name → 422, invalid inst_id → 404, edvise non-descriptive filename → 422, duplicate validate idempotent.
    • Unit tests for _infer_allowed_schemas_from_filename and _ext_models_set; utilities test for has_at_most_one_school_type.
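
The mutual-exclusivity rule and the auto-assigned ids described above can be sketched roughly as follows. This is an illustration only: the PR names has_at_most_one_school_type() but does not show its signature, so the three-argument form here is an assumption, and next_auto_id is a hypothetical helper that picks the next free N for ids of the form edvise_N / legacy_N.

```python
import re
from typing import Iterable, Optional


def has_at_most_one_school_type(
    pdp_id: Optional[str],
    edvise_id: Optional[str],
    legacy_id: Optional[str],
) -> bool:
    """Return True when zero or one of the three school-type ids is set.

    Assumed signature; the real utility may take a model or dict instead.
    """
    return sum(1 for value in (pdp_id, edvise_id, legacy_id) if value) <= 1


def next_auto_id(prefix: str, existing_ids: Iterable[str]) -> str:
    """Hypothetical helper: build the next '<prefix>_N' id.

    Ids that do not match '<prefix>_<digits>' are ignored.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    used = [int(m.group(1)) for i in existing_ids if (m := pattern.match(i))]
    return f"{prefix}_{max(used, default=0) + 1}"
```

In a create handler, a request that sets both is_edvise and is_legacy would fail the first check and be rejected before any id is assigned.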

context

  • Legacy schools need to upload CSVs without conforming to PDP/Edvise schema or filename conventions. This branch introduces a dedicated “legacy” institution type and validation path: encoding + CSV read + PII check only, with no schema validation.
  • Arbitrary filenames for legacy schools avoid 422s when filenames don’t include keywords like “student” or “course”; those files are stored with schema type UNKNOWN. Downstream steps (EDA, model runs) that require STUDENT/COURSE still behave as before (404/400 when only UNKNOWN is present).
  • PII check prevents legacy uploads with columns that look like PII (e.g. email, SSN) from being written to raw/validated.
  • Edvise Schema (ES) naming reduces confusion between the schema type and the product name in API and logs.
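
A minimal sketch of the legacy PII column check, assuming it works on column names only. The pattern list below is invented for illustration (the PR gives only email and SSN as examples), and find_pii_columns and read_csv_header are hypothetical names; the one detail taken from the PR is that student_id is treated as non-PII.

```python
import csv
import io
import re
from typing import List

# Assumed patterns; the real list lives in the codebase and may differ.
PII_PATTERNS = [
    r"e[-_ ]?mail",
    r"ssn",
    r"social[-_ ]?security",
    r"phone",
    r"birth",
    r"address",
]
# Per the PR, student_id is excluded from the PII check.
NON_PII_COLUMNS = {"student_id"}


def read_csv_header(raw: bytes, encoding: str = "utf-8-sig") -> List[str]:
    """Decode the upload and return its header row ([] for an empty file)."""
    reader = csv.reader(io.StringIO(raw.decode(encoding)))
    return next(reader, [])


def find_pii_columns(columns: List[str]) -> List[str]:
    """Return the columns whose names look like PII."""
    flagged = []
    for col in columns:
        normalized = col.strip().lower()
        if normalized in NON_PII_COLUMNS:
            continue
        if any(re.search(pattern, normalized) for pattern in PII_PATTERNS):
            flagged.append(col)
    return flagged
```

An upload handler could return a 400 whenever find_pii_columns(read_csv_header(raw)) is non-empty, and otherwise write the file to raw/validated. A header-only CSV passes, since only column names are inspected.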

deployment

Before or as part of deploying this branch, the database schema must include the legacy_id column on the institution table. If it is not already present, run:

ALTER TABLE inst ADD COLUMN legacy_id VARCHAR(36) NULL;

(This matches pdp_id / edvise_id in the schema; VAR_CHAR_LENGTH is 36 in the codebase.)
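
The "if it is not already present" condition can be made explicit with a small idempotent migration. The sketch below uses Python's built-in sqlite3 purely so that it is runnable; the production database engine is not stated in this PR, and the column lookup (PRAGMA table_info here) would need the equivalent catalog query for that engine. ensure_legacy_id_column is a hypothetical name.

```python
import sqlite3


def ensure_legacy_id_column(conn: sqlite3.Connection) -> bool:
    """Add inst.legacy_id if it is missing; return True when it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(inst)")}
    if "legacy_id" in columns:
        return False
    conn.execute("ALTER TABLE inst ADD COLUMN legacy_id VARCHAR(36) NULL")
    conn.commit()
    return True
```

Running it twice is safe: the second call finds the column and does nothing.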

questions

None


Note

Medium Risk
Medium risk because it changes institution creation/update rules and the core upload validation flow, including a new bypass path that accepts arbitrary CSVs (gated by legacy_id) and new PII-based rejections.

Overview
Adds Legacy schools by introducing InstTable.legacy_id and exposing it through institutions read/create/update responses, with enforced mutual exclusivity across pdp_id, edvise_id, and legacy_id plus optional auto-assignment of edvise_id/legacy_id on create.

Refactors upload validation routing (validation_helper) into smaller helpers, adds stricter input validation (empty filenames, invalid institution IDs), and introduces a legacy validation path (institution_id="legacy") that skips schema validation, reads the CSV as-is, and blocks uploads whose column names look like PII.
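
The filename routing in this path might look roughly like the sketch below. It is not the actual _infer_allowed_schemas_from_filename: the keyword map, the SEMESTER schema name, the is_legacy flag, and the UploadRejected exception are all assumptions made so the example is self-contained. Only the behavior is taken from the PR: an empty name gives a 422, a non-descriptive name gives ["UNKNOWN"] for legacy schools and a 422 otherwise.

```python
from typing import List


class UploadRejected(Exception):
    """Hypothetical stand-in for the API's HTTP error type."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


# Assumed keyword-to-schema mapping.
FILENAME_KEYWORDS = {
    "student": "STUDENT",
    "course": "COURSE",
    "semester": "SEMESTER",
}


def infer_allowed_schemas(file_name: str, is_legacy: bool) -> List[str]:
    """Infer schema types from the filename, with a legacy fallback."""
    if not file_name.strip():
        raise UploadRejected(422, "File name must not be empty.")
    lowered = file_name.lower()
    matches = [
        schema for keyword, schema in FILENAME_KEYWORDS.items()
        if keyword in lowered
    ]
    if matches:
        return matches
    if is_legacy:
        # Legacy schools may upload under any filename.
        return ["UNKNOWN"]
    raise UploadRejected(422, "File name does not indicate a schema type.")
```

Fetching the institution first, as the PR describes, is what makes is_legacy available before the filename is inspected.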

Updates documentation/messages to consistently use “Edvise Schema (ES)”, adjusts PII detection to treat student_id as non-PII, and expands tests/fixtures to cover legacy behavior, new error cases, and updated masking expectations.

Written by Cursor Bugbot for commit 9f01339.


@chapmanhk chapmanhk requested a review from vishpillai123 March 7, 2026 19:02
@kaylawilding kaylawilding merged commit faebf9e into develop Mar 11, 2026
5 checks passed