
feat: add AION and SpecCLIP spectral foundation model adapters#46

Open
ksd3 wants to merge 7 commits into main from feat/spectral-model-adapters

Conversation


@ksd3 ksd3 commented Apr 12, 2026

Summary

  • Integrate AION (Polymathic AI, base 300M / large 900M / xlarge 3.1B) as an in-domain spectral foundation model for DESI spectra
  • Integrate SpecCLIP (43M, LAMOST LRS encoder) as a domain-mismatch control experiment
  • Register both in the experiment pipeline alongside the existing SpecFormer adapter

What was there before

  • SpecFormer (AstroCLIP, 43M) was the only spectral foundation model
  • No scaling analysis possible (single model, single size)
  • No experimental control for domain-mismatch effects

What changed

New files

| File | Description |
| --- | --- |
| `src/pu/models/aion.py` | AION adapter: uses the polymathic-aion codec to tokenize DESI spectra, then a transformer encoder to produce embeddings. Supports base/large/xlarge sizes and fp16 loading. |
| `src/pu/models/specclip.py` | SpecCLIP adapter: resamples DESI spectra to the LAMOST wavelength grid (3700–9100 Å, 1462 pixels) via `np.interp`. Intentional out-of-domain control. |
| `src/pu/models/specclip_arch.py` | Bundled SpecCLIP encoder architecture (`SpecFormerControl20_wstd` as a plain `nn.Module`, no Lightning dependency). |

Modified files

| File | Change |
| --- | --- |
| `src/pu/models/__init__.py` | Register `aion` and `specclip` adapters |
| `src/pu/experiments.py` | Extend the `is_spectral_model` check; add AION (base/large) and SpecCLIP to `model_map` |
| `pyproject.toml` | Add `matplotlib>=3.8.0` |
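The registration changes above follow a standard adapter-registry pattern. A minimal sketch of how such a registry and the extended `is_spectral_model` check could work (the names `ADAPTERS`, `register`, and `SPECTRAL_MODELS` are illustrative, not the project's actual API):

```python
# Hypothetical sketch of an adapter registry; names are illustrative.
ADAPTERS = {}

def register(name):
    """Class decorator that records an adapter under a short name."""
    def deco(cls):
        ADAPTERS[name] = cls
        return cls
    return deco

@register("aion")
class AIONAdapter:
    def __init__(self, size="base"):
        self.size = size

@register("specclip")
class SpecCLIPAdapter:
    pass

# Spectral-model check: match on the model family before any size suffix.
SPECTRAL_MODELS = {"specformer", "aion", "specclip"}

def is_spectral_model(name: str) -> bool:
    return name.split("-")[0] in SPECTRAL_MODELS

print(is_spectral_model("aion-base"))   # True
print(is_spectral_model("vit-large"))   # False
```

The registry lets `experiments.py` build its `model_map` by name lookup rather than hard-coded imports, which is why adding a model only touches `__init__.py` and the map entries.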

Design decisions

Why AION? AION was trained on DESI spectra — it's in-domain for the existing data pipeline. Three sizes (300M/900M/3.1B) enable scaling analysis matching the pattern used for vision models (ViT base/large/huge, DINOv2 small/base/large/giant). All three sizes were updated on HuggingFace two days before this work.

Why SpecCLIP as a control? SpecCLIP was trained on LAMOST spectra. There is no LAMOST data on HuggingFace to stream, and no LAMOST-HSC crossmatch exists. The paper (Section 5.2) found that wavelength resampling across instruments fails due to systematic differences (calibration, detector, observing conditions). Running SpecCLIP on DESI spectra serves as an intentional negative control — confirming that domain mismatch makes embeddings degenerate.
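The resampling step itself is a single `np.interp` call onto the fixed LAMOST grid. A self-contained sketch (the LAMOST grid of 1462 pixels over 3700–9100 Å is stated above; the DESI source grid and flux here are stand-ins, not the real pipeline's values):

```python
import numpy as np

# Stand-in DESI spectrum: wavelength grid and a toy emission line.
# The real pipeline's DESI grid differs; only the target grid below
# (3700-9100 A, 1462 pixels) comes from the PR description.
desi_wave = np.linspace(3600.0, 9824.0, 7781)
desi_flux = np.exp(-0.5 * ((desi_wave - 6563.0) / 10.0) ** 2)  # toy H-alpha

# LAMOST LRS target grid expected by the SpecCLIP encoder.
lamost_wave = np.linspace(3700.0, 9100.0, 1462)

# Linear interpolation onto the target grid.
resampled = np.interp(lamost_wave, desi_wave, desi_flux)
print(resampled.shape)  # (1462,)
```

Note that this only matches pixel positions; it cannot remove the instrument-level systematics (calibration, detector response, observing conditions) that the SpecCLIP paper identifies, which is exactly why the adapter serves as a negative control.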

Minimum reproducible example

```python
from pu.models.aion import AIONAdapter, PreprocessAION
from datasets import load_dataset
import torch, numpy as np

# Load one DESI spectrum
ds = load_dataset("Smith42/desi_hsc_crossmatched", split="train", streaming=True)
sample = next(iter(ds))

# Preprocess and embed
preproc = PreprocessAION(["desi"])
result = preproc(sample)
adapter = AIONAdapter("polymathic-ai/aion-base", "base", alias="aion")
adapter.load()
batch = {k: torch.from_numpy(np.stack([v])) for k, v in result.items()}
emb = adapter.embed_for_mode(batch, "desi")
print(f"Embedding shape: {emb.shape}")
```

Expected output

```
Embedding shape: torch.Size([1, 768])
```

Full pipeline

```bash
# Generate AION embeddings on full DESI dataset (20,465 spectra, streamed)
pu run --model aion --mode desi

# Generate SpecCLIP embeddings (domain-mismatch control)
pu run --model specclip --mode desi
```
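Since the SpecCLIP run is meant to confirm that domain mismatch degenerates the embeddings, it helps to have a quantitative collapse check. One simple metric (not part of this PR; the function name and thresholds are illustrative) is the fraction of total variance captured by the top principal direction, which approaches 1.0 as embeddings collapse onto a single axis:

```python
import numpy as np

def embedding_collapse_score(emb: np.ndarray) -> float:
    """Fraction of total variance along the top principal direction.

    Values near 1/dim indicate well-spread embeddings; values near
    1.0 indicate degenerate (collapsed) embeddings.
    """
    centered = emb - emb.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())

# Synthetic demonstration: isotropic vs. rank-1 (collapsed) embeddings.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(200, 32))
collapsed = np.outer(rng.normal(size=200), np.ones(32))

healthy_score = embedding_collapse_score(healthy)
collapsed_score = embedding_collapse_score(collapsed)
```

Running this metric over the AION and SpecCLIP embedding sets would give a concrete number backing the "embeddings degenerate" claim, rather than relying on downstream accuracy alone.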

Test plan

  • All 167 existing tests pass (uv run pytest)
  • AION-base: 20,465 embeddings (dim=768) generated on full DESI dataset
  • AION-large: 20,465 embeddings (dim=1024) generated
  • SpecCLIP: 20,465 embeddings (dim=768) generated
  • Follows existing adapter/registry pattern exactly
  • Streaming only — no data downloads

ksd3 added 7 commits April 12, 2026 11:10
Adds AIONAdapter for the Polymathic AI AION multimodal foundation model.
Uses ConvNeXt1d codec to tokenize DESI spectra, then a transformer
encoder to produce continuous embeddings. Supports base (300M),
large (900M), and xlarge (3B) sizes for scaling analysis.
AION is in-domain for DESI spectra.

Bundles SpecFormerControl20_wstd from the SpecCLIP repo as a standalone
nn.Module (no Lightning dependency). Key differences from AstroCLIP
SpecFormer: pad=(1,0,1,0), stats token stores log10(std) only.
Reuses transformer building blocks from specformer_arch.py.

Adds SpecCLIPAdapter with PreprocessSpecCLIP that resamples DESI spectra
to the LAMOST wavelength grid (1462 pixels, 3700-9100 A). SpecCLIP was
trained on LAMOST LRS spectra, so running it on DESI data is an
intentional out-of-domain control experiment. Architecture bundled from
the SpecCLIP repo with matching state_dict key names.

Register aion and specclip adapters in models/__init__.py.
Update experiments.py to recognize both as spectral models and
add model_map entries for AION (3 sizes) and SpecCLIP.

Adds half=True parameter to AIONAdapter.load() for loading xlarge (3B)
model on GPUs with limited VRAM.

Smith42 commented Apr 14, 2026

@ksd3 there are some conflicts

