
feat: add AION and SpecCLIP spectral foundation model adapters#46

Open
ksd3 wants to merge 7 commits into main from feat/spectral-model-adapters

Conversation


@ksd3 ksd3 commented Apr 12, 2026

Summary

  • Integrate AION (Polymathic AI, base 300M / large 900M / xlarge 3.1B) as an in-domain spectral foundation model for DESI spectra
  • Integrate SpecCLIP (43M, LAMOST LRS encoder) as a domain-mismatch control experiment
  • Register both in the experiment pipeline alongside the existing SpecFormer adapter

What was there before

  • SpecFormer (AstroCLIP, 43M) was the only spectral foundation model
  • No scaling analysis possible (single model, single size)
  • No experimental control for domain-mismatch effects

What changed

New files

| File | Description |
| --- | --- |
| `src/pu/models/aion.py` | AION adapter: uses the polymathic-aion codec to tokenize DESI spectra, then a transformer encoder to produce embeddings. Supports base/large/xlarge sizes and fp16 loading. |
| `src/pu/models/specclip.py` | SpecCLIP adapter: resamples DESI spectra to the LAMOST wavelength grid (3700–9100 Å, 1462 pixels) via `np.interp`. Intentional out-of-domain control. |
| `src/pu/models/specclip_arch.py` | Bundled SpecCLIP encoder architecture (`SpecFormerControl20_wstd` as a plain `nn.Module`, no Lightning dependency). |

Modified files

| File | Change |
| --- | --- |
| `src/pu/models/__init__.py` | Register `aion` and `specclip` adapters |
| `src/pu/experiments.py` | Extend the `is_spectral_model` check; add AION (base/large) and SpecCLIP to `model_map` |
| `pyproject.toml` | Add `matplotlib>=3.8.0` |
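The registration changes above follow a standard adapter-registry pattern. A minimal sketch of how such a registry and the extended `is_spectral_model` check could work (the names `ADAPTERS`, `register`, and `SPECTRAL_MODELS` are illustrative, not the project's actual API):

```python
# Hypothetical sketch of an adapter registry; names are illustrative.
ADAPTERS = {}

def register(name):
    """Class decorator that records an adapter under a short name."""
    def deco(cls):
        ADAPTERS[name] = cls
        return cls
    return deco

@register("aion")
class AIONAdapter:
    def __init__(self, size="base"):
        self.size = size

@register("specclip")
class SpecCLIPAdapter:
    pass

# Spectral-model check: match on the model family before any size suffix.
SPECTRAL_MODELS = {"specformer", "aion", "specclip"}

def is_spectral_model(name: str) -> bool:
    return name.split("-")[0] in SPECTRAL_MODELS

print(is_spectral_model("aion-base"))   # True
print(is_spectral_model("vit-large"))   # False
```

The registry lets `experiments.py` build its `model_map` by name lookup rather than hard-coded imports, which is why adding a model only touches `__init__.py` and the map entries.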

Design decisions

Why AION? AION was trained on DESI spectra — it's in-domain for the existing data pipeline. Three sizes (300M/900M/3.1B) enable scaling analysis matching the pattern used for vision models (ViT base/large/huge, DINOv2 small/base/large/giant). All three sizes were updated on HuggingFace two days before this work.

Why SpecCLIP as a control? SpecCLIP was trained on LAMOST spectra. There is no LAMOST data on HuggingFace to stream, and no LAMOST-HSC crossmatch exists. The paper (Section 5.2) found that wavelength resampling across instruments fails due to systematic differences (calibration, detector, observing conditions). Running SpecCLIP on DESI spectra serves as an intentional negative control — confirming that domain mismatch makes embeddings degenerate.
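The resampling step itself is a single `np.interp` call onto the fixed LAMOST grid. A self-contained sketch (the LAMOST grid of 1462 pixels over 3700–9100 Å is stated above; the DESI source grid and flux here are stand-ins, not the real pipeline's values):

```python
import numpy as np

# Stand-in DESI spectrum: wavelength grid and a toy emission line.
# The real pipeline's DESI grid differs; only the target grid below
# (3700-9100 A, 1462 pixels) comes from the PR description.
desi_wave = np.linspace(3600.0, 9824.0, 7781)
desi_flux = np.exp(-0.5 * ((desi_wave - 6563.0) / 10.0) ** 2)  # toy H-alpha

# LAMOST LRS target grid expected by the SpecCLIP encoder.
lamost_wave = np.linspace(3700.0, 9100.0, 1462)

# Linear interpolation onto the target grid.
resampled = np.interp(lamost_wave, desi_wave, desi_flux)
print(resampled.shape)  # (1462,)
```

Note that this only matches pixel positions; it cannot remove the instrument-level systematics (calibration, detector response, observing conditions) that the SpecCLIP paper identifies, which is exactly why the adapter serves as a negative control.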

Minimum reproducible example

```python
from pu.models.aion import AIONAdapter, PreprocessAION
from datasets import load_dataset
import torch, numpy as np

# Load one DESI spectrum
ds = load_dataset("Smith42/desi_hsc_crossmatched", split="train", streaming=True)
sample = next(iter(ds))

# Preprocess and embed
preproc = PreprocessAION(["desi"])
result = preproc(sample)
adapter = AIONAdapter("polymathic-ai/aion-base", "base", alias="aion")
adapter.load()
batch = {k: torch.from_numpy(np.stack([v])) for k, v in result.items()}
emb = adapter.embed_for_mode(batch, "desi")
print(f"Embedding shape: {emb.shape}")
```

Expected output

```
Embedding shape: torch.Size([1, 768])
```

Full pipeline

```bash
# Generate AION embeddings on full DESI dataset (20,465 spectra, streamed)
pu run --model aion --mode desi

# Generate SpecCLIP embeddings (domain-mismatch control)
pu run --model specclip --mode desi
```
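Since the SpecCLIP run is meant to confirm that domain mismatch degenerates the embeddings, it helps to have a quantitative collapse check. One simple metric (not part of this PR; the function name and thresholds are illustrative) is the fraction of total variance captured by the top principal direction, which approaches 1.0 as embeddings collapse onto a single axis:

```python
import numpy as np

def embedding_collapse_score(emb: np.ndarray) -> float:
    """Fraction of total variance along the top principal direction.

    Values near 1/dim indicate well-spread embeddings; values near
    1.0 indicate degenerate (collapsed) embeddings.
    """
    centered = emb - emb.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())

# Synthetic demonstration: isotropic vs. rank-1 (collapsed) embeddings.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(200, 32))
collapsed = np.outer(rng.normal(size=200), np.ones(32))

healthy_score = embedding_collapse_score(healthy)
collapsed_score = embedding_collapse_score(collapsed)
```

Running this metric over the AION and SpecCLIP embedding sets would give a concrete number backing the "embeddings degenerate" claim, rather than relying on downstream accuracy alone.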

Test plan

  • All 167 existing tests pass (uv run pytest)
  • AION-base: 20,465 embeddings (dim=768) generated on full DESI dataset
  • AION-large: 20,465 embeddings (dim=1024) generated
  • SpecCLIP: 20,465 embeddings (dim=768) generated
  • Follows existing adapter/registry pattern exactly
  • Streaming only — no data downloads

ksd3 added 7 commits April 12, 2026 11:10
Adds AIONAdapter for the Polymathic AI AION multimodal foundation model.
Uses ConvNeXt1d codec to tokenize DESI spectra, then a transformer
encoder to produce continuous embeddings. Supports base (300M),
large (900M), and xlarge (3B) sizes for scaling analysis.
AION is in-domain for DESI spectra.

Bundles SpecFormerControl20_wstd from the SpecCLIP repo as a standalone
nn.Module (no Lightning dependency). Key differences from AstroCLIP
SpecFormer: pad=(1,0,1,0), stats token stores log10(std) only.
Reuses transformer building blocks from specformer_arch.py.

Adds SpecCLIPAdapter with PreprocessSpecCLIP that resamples DESI spectra
to the LAMOST wavelength grid (1462 pixels, 3700-9100 A). SpecCLIP was
trained on LAMOST LRS spectra, so running it on DESI data is an
intentional out-of-domain control experiment. Architecture bundled from
the SpecCLIP repo with matching state_dict key names.

Register aion and specclip adapters in models/__init__.py.
Update experiments.py to recognize both as spectral models and
add model_map entries for AION (3 sizes) and SpecCLIP.

Adds half=True parameter to AIONAdapter.load() for loading xlarge (3B)
model on GPUs with limited VRAM.

Smith42 commented Apr 14, 2026

@ksd3 there are some conflicts

