The CPUOptimizerOffload class is very clever, but it relies too heavily on CUDA streams, which aren't available without a CUDA device. It should use `torch.cpu.Stream` and `torch.cpu.current_stream` instead. Additionally, memory should only be pinned when CUDA is available, i.e. `pin_memory=torch.cuda.is_available()`, as MPS is a unified memory architecture.