In the CLIP-guided version, were GLIDE (filtered) and CLIP trained together?

During training, were GLIDE (filtered) and CLIP trained together? Or were they trained separately and only used together at inference?