In the training phase, were GLIDE (filtered) and CLIP trained together? Or they were trained separately but when inference, they are used together?