CUDA implementation of ggml_clamp

ggml_clamp currently lacks a CUDA implementation, which is rather trivial, but something I ran into while porting MPT to llama.cpp. I have a ready implementation and will create a PR for it, referencing this issue.