llama.cpp Qwen2.5-0.5B-Rag-Thinking-Flan-T5
- I use a forked llama-cpp-python that supports T5 in server mode; it does not support newer models (such as Gemma 3).
- Search-query generation (query reformulation): I use flan-t5-base (flan-t5-large gives better results, but it is too large for just this task).
- Qwen2.5-0.5B performs well for its small size.
- In any case, Google's T5 series is impressive on CPU.
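The two-stage flow above (flan-t5-base reformulates the question into a search query, then Qwen2.5-0.5B answers over the retrieved context) can be sketched as below. The prompt template and message layout are assumptions for illustration, not this Space's actual code.

```python
# Sketch of the two-stage RAG flow: T5 query reformulation, then chat.
# Prompt wording and context injection are hypothetical.

def build_reformulation_prompt(question: str) -> str:
    # flan-t5 models respond well to a plain instruction prefix
    return f"Generate a search query for: {question}"

def build_chat_messages(question: str, retrieved: list[str]) -> list[dict]:
    # Retrieved passages are injected as context for Qwen2.5-0.5B
    context = "\n".join(retrieved)
    return [
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ]
```

Each prompt would then be sent to the corresponding model served by the forked llama-cpp-python server.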
Hugging Face Free CPU Limitations
- When duplicating a Space, the build process can occasionally get stuck and require a manual restart to finish.
- Spaces may unexpectedly stop working or even be deleted, requiring them to be rebuilt. See the related issue for details.
Chat settings (UI sliders)
- Model: select the AI model to use for chat
- Slider ranges: 1024–8192, 0.1–2, 0.1–1, 1–100, 1–2 (presumably max tokens, temperature, top-p, top-k, and repetition penalty)
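A request payload using those slider ranges might look like the sketch below. The parameter names follow llama.cpp server conventions (`top_k` and `repeat_penalty` are llama.cpp extensions to the OpenAI-style API); which slider maps to which parameter is an assumption.

```python
# Hypothetical chat-completion payload for a llama-cpp-python server.
# Defaults and range checks mirror the UI slider ranges above.

def build_payload(messages, max_tokens=2048, temperature=0.7,
                  top_p=0.9, top_k=40, repeat_penalty=1.1):
    # Guard against values outside the UI's slider ranges
    assert 1024 <= max_tokens <= 8192
    assert 0.1 <= temperature <= 2.0
    assert 0.1 <= top_p <= 1.0
    assert 1 <= top_k <= 100
    assert 1.0 <= repeat_penalty <= 2.0
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repeat_penalty,
    }
```

The resulting dict would be POSTed to the server's `/v1/chat/completions` endpoint.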