llama.cpp Qwen2.5-0.5B-Rag-Thinking-Flan-T5
- I use a forked llama-cpp-python that supports T5 in server mode; it does not support newer models (such as Gemma 3).
- Search-query generation (query reformulation): I use flan-t5-base (flan-t5-large gives better results, but it is too large for just this task).
- Qwen2.5-0.5B performs well for its small size.
- In any case, Google's T5 series is impressive on CPU.
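The two-stage flow above (flan-t5-base reformulates the question into a search query, then Qwen2.5-0.5B answers over the retrieved context) can be sketched as below. The prompt template and message layout are assumptions for illustration, not this Space's actual code.

```python
# Sketch of the two-stage RAG flow: T5 query reformulation, then chat.
# Prompt wording and context injection are hypothetical.

def build_reformulation_prompt(question: str) -> str:
    # flan-t5 models respond well to a plain instruction prefix
    return f"Generate a search query for: {question}"

def build_chat_messages(question: str, retrieved: list[str]) -> list[dict]:
    # Retrieved passages are injected as context for Qwen2.5-0.5B
    context = "\n".join(retrieved)
    return [
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ]
```

Each prompt would then be sent to the corresponding model served by the forked llama-cpp-python server.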
Hugging Face Free CPU Limitations
- When duplicating a Space, the build process can occasionally get stuck and require a manual restart to finish.
- Spaces may unexpectedly stop working or even be deleted, requiring them to be rebuilt. See the related issue for details.
Chat settings (UI sliders)
- Model: select the AI model to use for chat
- Slider ranges: 1024–8192, 0.1–2, 0.1–1, 1–100, 1–2 (presumably max tokens, temperature, top-p, top-k, and repetition penalty)
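A request payload using those slider ranges might look like the sketch below. The parameter names follow llama.cpp server conventions (`top_k` and `repeat_penalty` are llama.cpp extensions to the OpenAI-style API); which slider maps to which parameter is an assumption.

```python
# Hypothetical chat-completion payload for a llama-cpp-python server.
# Defaults and range checks mirror the UI slider ranges above.

def build_payload(messages, max_tokens=2048, temperature=0.7,
                  top_p=0.9, top_k=40, repeat_penalty=1.1):
    # Guard against values outside the UI's slider ranges
    assert 1024 <= max_tokens <= 8192
    assert 0.1 <= temperature <= 2.0
    assert 0.1 <= top_p <= 1.0
    assert 1 <= top_k <= 100
    assert 1.0 <= repeat_penalty <= 2.0
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repeat_penalty,
    }
```

The resulting dict would be POSTed to the server's `/v1/chat/completions` endpoint.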