FitMyLLM — Independent benchmarks for self-hosted AI
(www.fitmyllm.com)
from anzo@programming.dev to degoogle@lemmy.ml on 03 Jun 10:58
https://programming.dev/post/51407596
cross-posted from: programming.dev/post/51407459
Check which models you can run and how many tokens per second to expect. It has examples for many models and quantization levels. Huge resource!
What does this have to do with degoogling?
Gemini is one of the most used LLMs. This shows alternatives.
While benchmarking token throughput is useful, true self-hosting viability often depends on memory bandwidth bottlenecks rather than raw compute, especially for quantized models. Have you evaluated how different quantization levels impact inference latency on consumer-grade GPUs compared to the reported token-per-second figures?
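To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream decoding where every weight is read from memory once per generated token, and it ignores the KV cache, activations, and kernel overhead, so the result is only an upper bound and not a prediction. The function names, the 7B parameter count, and the 500 GB/s bandwidth figure are illustrative assumptions, not values taken from FitMyLLM.

```python
def model_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint; ignores quantization scales and metadata."""
    return n_params * bits_per_weight / 8


def max_decode_tokens_per_second(
    n_params: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes each generated token requires streaming all weights from
    memory exactly once (a simplification of real inference).
    """
    bytes_per_token = model_size_bytes(n_params, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical setup: a 7B-parameter model on a GPU with 500 GB/s of memory bandwidth.
    for bits in (16, 8, 4):
        ceiling = max_decode_tokens_per_second(7e9, bits, 500)
        print(f"{bits}-bit: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, halving the bits per weight doubles the ceiling, which is why quantization can help throughput even when compute is unchanged. Measured numbers will normally land below this bound.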