Hacker News

More than one is easy: put it behind a load balancer, with each Ollama instance in its own container or on its own port.
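A minimal sketch of the client side of that setup, assuming two Ollama instances already running on the default port and one alternate port (the second can be started with something like `OLLAMA_HOST=127.0.0.1:11435 ollama serve`); the ports and the `next_backend` helper are illustrative, not part of Ollama itself:

```python
from itertools import cycle

# Hypothetical backends: two Ollama instances, each on its own port.
BACKENDS = [
    "http://127.0.0.1:11434",
    "http://127.0.0.1:11435",
]

_rotation = cycle(BACKENDS)

def next_backend() -> str:
    """Round-robin over the running Ollama instances."""
    return next(_rotation)
```

Point each outgoing request's base URL at `next_backend()` and you get simple round-robin balancing; a real load balancer (nginx, HAProxy) adds health checks on top of the same idea.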


FWIW, Ollama has no concurrency support, even though llama.cpp's server component (the thing Ollama is actually built on) supports it. You also can't have more than one model loaded at a time, and unloading and loading models is not free. There's a lot more to it, and much of the real optimization work is not in Ollama; it's in llama.cpp, which is completely ignored in this equation.
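For reference, a sketch of what llama.cpp's server concurrency looks like, assuming a build of the server binary (named `llama-server` in recent versions; older builds call it `server`) and a local GGUF file whose path here is a placeholder:

```shell
# Serve one model with 4 parallel slots and continuous batching,
# so several requests can be decoded together rather than queued.
./llama-server -m ./models/model.gguf \
    --parallel 4 \
    --cont-batching \
    --ctx-size 8192
```

Note that the context size is shared across the parallel slots, so with `--parallel 4` each request effectively gets a quarter of it.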


Thanks, great to know! I did not know llama.cpp could do this. It should be pretty straightforward to support; not sure why they haven't.


I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:

https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....

https://devboard.gitsense.com/ollama?r=ollama%2Follama&nb=tr...

The number of people engaging with Ollama is twice that of llama.cpp, and there hasn't been a dip in engagement with Ollama in the past six months. What I do find interesting about these two projects, though, is the number of merged pull requests: if you click the "Groups" tab and look at "Hooray", llama.cpp had 72 contributors with one or more merged pull requests vs. 25 for Ollama.

For Ollama, people are clearly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is nearly triple that of Ollama.

I know llama.cpp is VC funded, and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard work while Ollama reaps the benefits.

Full Disclosure: The tool that I used is mine.


That is still one model per instance of Ollama, right?


Yes, and I'm not sure you can do better than that. You still can't have one instance of an LLM in (GPU) memory answering two queries at the same time.


Of course you can support concurrent requests. But Ollama doesn't, and it isn't meant for that purpose, which is perfectly fine. That's not the point, though. For high-throughput scenarios, you're better off with vLLM.
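A sketch of the vLLM route, assuming a recent vLLM install (which provides the `vllm serve` entry point; older versions use `python -m vllm.entrypoints.openai.api_server`) and a model name that is purely illustrative:

```shell
# vLLM exposes an OpenAI-compatible HTTP server with continuous
# batching on by default; --max-num-seqs caps how many requests
# can be batched together at once.
vllm serve Qwen/Qwen2.5-7B-Instruct --max-num-seqs 64
```

Any OpenAI-compatible client can then send concurrent requests to the server, and vLLM schedules them into shared batches rather than serializing them.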


Thanks! This is great to know.



