This is a harsh foot-gun that seems to harm many ollama users.
That 2k default is extremely low, and ollama *silently* discards the leading context. So users have no idea that most of their data hasn’t been provided to the model.
I’ve had to add docs [0] to aider about this, and aider overrides the default to at least 8k tokens.
I’d like to do more, but unilaterally raising the context window size has performance implications for users.
Edit: Ok, aider now gives ollama users a clear warning when their chat context exceeds their ollama context window [1].
Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:
$ ollama run llama3.2
>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'
>>> /save llama3.2-32k
Created new model 'llama3.2-32k'
>>> /bye
$ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
...
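If you don't want to do this interactively, the same thing can be done with a Modelfile and `ollama create`. A minimal sketch (the llama3.2-32k name is just an example):

$ cat > Modelfile <<'EOF'
# start from the base model and raise the context window
FROM llama3.2
PARAMETER num_ctx 32768
EOF
$ ollama create llama3.2-32k -f Modelfile
$ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"

If you're calling the HTTP API instead, num_ctx can also be passed per request in the "options" object, so you don't need a separate saved model at all.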
The table in the reddit post above also shows context size vs. memory requirements for:
Params: 34.395B
Mode: infer
Not my field, but from this[1] blog post, which references this[2] paper, it would seem so. Note that the optimal approaches are a bit different between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) in order to exploit the parallelism, so you won't see the same gains if you feed it only a single prompt at a time.
Sorry this isn't more obvious. Ideally VRAM usage for the context window (the KV cache) becomes dynamic, starting small and growing with token usage, whereas right now Ollama defaults to a size of 2K which can be overridden at runtime. Great examples of this are vLLM's PagedAttention implementation [1] and Microsoft's vAttention [2], which is CUDA-specific (and there are quite a few others).
1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is to use KV cache quantization, which has recently been added behind a flag [3] and will cut the memory footprint to roughly a quarter if 4-bit KV cache quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve).
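For a rough sense of scale, here's a back-of-envelope estimate; the layer/head numbers below are assumptions for a Llama-3-8B-class model (32 layers, 8 KV heads, head dim 128), not measurements:

$ # bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element (fp16 = 2)
$ echo $((2 * 32 * 8 * 128 * 2))
131072
$ # at a 32k context that's 131072 * 32768 bytes, i.e. about 4 GiB
$ echo $((131072 * 32768 / 1024 / 1024 / 1024))
4

So a 32k context costs roughly 4 GiB of fp16 KV cache for that class of model (about 1 GiB with q4_0), and a 1M-token context would be on the order of 100+ GiB unquantized, which is why the flag matters.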
I think Apple stumbled into a problem here, and I hope they solve it: reasonably priced Macs are -- by the new standards set by modern LLMs -- severely memory-constrained. MacBook Airs max out at 24GB. MacBook Pros go to 32GB for $2200, 48GB for something like $2800, and to get to 128GB requires shelling out over $4000. A Mini can get you to 64GB for $2000. A Mac Studio can get you to 96GB for $3000, or 192GB for $5600.
In this LLM era, those are rookie numbers. It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000. I realize part of the issue is the lead time for chip design -- since Mac memory is packaged together with the chip, and the current crop were designed before running something like an LLM locally was a real possibility.
But I hope the next year or two show significant increases in the default (and possible) memory for Macs.
> It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000.
Apple is not known for leaving money on the table like that.
Also, projects like NVidia DIGITS ($2k for 128G) might make Apple unwilling to enter the market. As you said, the Studio with 192G is $5600. For purely AI purposes, two DIGITS units are a better choice, and non-AI usage doesn't need such a ludicrous amount of RAM (maybe for video, but those customers are willing to pay more).
> Apple is not known for leaving money on the table like that.
True -- although I will say the M series chips were a step change in performance and efficiency from the Intel processors they replaced, and Apple didn't charge a premium for them.
I'm not suggesting that they'll stop charging more for RAM than the industry at large -- I'm hoping they'll unbundle RAM from CPU-type. A base Mac Mini goes for $600, and adding RAM costs $200 per 8GB. That's a ridiculous premium, clearly, and at that rate my proposed Mac Mini with 256GB of RAM would go for $6600 -- which would roll my eyes until they fell out of my head.
But Apple is also leaving money on the table if they're not offering a more expensive model people would buy. A 128GB Mini, let's say, for $2000, might be that machine.
All that said, it's also a heck of a future-proof machine, so maybe the designed-obsolescence crowd have an argument to make here.
This has been the problem with a lot of long context use cases. It's not just the model's support but also sufficient compute and inference time. This is exactly why I was excited for Mamba and now possibly Lightning attention.
That said, the new DCA these models use to provide long context could be an interesting area to watch.
Ollama is an "easymode" LLM runtime, and as such has all the problems that every easymode thing has. It will assume things, and the moment you want to do anything interesting those assumptions will shoot you in the foot -- though I've found ollama plays so fast and loose that even first-party things that "should just work" do not. For example, if you run R1 (at least as of 2 days ago when I tried this) using the default `ollama run deepseek-r1:7b`, you will get a different context size, top_p and temperature vs what DeepSeek recommends in their release post.
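If you want the recommended settings, you can bake them into a local variant with a Modelfile rather than relying on the defaults. A sketch; the 0.6 / 0.95 values are placeholders from memory, so check DeepSeek's release post for the actual recommendations:

$ cat > Modelfile <<'EOF'
FROM deepseek-r1:7b
# placeholder values -- substitute whatever DeepSeek actually recommends
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
$ ollama create deepseek-r1:7b-tuned -f Modelfile

Then `ollama run deepseek-r1:7b-tuned` uses those parameters instead of ollama's defaults.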