Ollama has a num_ctx parameter that controls the context window length - it defa...

anotherpaulg · on Jan 27, 2025

This is a harsh foot-gun that seems to harm many ollama users.

That 2k default is extremely low, and ollama *silently* discards the leading context. So users have no idea that most of their data hasn’t been provided to the model.

I’ve had to add docs [0] to aider about this, and aider overrides the default to at least 8k tokens. I’d like to do more, but unilaterally raising the context window size has performance implications for users.

Edit: Ok, aider now gives ollama users a clear warning when their chat context exceeds their ollama context window [1].

[0] https://aider.chat/docs/llms/ollama.html#setting-the-context...

[1] https://github.com/Aider-AI/aider/blob/main/aider/coders/bas...

magicalhippo · on Jan 27, 2025

There are several issues in the Ollama GitHub issue tracker related to this, like this[1] or this[2].

Fortunately it's easy to create a variant of the model with increased context size using the CLI[3] and then use that variant instead.

Just be mindful that longer context means more memory required[4].

[1]: https://github.com/ollama/ollama/issues/4967

[2]: https://github.com/ollama/ollama/issues/7043

[3]: https://github.com/ollama/ollama/issues/8099#issuecomment-25...

[4]: https://www.reddit.com/r/LocalLLaMA/comments/1848puo/comment...

neuralkoi · on Jan 27, 2025

Thank you! I was looking for how to do this. The example in the issue above shows how to increase the context size in ollama:

    $ ollama run llama3.2
    >>> /set parameter num_ctx 32768
    Set parameter 'num_ctx' to '32768'
    >>> /save llama3.2-32k
    Created new model 'llama3.2-32k'
    >>> /bye
    $ ollama run llama3.2-32k "Summarize this file: $(cat README.md)"
    ...

The table in the reddit post above also shows context size vs memory requirements for Model: 01-ai/Yi-34B-200K Params: 34.395B Mode: infer

    Sequence Length vs Bit Precision Memory Requirements
       SL / BP |     4      |     6      |     8      |     16
    --------------------------------------------------------------
           256 |     16.0GB |     24.0GB |     32.1GB |     64.1GB
           512 |     16.0GB |     24.1GB |     32.1GB |     64.2GB
          1024 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
          2048 |     16.1GB |     24.2GB |     32.3GB |     64.5GB
          4096 |     16.3GB |     24.4GB |     32.5GB |     65.0GB
          8192 |     16.5GB |     24.7GB |     33.0GB |     65.9GB
         16384 |     17.0GB |     25.4GB |     33.9GB |     67.8GB
         32768 |     17.9GB |     26.8GB |     35.8GB |     71.6GB
         65536 |     19.8GB |     29.6GB |     39.5GB |     79.1GB
        131072 |     23.5GB |     35.3GB |     47.0GB |     94.1GB
    *   200000 |     27.5GB |     41.2GB |     54.9GB |    109.8GB

    * Model Max Context Size

Code: https://gist.github.com/lapp0/d28931ebc9f59838800faa7c73e3a0...

eurekin · on Jan 27, 2025

Can context be split on multiple GPUs?

magicalhippo · on Jan 27, 2025

Not my field, but from this[1] blog post which references this[2] paper, it would seem so. Note the optimal approach are a bit different between training and inference. Also note that several of the approaches rely on batching multiple requests (prompts) in order to exploit the parallelism, so won't see the same gains if fed only a single prompt at a time.

[1]: https://medium.com/@plienhar/llm-inference-series-4-kv-cachi...

[2]: https://arxiv.org/abs/2104.04473

simonw · on Jan 26, 2025

Huh! I had incorrectly assumed that was for output, not input. Thanks!

YES that was it:

  files-to-prompt \
    ~/Dropbox/Development/llm \
    -e py -c | \
  llm -m q1m 'describe this codebase in detail' \
   -o num_ctx 80000

I was watching my memory usage and it quickly maxed out my 64GB so I hit Ctrl+C before my Mac crashed.

jmorgan · on Jan 26, 2025

Sorry this isn't more obvious. Ideally VRAM usage for the context window (the KV cache) becomes dynamic, starting small and growing with token usage, whereas right now Ollama defaults to a size of 2K which can be overridden at runtime. A great example of this is vLLM's PagedAttention implementation [1] or Microsoft's vAttention [2] which is CUDA-specific (and there are quite a few others).

1M tokens will definitely require a lot of KV cache memory. One way to reduce the memory footprint is to use KV cache quantization, which has recently been added behind a flag [3] and will 1/4 the memory footprint if 4-bit KV cache quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve)

[1] https://arxiv.org/pdf/2309.06180

[2] https://github.com/microsoft/vattention

[3] https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

gcanyon · on Jan 27, 2025

I think Apple stumbled into a problem here, and I hope they solve it: reasonably priced Macs are -- by the new standards set by modern LLMs -- severely memory-constrained. MacBook Airs max out at 24GB. MacBook Pros go to 32GB for $2200, 48GB for something like $2800, and to get to 128GB requires shelling out over $4000. A Mini can get you to 64GB for $2000. A Mac Studio can get you to 96GB for $3000, or 192GB for $5600.

In this LLM era, those are rookie numbers. It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000. I realize part of the issue is the lead time for chip design -- since Mac memory is an integral part of the chip, and the current crop were designed before the idea of running something like an LLM locally was a real probability.

But I hope the next year or two show significant increases in the default (and possible) memory for Macs.

senko · on Jan 27, 2025

> It should be possible to get a Mac with a lesser processor but at least 256GB of memory for $2000.

Apple is not known for leaving money on the table like that.

Also, projects like NVidia DIGITS ($2k for 128G) might make Apple unwilling to enter the market. As you said, Studio with 192G is $5600k. For purely AI purposes, two DIGITS' are a better choice, and non-AI usages don't need such ludicros amount of RAM (maybe for video, but those customers are willing to pay more).

gcanyon · on Jan 27, 2025

> Apple is not known for leaving money on the table like that.

True -- although I will say the M series chips were a step change in performance and efficiency from the Intel processors they replaced, and Apple didn't charge a premium for them.

I'm not suggesting that they'll stop charging more for RAM than the industry at large -- I'm hoping they'll unbundle RAM from CPU-type. A base Mac Mini goes for $600, and adding RAM costs $200 per 8GB. That's a ridiculous premium, clearly, and at that rate my proposed Mac Mini with 256GB of RAM would go for $6600 -- which would roll my eyes until they fell out of my head.

But Apple is also leaving money on the table if they're not offering a more expensive model people would buy. A 128GB Mini, let's say, for $2000, might be that machine.

All that said, it's also a heck of a future-proof machine, so maybe the designed-obsolescence crowd have an argument to make here.

amrrs · on Jan 26, 2025

This has been the problem with a lot of long context use cases. It's not just the model's support but also sufficient compute and inference time. This is exactly why I was excited for Mamba and now possibly Lightning attention.

Even though the new DCA based on which these models provide long context could be an interesting area to watch;

thot_experiment · on Jan 26, 2025

Ollama is a "easymode" LLM runtime and as such has all the problems that every easymode thing has. It will assume things and the moment you want to do anything interesting those assumptions will shoot you in the foot, though I've found ollama plays so fast and loose even first party things that "should just work" do not. For example if you run R1 (at least as of 2 days ago when i tried this) using the default `ollama run deepseek-r1:7b` you will get different context size, top_p and temperature vs what Deepseek recommends in their release post.

xigency · on Jan 27, 2025

Ollama definitely is a strange beast. The sparseness of the documentation seems to imply that things will 'just work' and yet, they often don't.

rahimnathwani · on Jan 26, 2025

Yup, and this parameter is supported by the plugin he's using:

https://github.com/taketwo/llm-ollama/blob/4ccd5181c099af963...