Where needed you can make models deterministic by setting temperature to 0 with a fixed model version. You can use prompt injection honeypots to help flag malicious prompts before you pass them into your code. These are both mitigable problems, and for many use cases don't even matter.
Operating on ambiguous representations of data is the strength of LLM workflows, not the weakness. Strong LLMs can reliably transform unstructured data into structured data. This means you can use traditional code paths where predictable behavior is most beneficial and rely on LLMs for areas where their capabilities exceed what is possible with hand-written code.
> Where needed you can make models deterministic by setting temperature to 0 with a fixed model version
This isn't guaranteed - you'll still get differing responses to the same input at temperature 0. Here's one explanation. [1]
Likewise, for certain tasks, the magic of LLMs doesn't come through unless temp > 0. More specifically, with text cleanup tasks on OCR output, I've found that GPT-3.5/4 doesn't do as good a job fixing really broken output at temp=0.
That being said, you can mitigate this with proper evals and good old fashioned output validation.
It's also worth noting that whether an LLM is deterministic or not is a matter of what token is selected. If it turns out to be valuable for results to be deterministic it is a tractable problem. You just need a token selection algorithm with deterministic results, which doesn't need to be something as simple as "always pick the top result". Seeds are a thing, and are used in diffusion models for exactly that reason.
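The seed idea above can be sketched in a few lines. This is a toy illustration, not how any real inference stack is implemented: `probs` is a hypothetical stand-in for a model's next-token distribution, and seeding the RNG makes weighted sampling reproducible even at temperature > 0.

```python
import random

def sample_token(probs, seed):
    """Sample a token from a distribution; same seed -> same choice.

    `probs` maps tokens to probabilities (a toy stand-in for real
    model logits). Seeding the RNG makes the weighted selection
    reproducible, the same way seeds are used in diffusion models.
    """
    rng = random.Random(seed)
    tokens = sorted(probs)                     # fixed iteration order
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# The same seed always yields the same token for the same distribution,
# even though the selection is not "always pick the top result".
dist = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
assert sample_token(dist, seed=42) == sample_token(dist, seed=42)
```

In a real system the remaining sources of nondeterminism are in the implementation (batching, floating-point reduction order), not in the sampling step itself.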
Simply holding temp to 0 or making the selection deterministic isn't an adequate solution unless the process is always run with the same set of inputs (at which point why not run the model once on all inputs and create a map?).
Ultimately with LLMs it's not possible (or desirable) to keep inputs separate from the rest of the prompt, so changing "give me the top X" to "give me the top Y" has the potential for a wildly different result. With traditional code we can achieve reliability because we sanitize and apply bounds to inputs, which then transit through the logic in a way we can reason about. The strength and weakness of an LLM is that it mashes up the input with the rest of its data and prompts, meaning we cannot predict the output for an arbitrary set of inputs.
Correct me if I misunderstand, but your point is that even if the textual content can be made deterministic the (form/shape/type?) of the output is not deterministic?
If you are expecting a specific format you can simply check whether the LLM's output matches it, and return null or an error if it doesn't. Given the input is arbitrary text, and assuming a non-trivial transformation, traditional code would need a way of handling failure cases anyway. This means your function either way would look something like:
Item from Universal Set -> Value of Type | Null
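That signature might look like the following sketch. The JSON schema here (a `"name"` string and a `"count"` int) is purely hypothetical; the point is only the shape: arbitrary text in, validated value or `None` out.

```python
import json
from typing import Optional

def parse_llm_output(raw: str) -> Optional[dict]:
    """Validate that model output matches an expected shape.

    Expects a JSON object with a string "name" and an int "count"
    (a made-up schema for illustration). Returns None for anything
    else: Item from Universal Set -> Value of Type | Null.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, dict):
        return None
    if not isinstance(value.get("name"), str):
        return None
    if not isinstance(value.get("count"), int):
        return None
    return value

assert parse_llm_output('{"name": "widget", "count": 3}') is not None
assert parse_llm_output("not json at all") is None
```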
You need to reject the entire set of invalid inputs. However, the sets of valid and invalid inputs are often both infinite, and it's not guaranteed that the set of valid inputs is even computable. So in practice you construct a computable subset of the valid inputs and reject everything else, which means you are still rejecting an infinite number of valid inputs.
On the other hand an LLM always returns a value. Your job as a programmer using an LLM is instead to validate and narrow down the result type as much as possible. However, the way they work means that for many cases they can produce a valid output for a much wider range of inputs than you could handle with traditional code at a reasonable level of code complexity. For many tasks this is transformational.
That is specific to OpenAI and not LLMs in general. The nondeterministic part is how you sample the output. If you sample deterministically, the output will be the same every time.
I’ve had good luck with temp > 0, getting 5+ responses, and then having a mechanism that chooses the best response.
If the response is expected to be factual then the voting mechanism is just "pick the answer with the most votes". E.g., you asked for the standard deviation of some list of numbers; if four of the responses are 5.4 and one is 5.6, there are more votes for 5.4.
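The voting step itself is a one-liner over the sampled responses; a minimal sketch (the sampling of the 5+ responses is assumed to happen elsewhere):

```python
from collections import Counter

def majority_vote(responses):
    """Pick the most common response among several sampled answers."""
    counts = Counter(responses)
    winner, _ = counts.most_common(1)[0]
    return winner

# Four samples agree on 5.4 and one says 5.6, so 5.4 wins the vote.
assert majority_vote(["5.4", "5.4", "5.6", "5.4", "5.4"]) == "5.4"
```

For non-factual tasks the "choose the best response" mechanism would have to be something richer than counting, e.g. a scoring function or a second model acting as judge.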
Non-determinism is due to the implementation rather than the fundamental method. In principle a language model can be executed deterministically with any temperature you want.