Where needed you can make models deterministic by setting temperature to 0 with a fixed model version. You can use prompt injection honeypots to help flag malicious prompts before you pass them into your code. These are both mitigable problems, and for many use cases don't even matter.
Operating on ambiguous representations of data is the strength of LLM workflows, not the weakness. Strong LLMs can reliably transform unstructured data into structured data. This means you can use traditional code paths where predictable behavior is most beneficial and rely on LLMs for areas where their capabilities exceed what is possible with hand-written code.
> Where needed you can make models deterministic by setting temperature to 0 with a fixed model version
This isn't guaranteed - you'll still get differing responses to the same input at temperature 0. Here's one explanation. [1]
Likewise, for certain tasks, the magic of LLMs doesn't come through unless temp > 0. More specifically, with text cleanup tasks on OCR output, I've found that GPT-3.5/4 doesn't do as good a job fixing really broken output at temp=0.
That being said, you can mitigate this with proper evals and good old fashioned output validation.
It's also worth noting that whether an LLM is deterministic or not is a matter of what token is selected. If it turns out to be valuable for results to be deterministic it is a tractable problem. You just need a token selection algorithm with deterministic results, which doesn't need to be something as simple as "always pick the top result". Seeds are a thing, and are used in diffusion models for exactly that reason.
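The seed idea above can be sketched in a few lines. This is a toy illustration, not how any real inference stack is implemented: `probs` is a hypothetical stand-in for a model's next-token distribution, and seeding the RNG makes weighted sampling reproducible even at temperature > 0.

```python
import random

def sample_token(probs, seed):
    """Sample a token from a distribution; same seed -> same choice.

    `probs` maps tokens to probabilities (a toy stand-in for real
    model logits). Seeding the RNG makes the weighted selection
    reproducible, the same way seeds are used in diffusion models.
    """
    rng = random.Random(seed)
    tokens = sorted(probs)                     # fixed iteration order
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# The same seed always yields the same token for the same distribution,
# even though the selection is not "always pick the top result".
dist = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
assert sample_token(dist, seed=42) == sample_token(dist, seed=42)
```

In a real system the remaining sources of nondeterminism are in the implementation (batching, floating-point reduction order), not in the sampling step itself.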
Simply holding temp to 0 or making the selection deterministic isn't an adequate solution unless the process is always run with the same set of inputs (at which point why not run the model once on all inputs and create a map?).
Ultimately with LLMs it's not possible (or desirable) to keep inputs separate from the rest of the prompt, so changing "give me the top X" to "give me the top Y" has the potential for a wildly different result. With traditional code we can achieve reliability because we sanitize and apply bounds to inputs, which then transit through the logic in a way we can reason about. The strength and weakness of an LLM is that it mashes up the input with the rest of its data and prompts, meaning we cannot predict the output for an arbitrary set of inputs.
Correct me if I misunderstand, but your point is that even if the textual content can be made deterministic the (form/shape/type?) of the output is not deterministic?
If you are expecting a specific format you can simply check whether the LLM's output matches it, and return null or an error if it doesn't. Given the input is arbitrary text, and assuming a non-trivial transformation, traditional code would need a way of handling failure cases anyway. This means your function either way would look something like:
Item from Universal Set -> Value of Type | Null
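That signature might look like the following sketch. The JSON schema here (a `"name"` string and a `"count"` int) is purely hypothetical; the point is only the shape: arbitrary text in, validated value or `None` out.

```python
import json
from typing import Optional

def parse_llm_output(raw: str) -> Optional[dict]:
    """Validate that model output matches an expected shape.

    Expects a JSON object with a string "name" and an int "count"
    (a made-up schema for illustration). Returns None for anything
    else: Item from Universal Set -> Value of Type | Null.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, dict):
        return None
    if not isinstance(value.get("name"), str):
        return None
    if not isinstance(value.get("count"), int):
        return None
    return value

assert parse_llm_output('{"name": "widget", "count": 3}') is not None
assert parse_llm_output("not json at all") is None
```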
You need to reject the entire set of invalid inputs. However, the sets of valid and invalid inputs are often both infinite, and it's not guaranteed that the set of valid inputs is even computable. So in practice you construct a computable subset of the valid inputs and reject everything else, which means you are still rejecting an infinite number of valid inputs.
On the other hand an LLM always returns a value. Your job as a programmer using an LLM is instead to validate and narrow down the result type as much as possible. However, the way they work means that for many cases they can produce a valid output for a much wider range of inputs than you could handle with traditional code at a reasonable level of code complexity. For many tasks this is transformational.
That is specific to OpenAI and not LLMs in general. The nondeterministic part is how you sample the output. If you sample deterministically, the output will be the same every time.
I’ve had good luck with temp > 0, getting 5+ responses, and then having a mechanism that chooses the best response.
If the response is expected to be factual then the voting mechanism is just "pick the answer with the most votes". E.g., you asked for the standard deviation of some list of numbers; if four of the responses are 5.4 and one is 5.6, there are more votes for 5.4.
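The voting step itself is a one-liner over the sampled responses; a minimal sketch (the sampling of the 5+ responses is assumed to happen elsewhere):

```python
from collections import Counter

def majority_vote(responses):
    """Pick the most common response among several sampled answers."""
    counts = Counter(responses)
    winner, _ = counts.most_common(1)[0]
    return winner

# Four samples agree on 5.4 and one says 5.6, so 5.4 wins the vote.
assert majority_vote(["5.4", "5.4", "5.6", "5.4", "5.4"]) == "5.4"
```

For non-factual tasks the "choose the best response" mechanism would have to be something richer than counting, e.g. a scoring function or a second model acting as judge.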
Non-determinism is due to the implementation rather than the fundamental method. In principle a language model can be executed deterministically with any temperature you want.