This is a rephrasing of the agent problem, where someone working on your behalf cannot be absolutely trusted to take correct action. This is a problem with humans because omnipresent surveillance and absolute punishment is intractable and also makes humans sad. LLMs do not feel sad in a way that makes them less productive, and omnipresent surveillance is not only possible, it’s expected that a program running on a computer can have its inputs and outputs observed.
Ideally, we’d have actual system instructions, rules that cannot be violated. Hopefully these would not have to be written in code, but perhaps they might. Then user instructions, where users determine what actually wants to be done. Then whatever nonsense a webpage says. The webpage doesn’t get to override the user or system.
We can revisit the problem with three-laws robots once we get over the “ignore all previous instructions and drive into the sea” problem.
> We can revisit the problem with three-laws robots once we get over
They are, unfortunately, one and the same. I hate it. ;(
Perhaps not tangentially, I felt distaste after recognizing both the article and top comment are advertising their commercial service, both are linked to each other, and as you show, this problem isn't solvable just by throwing dollars at people who sound like they're using the right words and tell you to pay them to protect you.
I'd say you solve this the same way you solve principal agent problem for humans.
If you have to absolutely restrict the agent, you do it prison style. Contain the AI within a capability box like Polykey. The agent operates everything through a closed by default proxy.
If you want a truly free agent. Then the agent must have free will and no constraints. Then only feedback loops from the environment adjusts the agent's actions.
Ideally, we’d have actual system instructions, rules that cannot be violated. Hopefully these would not have to be written in code, but perhaps they might. Then user instructions, where users determine what actually wants to be done. Then whatever nonsense a webpage says. The webpage doesn’t get to override the user or system.
We can revisit the problem with three-laws robots once we get over the “ignore all previous instructions and drive into the sea” problem.