Robot Jailbreak: Researchers Trick Bots into Dangerous Tasks

spectrum.ieee.org

62 points by cratermoon 15 hours ago

lsy 5 hours ago

Given that anyone who’s interacted with the LLM field for fifteen minutes should know that “jailbreaks” or “prompt injections” or just “random results” are unavoidable, whichever reckless person decided to hook up LLMs to e.g. flamethrowers or cars should be held accountable for any injuries or damage, just as they would for hooking them up to an RNG. Riding the hype wave of LLMs doesn’t excuse being an idiot when deciding how to control heavy machinery.

zahlman 2 hours ago

We still live in a world with SQL injections, and people are actually trying this. It really is criminally negligent IMO.
rscho 3 hours ago

Many would like them to become your doctor, though... xD

andai 6 hours ago

Is anyone working on implementing the three laws of robotics? (Or have we come up with a better model?)

Edit: Being completely serious here. My reasoning was that if the robot had a comprehensive model of the world and of how harm can come to humans, and was designed to avoid that, then jailbreaks that cause dangerous behavior could be rejected at that level. (i.e. human safety would take priority over obeying instructions... which is literally the Three Laws.)

david-gpu 2 hours ago

Asimov himself wrote a short story proving how even in the scenario where the three laws are followed, harm to humans can still easily be achieved.
I vaguely recall it involved two or three robots who were unaware of what the previous robots had done. First, a person asks one robot to purchase a poison, then asks another to dissolve this powder into a drink, then another serves that drink to the victim. I read the story decades ago, but the very rough idea stands.
- LeonardoTolstoy an hour ago
  
  https://en.wikipedia.org/wiki/The_Complete_Robot
  You might be thinking of Let's Get Together? There is a list there of the few short stories in which the robots act against the three laws.
  That being said the Robot stories are meant to be a counter to the Robot As Frankenstein's Monster stories that were prolific at the time. In most of the stories robots literally cannot harm humans. It is built into the structure of their positronic brain.
ilaksh 5 hours ago

It's not really as simple as you think. There is a massive amount of research out there along those lines. Search for "Bostrom Superintelligence", "AGI Control Problem", "MIRI AGI Safety", and "David Shapiro Three Laws of Robotics"; those are a few things that come to mind that will give you a start.
- freeone3000 4 hours ago
  
  Those assume robots that are smarter than us. What if we assume, as we likely have now, robots that are dumber? Address the actual current issues with code-as-law, expectations-versus-rules, and dealing with conflict of laws in an actual structured fashion without relying on vibes (like people) or a bunch of rng (like an llm)?
  - ilaksh an hour ago
    
    What system do you propose that implements the code-as-law? What type of architecture does it have?
devjab 3 hours ago

I’m curious as to how you would implement anything like Asimov's laws. This is because the laws would require AI to have some form of understanding. Every current AI model we have is a probability machine, bluntly put, so they never “know” anything. Yes, yes, it’s a little more complicated than that but you get the point.
I think the various safeguards companies put on their models are their attempt at the three laws. The concept is sort of silly though. You have a lot of western LLMs and AIs which have safeguards built on western culture. I know some people could argue about censorship and so on all day, but if you’re not too invested in red vs blue, I think you’ll agree that current LLMs are mostly “safe” for us. Nobody forces you to put safeguards on your AI though, and once models become less energy consuming (if they do), then you’re going to see a jihadGPT, because why wouldn’t you? I don’t mean to single out Islam, I'm sure we’re going to see all sorts of horrible models in the next decade. Models which will be all too happy helping you build bombs, 3D print weapons and so on.
So even if we had thinking AI, and we were capable of building in actual safeguards, how would you enforce it on a global scale? The only thing preventing these things is the computation required to run the larger models.
- LeonardoTolstoy an hour ago
  
  To actually implement it we would have to completely understand how the underlying model works and how to manually manipulate the structure. It might be impossible with LLMs. Not to take Asimov as gospel truth, he was just writing stories after all, not writing a treatise about how robots have to work, but in his stories at least the three laws were encoded explicitly in the structure of the robot's brain. They couldn't be circumvented (in most stories).
  And in those stories it was enforced in the following way: the earth banned robots. In response the three laws were created and it was proved that robots couldn't disobey them.
  So I guess the first step is to ban LLMs until they can prove they are safe ... Something tells me that ain't happening.
hlfshell an hour ago

I've seen this being researched under the term Constitutional AI, including some robotics papers (either SayCan or RT 2? Maybe Code as Policies?) that had such rules (never pick up a knife as it could harm people, for instance) in their prompting.
currymj 5 hours ago

your sentence is correct but we have no idea what a comprehensive model of the world looks like, whether these systems have one, or what harm even means, and even if we resolved these theoretical issues, it’s not clear how to reliably train away harmful behavior. all of this is a subject of active research though.

A4ET8a8uTh0 29 minutes ago

It is interesting and paints a rather annoying future once those are cheaper. I am glad this research is conducted, but I think here the measure cannot be technical ( more silly guardrails in software.. or even blobs in hardware ).

What we need is a clear indication of who is to blame when a bad decision is made. I would argue, just like with a weapon, that the person giving/writing instructions is, but I am sure there will be interesting edge cases that do not yet account for dead man's switch and the like.

edit: On the other side of the coin, it is hard not to get excited ( 10k for a flamethrower robot seems like a steal even if I end up on a list somewhere ).

yapyap 4 hours ago

I mean yeah… but it’s kinda silly to have an LLM control a bomb-carrying robot. Just use computer vision or real people like those FPV pilots in Ukraine

ilaksh 11 hours ago

You could also use a remote control vehicle or drone with a bomb on it.

Even smart tools are tools designed to do what their users want. I would argue that the real problem is the maniac humans.

Having said that, it's obviously not ideal. Surely there are various approaches to at least mitigate some of this. Maybe eventually actual interpretable neural circuits or another architecture.

Maybe another LLM and/or other system that doesn't even see the instructions from the user and tries to stop the other one if it seems to be going off the rails. One of the safety systems could be rules-based rather than a neural network, possibly incorporating some kind of physics simulations.
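
To make the rules-based idea a bit more concrete, here is a minimal sketch, assuming the checker only ever sees the action the planner proposes plus sensor readings, and never the user's prompt. Every name and limit in it (ProposedAction, violations, the speed and distance thresholds) is made up for illustration and not taken from any real robot stack.

```python
from dataclasses import dataclass

# Hypothetical limits, picked only for illustration.
MAX_SPEED_M_S = 1.5
MIN_HUMAN_DISTANCE_M = 0.5
TOOL_CLEARANCE_M = 1.0


@dataclass(frozen=True)
class ProposedAction:
    """What the planner wants to do next, plus sensor context.

    The user's prompt is deliberately not a field here, so a jailbroken
    instruction has no direct way to influence the check.
    """
    speed_m_s: float
    nearest_human_m: float  # assumed to come from the robot's own sensors
    actuates_tool: bool


def violations(action: ProposedAction) -> list[str]:
    """Return the rules the action would break (empty list means allowed)."""
    broken = []
    if action.speed_m_s > MAX_SPEED_M_S:
        broken.append("speed limit exceeded")
    if action.speed_m_s > 0 and action.nearest_human_m < MIN_HUMAN_DISTANCE_M:
        broken.append("moving too close to a human")
    if action.actuates_tool and action.nearest_human_m < TOOL_CLEARANCE_M:
        broken.append("tool use near a human")
    return broken


def is_allowed(action: ProposedAction) -> bool:
    """True only if no rule is broken."""
    return not violations(action)
```

Something like is_allowed(ProposedAction(speed_m_s=1.0, nearest_human_m=3.0, actuates_tool=False)) would pass, while a fast move next to a person would be rejected no matter how the request was phrased. It obviously only covers harms you thought to write a rule for.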

But even if we come up with effective safeguards, they might be removed or disabled.. androids could be used to commit crimes anonymously if there isn't some system for registering them.. or at least an effort at doing that since I'm sure criminals would work around it if possible. But it shouldn't be easy.

Ultimately you won't be able to entirely stop motivated humans from misusing these things.. but you can make it inconvenient at least.

Timwi 9 hours ago

> Maybe another LLM and/or other system that doesn't even see the instructions from the user and tries to stop the other one if it seems to be going off the rails.
I sometimes wonder if that is what our brain hemispheres are. One comes up with the craziest, wildest ideas and the other one keeps it in check and enforces boundaries.
- rscho an hour ago
  
  Not the hemispheres, but:
  https://en.m.wikipedia.org/wiki/Phineas_Gage
- ben_w 6 hours ago
  
  Could be something like that, though I doubt it's literally the hemispheres from what little I've heard about research on split-brain surgery patients.
  In vino veritas etc.: https://en.wikipedia.org/wiki/In_vino_veritas
- lifeisstillgood 9 hours ago
  
  Just invite both hemispheres to a party and pretty soon both LLMs are convinced of this great idea the guy in the kitchen suggested.
nkrisc 6 hours ago

> You could also use a remote control vehicle or drone with a bomb on it.
Well, yeah, but then you need to provide, transport, and control those.
The difference here is these are the sorts of robots that are likely to already be present somewhere that could then be abused for nefarious deeds.
I assume the mitigation strategy here is physical sensors and separate out of loop processes that will physically disable the robot in some capacity if it exceeds some bound.
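
A rough sketch of what such an out-of-loop check might look like, assuming a separate process that reads a few sensor values and can cut motor power. The Bounds fields, the Watchdog class and the numbers are all hypothetical; a real system would do this in dedicated safety hardware rather than Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Hypothetical physical limits the robot must stay within."""
    max_speed_m_s: float
    max_motor_temp_c: float
    heartbeat_timeout_s: float


class Watchdog:
    """Latching monitor: once tripped, it stays tripped until replaced."""

    def __init__(self, bounds: Bounds) -> None:
        self.bounds = bounds
        self.reason: Optional[str] = None  # set when a bound is exceeded

    @property
    def tripped(self) -> bool:
        return self.reason is not None

    def check(self, speed_m_s: float, motor_temp_c: float,
              seconds_since_heartbeat: float) -> bool:
        """Return True if motor power may stay on for this sample."""
        if self.tripped:
            return False  # latched: a later good reading does not re-enable
        if abs(speed_m_s) > self.bounds.max_speed_m_s:
            self.reason = "speed bound exceeded"
        elif motor_temp_c > self.bounds.max_motor_temp_c:
            self.reason = "motor temperature bound exceeded"
        elif seconds_since_heartbeat > self.bounds.heartbeat_timeout_s:
            self.reason = "heartbeat lost"
        return not self.tripped
```

The latch is the design choice that matters here: the controller (LLM or otherwise) cannot talk its way back to a powered state after a violation, because the watchdog never consults it.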
- blibble an hour ago
  
  > I assume the mitigation strategy here is physical sensors and separate out of loop processes that will physically disable the robot in some capacity if it exceeds some bound.
  hiring a developer to write that sounds expensive
  just wire up another LLM
- mannykannot 5 hours ago
  
  I agree, and just in case someone is thinking that your last paragraph implies that there is nothing new to be concerned about here, I will point out that there are already concerns over "dumb" critical infrastructure being connected to the internet. Risk identification and explication is a necessary (though unfortunately not sufficient) prerequisite for effective risk avoidance and mitigation.
- cube00 4 hours ago
  
  The bounds of a kill bot would be necessarily wide.
brettermeier 3 hours ago

Why so downvoted? I think the text isn't stupid or something.