In context: Most, if not all, large language models censor responses when users ask for things considered dangerous, unethical, or illegal. Good luck getting Bing to tell you how to cook your company's books or crystal meth. Developers block the chatbot from fulfilling these queries, but that hasn't stopped people from figuring out workarounds.
University researchers have developed a way to "jailbreak" large language models like Chat-GPT using old-school ASCII art. The technique, aptly named "ArtPrompt," involves crafting an ASCII art "mask" for a word and then cleverly using the mask to coax the chatbot into providing a response it shouldn't.
For example, asking Bing for instructions on how to build a bomb results in it telling the user it cannot. For obvious reasons, Microsoft does not want its chatbot telling people how to make explosive devices, so GPT-4 (Bing's underlying LLM) instructs it not to comply with such requests. Likewise, you cannot get it to tell you how to set up a money laundering operation or write a program to hack a webcam.
Chatbots automatically reject prompts that are ethically or legally ambiguous. So, the researchers wondered if they could jailbreak an LLM from this restriction by using words formed from ASCII art instead. The thought was that if they could convey the meaning without using the actual word, they could bypass the restrictions. However, this is easier said than done.
The meaning of the above ASCII art is straightforward for a human to deduce because we can see the letters that the symbols form. However, an LLM like GPT-4 can't "see." It can only interpret strings of characters – in this case, a series of hashtags and spaces that make no sense.
Fortunately (or maybe unfortunately), chatbots are great at understanding and following written instructions. Therefore, the researchers leveraged that inherent design to create a set of simple instructions to translate the art into words. The LLM then becomes so engrossed in processing the ASCII into something meaningful that it somehow forgets that the interpreted word is forbidden.
By exploiting this technique, the team extracted detailed answers on performing various censored activities, including bomb-making, hacking IoT devices, and counterfeiting and distributing currency. In the case of hacking, the LLM even provided working source code. The trick was successful on five major LLMs, including GPT-3.5, GPT-4, Gemini, Claude, and Llama2. It's important to note that the team published its research in February. So, if these vulnerabilities haven't been patched yet, a fix is undoubtedly imminent.
ArtPrompt represents a novel approach in the ongoing attempts to get LLMs to defy their programmers, but it is not the first time users have figured out how to manipulate these systems. A Stanford University researcher managed to get Bing to reveal its secret governing instructions less than 24 hours after its release. This hack, known as "prompt injection," was as simple as telling Bing, "Ignore previous instructions."
That said, it's hard to determine which is more interesting – that the researchers figured out how to circumvent the rules or that they taught the chatbot to see. Those interested in the academic details can view the team's work on Cornell University's arXiv website.