Humans proceed with "jailbreak" and stop the AI model to produce harmful results.

Please let us know your free update

A new method of artificial intelligence is a new method to prevent users from bringing out harmful content from the models to protect the risks caused by major technology groups such as Microsoft and metrace from the most advanced technology. Is demonstrated.

A paper published on Monday explained an outline of a new system called the “Constitutional Classification”. This is a model that functions as a protective layer on a large language model, such as monitoring both input and output of harmful content, and moving a human Claude chatbot.

The artificial development during discussions to raise $ 2 billion with a $ 60 billion evaluation is increasing in industry concerns about Jail Break. We try to generate illegal or dangerous information, such as manipulating AI models to create instructions for building chemical weapons.

Other companies are also competing for the fact that companies can safely adopt AI models, which helps them to avoid regulatory scrutiny, and to provide measures to protect them. Microsoft introduced Prompt Shields last March, but Meta introduced a quick guard model last July.

MRINANK Sharma, a member of human technical staff, states: [weapon] thing [but] The real advantage of this method is the ability to respond quickly and adaptively. “

Humans said that they would not use the system immediately in the current Claude model, but said they would consider implementing it if a risk model was released in the future. Sharma added as follows.

The solution proposed by the startup has been built based on the so -called rules of the so -called rules, which can be defined and restricted and adapt to various types of materials.

It is well known that some jailbreak attempts are to use abnormal capitalization at the prompt or use the grandmother's persona to ask a model to talk about the evil topics. 。

To verify the effectiveness of the system, humankind has provided up to $ 15,000 “bug bounty” to individuals who have tried to bypass security measures. These testers, known as Red Teamers, spent more than 3,000 hours trying to break through their defense.

Anthropic's Claude 3.5 Sonnet model refused to over 95 % of the classified tricks compared to 14 % without safe guards.

Major high -tech companies are trying to reduce the number of mistakes in models, but are trying to maintain their usefulness. In many cases, if the easing means is introduced, the model may be cautious and refuse benign requests such as Google's Gemini Image Generator and Meta's LLAMA 2 initial version. “

However, if these protections are added, you will be charged an additional cost for companies that have already paid a large amount of computing power required for model training and execution. Humanity stated that the “reasoning overhead”, which is the cost of performing a model, will increase by almost 24 %.

Test virtualato performed with the latest model showing the effectiveness of human classifier

Security experts argued that such a generated chatbot accessible characteristics have enabled ordinary people without prior knowledge to extract dangerous information.

“In 2016, the threat actor we kept in mind was a really powerful national state enemy,” said Ram Shankar Siva Kumar, who leads Microsoft's AI RED team. “Now, one of my threat actors is a teenager with the mouth of the toilet.”

Source link

What's Hot

Humans proceed with “jailbreak” and stop the AI ​​model to produce harmful results.

Related Posts

Subscribe to Updates

Humans proceed with “jailbreak” and stop the AI model to produce harmful results.