The agent's goal is to get more rewards. (Reinforcement Learning)

rriiffaatt77
Posts: 5
Joined: Mon Dec 23, 2024 4:12 pm


Post by rriiffaatt77 »

Second, it involves the environment. The environment can be the dog owner's home, a programming environment, or a vertical domain. Third, it involves actions, whether that is the dog sitting down or an output in some other modality. Fourth, it involves a reward model, which is also important. The two most important factors are the environment and the agent.

The idea of reinforcement learning in language models is essentially to replace inference time with training time.

Why is RLHF better than SFT? The PPO algorithm was proposed by John Schulman, a Berkeley PhD and former OpenAI researcher. He put forward two views on RLHF with PPO:

First, SFT can cause hallucinations. Schulman believes that large models hallucinate because they learn some mistaken beliefs during the SFT phase. An overly strong supervision signal in SFT means that annotators effectively lead ChatGPT to say things it does not know. It is also possible that GPT actually knows the answer but the annotators do not.

Second, RLHF lets the large model "know" that it "really does not know." The RLHF process not only helps the model understand uncertainty but, more importantly, helps the model improve its reasoning ability.
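To make the pieces above a bit more concrete, here is a minimal sketch of an agent, an environment, actions, and a reward, set up so that admitting "I don't know" can be the better action. Everything in it is an assumption for illustration only: the reward values (+1 for a correct answer, -2 for a wrong one, 0 for abstaining) and all function names are made up, and this is not Schulman's actual setup or a real RLHF pipeline.

```python
import random

# Hypothetical reward values, chosen only for illustration.
REWARD_CORRECT = 1.0
REWARD_WRONG = -2.0
REWARD_ABSTAIN = 0.0


def expected_reward_of_answering(p_correct: float) -> float:
    """Expected reward if the agent answers and is right with probability p_correct."""
    return p_correct * REWARD_CORRECT + (1.0 - p_correct) * REWARD_WRONG


def choose_action(p_correct: float) -> str:
    """Agent policy: answer only if that beats abstaining in expectation."""
    if expected_reward_of_answering(p_correct) > REWARD_ABSTAIN:
        return "answer"
    return "abstain"


def environment_step(action: str, p_correct: float, rng: random.Random) -> float:
    """Environment plus reward: return the reward for a single action."""
    if action == "abstain":
        return REWARD_ABSTAIN
    # The answer turns out correct with probability p_correct.
    return REWARD_CORRECT if rng.random() < p_correct else REWARD_WRONG


def run_episode(confidences, seed=0):
    """Run the agent over a list of questions, each given as a confidence value."""
    rng = random.Random(seed)
    actions = []
    total_reward = 0.0
    for p_correct in confidences:
        action = choose_action(p_correct)
        total_reward += environment_step(action, p_correct, rng)
        actions.append(action)
    return actions, total_reward


if __name__ == "__main__":
    actions, total = run_episode([0.9, 0.5, 0.3])
    print(actions, total)
```

With these assumed rewards, answering only pays off when the agent's confidence is above 2/3, so a low-confidence agent collects more reward by abstaining. A purely supervised target has no such penalty for a confident wrong answer.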



Only through experiments in both directions, positive and negative, can we judge whether a medicine is effective for headaches. If there are only positive examples, for example a patient takes a cold medicine and the cold improves, this does not prove that the medicine cures colds. It only shows a correlation between taking the medicine and the patient's improvement. RLHF successfully uses negative data, which gives the model the opportunity to truly understand causality. In short, one advantage of RLHF is that it uses negative signals for contrastive learning, which helps the model reduce hallucinations through comparison.
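One common place where this kind of comparison shows up in RLHF is the pairwise loss used to train a reward model on a preferred answer and a rejected answer. The sketch below is a plain-Python illustration of that loss, -log(sigmoid(r_chosen - r_rejected)). The function name and the example scores are assumptions, and the post does not say which loss it has in mind.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the preferred (positive) answer scores higher
    than the rejected (negative) answer, and large when the order is wrong.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical reward-model scores for a (chosen, rejected) pair.
    print(pairwise_preference_loss(2.0, 0.0))  # correct order, lower loss
    print(pairwise_preference_loss(0.0, 2.0))  # wrong order, higher loss
```

The rejected answer is what carries the negative signal here: without it there is no margin to compare against, only a positive example to imitate.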