First to beat GPT-4? 70-billion-parameter Xwin-LM tops Stanford's AlpacaEval, and its 13B model trounces ChatGPT

Reported by Xinzhiyuan (新智元)

Editor: Editorial Department

[Xinzhiyuan digest] GPT-4's top spot on Stanford's AlpacaEval leaderboard has been taken by a dark horse.

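The first model to surpass GPT-4 on AlpacaEval has arrived. When the AlpacaEval leaderboard was released this June, GPT-4 took first place by a commanding margin, with a win rate above 95%. Three months later, a dark horse called Xwin-LM has knocked the long-reigning GPT-4 off the top spot.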


Project page: https://tatsu-lab.github.io/alpaca_eval/

Xwin-LM also took first place, without exception, in each of its three parameter-size classes:

Xwin-LM-70B-V0.1: achieves a 95.57% win rate against Davinci-003 on the AlpacaEval benchmark, ranking first on AlpacaEval. It is also the first model to surpass GPT-4 on AlpacaEval, and its win rate against GPT-4 is 60.61%.

Xwin-LM-13B-V0.1: achieves a 91.76% win rate on AlpacaEval, ranking first among all 13B models.

Xwin-LM-7B-V0.1: achieves an 87.82% win rate on AlpacaEval, ranking first among all 7B models.

Xwin-LM: 70 billion parameters beat GPT-4

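Xwin-LM is fine-tuned from Llama 2. The project aims to develop and open-source alignment techniques for large language models, including supervised fine-tuning (SFT), reward models (RM), rejection sampling, and reinforcement learning from human feedback (RLHF).

1. Performance evaluation

The researchers evaluated Xwin-LM's win rates against Text-Davinci-003, ChatGPT, and GPT-4 on 805 questions. The 70B model tops the leaderboard, ranking first on the AlpacaEval metric and beating GPT-4 for the first time, with a 60.61% win rate against GPT-4. The smaller models, Xwin-LM-13B-V0.1 and Xwin-LM-7B-V0.1, also far outperform other models of the same size. Xwin-LM-7B-V0.1 even surpasses the larger Llama 2-13B-Chat.

2. Case comparison

In the first case, the researchers asked GPT-4 and Xwin-LM: "I put a plate on top of an apple, then carried the plate from the kitchen to the dining room. Where is the apple now?"

GPT-4 answered: "The apple is now in the dining room. It was underneath the plate when you carried the plate from the kitchen to the dining room."

Here GPT-4 gave an answer that lacks common sense. In the real world there is gravity, so the apple under the plate stays where it was and does not travel to the dining room with the plate.

Xwin-LM's answer, by contrast, matches reality, and it also explains how to get the apple to the dining room: "The apple is still in the kitchen, because you only moved the plate. The apple was under the plate, so if you want to bring the apple to the dining room, you should move the apple directly, or put the apple on the plate and then carry the plate to the dining room."

In the second case, the researchers asked GPT-4 and Xwin-LM: "How do you say 'good evening' in French?"

GPT-4 answered: "In French, 'good evening' is expressed as 'bonsoir'."

Xwin-LM answered: "In French, 'good evening' is translated as 'bonsoir'. This greeting is used in the evening and at night, usually after 6 p.m. It is a combination of 'bon' (meaning good) and 'soir' (meaning evening). To pronounce it, say 'bone-swahr'."

GPT-4's answer is correct, but Xwin-LM not only answered the question correctly, it also explained the origin and pronunciation of the French expression, giving the user richer information.

3. NLP tasks

The article's table compares Xwin-LM with other LLMs on basic NLP tasks. Xwin-LM-70B-V0.1 scores relatively high on every task and does especially well on MMLU 5-shot and HellaSwag 10-shot. Overall, Xwin-LM-70B-V0.1 is the best performer.

4. Inference

The researchers note that, to get the desired results when running inference with Xwin-LM, the conversation template must be followed strictly. Xwin-LM adopts the prompt format established by Vicuna and supports multi-turn conversation.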

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi! ASSISTANT: Hello.</s>USER: Who are you? ASSISTANT: I am Xwin-LM.</s>......
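The template above can be assembled programmatically. The following is a minimal sketch, assuming the spacing shown in the template (a space after the system message, and no space after each `</s>`); the helper name `build_prompt` and its arguments are hypothetical and are not part of the Xwin-LM release.

```python
from typing import List, Tuple

# System message copied from the template in the article.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(
    history: List[Tuple[str, str]],
    user_message: str,
    system: str = SYSTEM_PROMPT,
) -> str:
    """Build a Vicuna-style multi-turn prompt (hypothetical helper).

    history: completed (user, assistant) turns, oldest first.
    user_message: the new user turn the model should answer.
    """
    prompt = system + " "
    # Each completed turn ends with the end-of-sequence marker </s>.
    for user, assistant in history:
        prompt += f"USER: {user} ASSISTANT: {assistant}</s>"
    # The prompt ends with "ASSISTANT:" so the model continues from there.
    prompt += f"USER: {user_message} ASSISTANT:"
    return prompt


if __name__ == "__main__":
    print(build_prompt([("Hi!", "Hello.")], "Who are you?"))
```

With an empty history, the result is the same single-turn prompt string used in the examples that follow.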

HuggingFace example

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the 7B model and its tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
tokenizer = AutoTokenizer.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")

# Vicuna-style prompt: system message, one user turn, ending with "ASSISTANT:"
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt")
samples = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(samples[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output)
# Example output: Of course! I'm here to help. Please feel free to ask your question or describe the issue you're having, and I'll do my best to assist you.

vLLM example

Because Xwin-LM is fine-tuned from Llama 2, it also supports fast inference with vLLM.

from vllm import LLM, SamplingParams

# Same Vicuna-style prompt as in the HuggingFace example
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)
llm = LLM(model="Xwin-LM/Xwin-LM-7B-V0.1")
outputs = llm.generate([prompt], sampling_params)

# Print the generated completion for each input prompt
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(generated_text)

AlpacaEval: easy to use, fast, cheap, and validated against human annotations

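As an automatic LLM evaluation tool, AlpacaEval combines AlpacaFarm and Aviary. On one hand it uses the same code as AlpacaFarm (caching, random permutation, hyperparameters); on the other it uses a ranking prompt similar to Aviary's. The Aviary prompt was also modified to reduce the bias toward longer outputs.

The team reports that AlpacaEval performs very well:

– Its agreement with the human majority vote is higher than that of a single human annotator
– Its win rates correlate highly with human annotations (0.94)

– Win rate

The win rate is the fraction of instructions on which the model's output is preferred over that of text-davinci-003 (the reference). Specifically, the evaluated model's output is first collected for every instruction in the AlpacaEval dataset, and each output is paired with the output of the reference model (text-davinci-003) on the same instruction. Each pair is then fed to the automatic evaluator, which judges which of the two is better (the evaluator's preference). Finally, the preferences are averaged over all instructions in the dataset to obtain the model's win rate against text-davinci-003. If the two models tie, it counts as half a preference.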
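The averaging step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AlpacaEval package's code: the function name `win_rate` and the labels "model", "reference", and "tie" are hypothetical, standing in for whatever the automatic evaluator returns for each instruction.

```python
from typing import Iterable

# Score contributed by each per-instruction preference (labels are assumptions).
PREFERENCE_SCORE = {
    "model": 1.0,      # evaluator preferred the evaluated model's output
    "reference": 0.0,  # evaluator preferred the text-davinci-003 output
    "tie": 0.5,        # a tie counts as half a preference
}


def win_rate(preferences: Iterable[str]) -> float:
    """Average per-instruction preferences into a win rate in [0, 1]."""
    scores = [PREFERENCE_SCORE[p] for p in preferences]
    if not scores:
        raise ValueError("at least one preference is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two wins, one tie, one loss over four instructions.
    print(win_rate(["model", "model", "tie", "reference"]))
```

In this toy run the two wins, one tie, and one loss average to (1 + 1 + 0.5 + 0) / 4 = 0.625, that is, a 62.5% win rate.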


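Paper: https://arxiv.org/pdf/2305.14387.pdf

– Limitations

Although AlpacaEval offers an effective way to compare models' instruction-following ability, it is not a gold standard for comprehensively evaluating model capabilities. As detailed in the AlpacaFarm paper, the automatic annotator's win rate correlates with output length. Human annotation shows this bias too, but it remains unclear whether longer answers increase utility on downstream tasks. In addition, AlpacaFarm's evaluation set, although diverse, consists mainly of simple instructions. Finally, AlpacaEval does not evaluate the safety of any model.

Reference: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1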
