Diverse Prompts for Fine-Tuning LLMs for Knowledge Graph Construction, Plus a Look at the 2023 Year-End Summary from the RAG Framework LangChain

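Today is December 22, 2023, in Beijing. The weather is sunny, and it is the winter solstice.

It is the end of the year, so year-end summaries are coming out one after another.

For example, LangChain, a mainstay of the RAG ecosystem, has published its 2023 summary report. Some of its numbers are interesting: which embedding tools and vector databases people use, and which aspects RAG evaluation focuses on. It is worth a look.

Some recent work has found that adding task-oriented data in the pretraining stage directly helps model capability, which makes it especially important to construct such data quickly. So we also look at some available datasets for information extraction and their corresponding diverse prompts, which are very helpful.

For your reference.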

I. Mainstream data points from LangChain's 2023 report

The report at https://blog.langchain.dev/langchain-state-of-ai-2023/ contains some interesting findings. For example:

1. Which base LLMs people prefer


2. Which vector databases people use


3. Which embedding tools people use


4. Which retrieval strategies people use

Among them:

Self Query extracts metadata filters from the user's question (https://python.langchain.com/docs/modules/data_connection/retrievers/self_query);

Hybrid Search (https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble);

Contextual Compression post-processes the results of the base retriever (https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/);


Multi Query turns a single query into multiple queries and then retrieves the results for all of them;

TimeWeighted VectorStore gives higher priority to more recent documents.
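The Multi Query idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not LangChain's implementation: `multi_query_retrieve`, `toy_rewrite`, `keyword_retrieve`, and `CORPUS` are hypothetical names, the rewrites are hard-coded where a real system would ask an LLM to generate them, and including the original question in the query list is an assumption.

```python
from typing import Callable, List

# Hypothetical toy corpus; a real system would query a vector store instead.
CORPUS = [
    "vector stores index embeddings",
    "retrievers fetch relevant documents",
    "embeddings map text to vectors",
]


def keyword_retrieve(query: str) -> List[str]:
    """Toy retriever: return documents containing any word of the query."""
    words = query.lower().split()
    return [doc for doc in CORPUS if any(w in doc.lower() for w in words)]


def toy_rewrite(question: str) -> List[str]:
    """Stand-in for an LLM that rephrases the question (fixed output here)."""
    return ["vectors", "documents"]


def multi_query_retrieve(
    question: str,
    rewrite: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for the question and each rewrite, then merge without duplicates."""
    seen = set()
    merged = []
    for query in [question] + rewrite(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep only the first occurrence of each document
                seen.add(doc)
                merged.append(doc)
    return merged


results = multi_query_retrieve("embeddings", toy_rewrite, keyword_retrieve)
```

The rewrites let the retriever reach the second corpus entry, which the original one-word question alone would miss.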


TimeWeighted VectorStore is the less familiar one. An introduction can be found at https://python.langchain.com/docs/modules/data_connection/retrievers/time_weighted_vectorstore?ref=blog.langchain.dev.

Its scoring formula is:

semantic_similarity + (1.0 - decay_rate) ^ hours_passed

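Here, hours_passed is the number of hours since the object in the retriever was last accessed. This suggests the strategy is most useful for recalling chat and conversation content.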
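To make the formula concrete, here is a minimal sketch that applies it to two made-up documents. The function name `time_weighted_score`, the sample similarities, and the `DECAY` value of 0.05 are assumptions chosen for illustration, not values taken from the report.

```python
def time_weighted_score(
    semantic_similarity: float, hours_passed: float, decay_rate: float
) -> float:
    """Score = semantic_similarity + (1.0 - decay_rate) ^ hours_passed."""
    return semantic_similarity + (1.0 - decay_rate) ** hours_passed


# (label, semantic_similarity, hours since last access); all values are made up.
docs = [
    ("old but relevant", 0.90, 72.0),
    ("fresh but less relevant", 0.70, 1.0),
]

DECAY = 0.05  # assumed decay rate for this illustration

# Rank documents by the combined score, highest first.
ranked = sorted(
    docs,
    key=lambda d: time_weighted_score(d[1], d[2], DECAY),
    reverse=True,
)
```

With these numbers the recency term of the older document has decayed to almost zero after 72 hours, so the recently accessed document ranks first despite its lower semantic similarity. A decay rate close to 0 makes the recency term stay near 1 for every document, so ranking falls back to semantic similarity.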

5. How people run their tests

6. What LangChain evaluation focuses on

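As can be seen, most people still focus mainly on the correctness of their applications rather than on toxicity, prompt leakage, or other guardrails. In addition, the low usage of exact match as an evaluation technique suggests that judging correctness is often quite complex.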
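A small sketch can show why exact match is rarely enough for judging correctness. The three checks below (`exact_match`, `normalized_match`, `contains_answer`) are hypothetical helpers written for this illustration and are not LangChain evaluators.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality."""
    return prediction == reference


def normalized_match(prediction: str, reference: str) -> bool:
    """Equality after normalization."""
    return normalize(prediction) == normalize(reference)


def contains_answer(prediction: str, reference: str) -> bool:
    """True if the normalized reference appears inside the normalized prediction."""
    return normalize(reference) in normalize(prediction)


reference = "Paris"
short_answer = "paris."
long_answer = "The capital of France is Paris."

checks = {
    "exact(short)": exact_match(short_answer, reference),
    "normalized(short)": normalized_match(short_answer, reference),
    "normalized(long)": normalized_match(long_answer, reference),
    "contains(long)": contains_answer(long_answer, reference),
}
```

Both answers are correct, yet exact match rejects them and even normalized match rejects the full sentence. Free-form LLM output usually needs a looser check, and a containment check like this one can in turn accept wrong answers that merely mention the reference.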


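II. Open datasets and diverse prompts for LLM-based information extraction

The open-source project at https://github.com/zjunlp/DeepKE/blob/main/example/llm/README_CN.md provides some good hands-on examples, which directly help build experience in using LLMs for knowledge graph information extraction and construction.

1. Datasets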

InstructIE-train: https://pan.baidu.com/s/1xXVrjkinw4cyKKFBR8BwQw?pwd=x4s7 , 300,000+ samples. The InstructIE training set, built with weak supervision, so it contains some noisy data.

InstructIE-valid: https://pan.baidu.com/s/11u_f_JT30W6B5xmUPC3enw?pwd=71ie , 2000+ samples. The InstructIE validation set.

InstructIE-test: https://pan.baidu.com/s/1JiRiOoyBVOold58zY482TA?pwd=cyr9 , 2000+ samples. The InstructIE test set.

train.json, valid.json: https://drive.google.com/file/d/1vfD4xgToVbCrFP2q-SD7iuRT2KWubIv9/view?usp=sharing , 5000 samples. The preliminary-round training and test sets from https://tianchi.aliyun.com/competition/entrance/532080/introduction .

Data sample: shown as a figure in the original post.

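Field descriptions:

Field: Description
id: unique identifier
cate: topic of the input text (12 topics in total)
input: model input text (all relation triples it mentions must be extracted)
instruction: the instruction that tells the model to perform the extraction task
output: the expected model output
entity: entities, as (entity, entity_type)
relation: relation triples mentioned in input, as (head, relation, tail)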
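As a rough illustration of how a record with these fields might look and be consumed, here is a sketch in Python. The record content is invented, and the exact nesting (top-level `entity` and `relation` lists of dictionaries, `output` as a string) is an assumption based only on the field descriptions above; check the actual files before relying on it.

```python
from typing import Dict, List, Tuple

# Fields assumed to be mandatory on every record.
REQUIRED_FIELDS = ("id", "cate", "input", "instruction", "output")

# Hypothetical record; values are invented for illustration only.
record: Dict = {
    "id": "example-0001",
    "cate": "person",
    "input": "Marie Curie was born in Warsaw.",
    "instruction": "Extract all relation triples from the input text.",
    "output": "(Marie Curie, place of birth, Warsaw)",
    "entity": [
        {"entity": "Marie Curie", "entity_type": "person"},
        {"entity": "Warsaw", "entity_type": "location"},
    ],
    "relation": [
        {"head": "Marie Curie", "relation": "place of birth", "tail": "Warsaw"},
    ],
}


def missing_fields(rec: Dict) -> List[str]:
    """Return the required field names that are absent from the record."""
    return [name for name in REQUIRED_FIELDS if name not in rec]


def triples(rec: Dict) -> List[Tuple[str, str, str]]:
    """Convert the relation list into (head, relation, tail) tuples."""
    return [(r["head"], r["relation"], r["tail"]) for r in rec.get("relation", [])]


problems = missing_fields(record)
extracted = triples(record)
```

A check like `missing_fields` is a cheap way to filter malformed rows, which matters for a weakly supervised training set that is known to contain noise.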

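2. Some diverse information extraction prompts

The project directory https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/configs/ provides prompt templates for different tasks, and the templates are diverse, which is very valuable.

For example: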

1) Entity recognition

Entity recognition aims to extract entity names of a given type.

2) Relation extraction

Relation extraction aims to extract triples that satisfy a given entity relation.

3) Event detection

Event detection aims to detect the event names mentioned in the text.

4) Event argument extraction

Event argument extraction extracts event elements according to a provided event argument structure.
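One way to use diverse templates when building fine-tuning data is to sample a different wording for each example. The sketch below shows the idea for entity recognition. The three templates and the names `NER_TEMPLATES` and `build_ner_prompt` are hypothetical and are not copied from the DeepKE configs.

```python
import random
from typing import List

# Hypothetical paraphrased templates for the same entity recognition task.
NER_TEMPLATES = [
    "Extract all entities of the types [{schema}] from the text below.\nText: {text}",
    "Read the text and list every entity whose type is one of [{schema}].\nText: {text}",
    "Text: {text}\nIdentify the entities in the text that belong to these types: [{schema}].",
]


def build_ner_prompt(text: str, schema: List[str], rng: random.Random) -> str:
    """Pick one template at random and fill in the entity types and the text."""
    template = rng.choice(NER_TEMPLATES)
    return template.format(schema=", ".join(schema), text=text)


rng = random.Random(0)  # seeded so the sampled data can be reproduced
sample_text = "Marie Curie was born in Warsaw."
sample_schema = ["person", "location"]

# Build several training prompts for the same input; the wording varies.
prompts = [build_ner_prompt(sample_text, sample_schema, rng) for _ in range(50)]
```

Varying the instruction wording while keeping the schema and the input fixed is meant to keep a fine-tuned model from overfitting to one exact phrasing.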

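Summary

This post covered two things. One is LangChain's year-end summary, from which we can glean some useful information. The other is some of the training data and usable prompts for knowledge graph construction. The design of this task-oriented data is meaningful: whether you are new to knowledge graphs or doing research on them, you can put it to use.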

References

1. https://blog.langchain.dev/langchain-state-of-ai-2023/

2. https://github.com/zjunlp/DeepKE/blob/main/example/llm/README_CN.md

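About us

Lao Liu (Liu Huanyong) is an NLP open-source enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.

The "老刘说NLP" account regularly publishes language resources, engineering practice, technical summaries, and similar content. You are welcome to follow it.

If you would like to join higher-quality discussions and sharing on knowledge graphs, event graphs, and LLM and AIGC practice, follow the official account and choose Member Community -> Join Member Group in the backend menu bar.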
