BAAI LIVE (智源LIVE) Episode 59: When Software Engineering Meets NLP, a Survey of Code Large Language Models


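In recent years, Transformer-based language models have achieved great success in natural language processing, and programming languages, viewed as a special kind of natural language, have also been widely modeled with language models. Our work is a systematic survey of language-model-based code processing and generation, covering more than 50 models, 30 downstream tasks, 170 datasets, and 700 related works.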

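We systematically trace the history of applying AI techniques to code, from n-gram models to RNNs to Transformers, and discuss in depth the recent convergence of NLP and software engineering. The latest NLP techniques, including instruction tuning, reinforcement learning, data engineering, and improvements to model architecture, have been widely applied to code processing, while the downstream tasks of software engineering pose new challenges and application opportunities for large language models. A key open challenge is how to seamlessly integrate features unique to programming languages, such as abstract syntax trees, data flow, control flow, and compiler intermediate representations, into large language models.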

Transformer-based language models have achieved huge success in natural language processing and have subsequently been applied to code, which can be viewed as a special kind of natural language. In this work we systematically review recent advancements in code processing with language models, covering 50+ models, 30+ evaluation tasks, 170+ datasets, and 700 related works.

We review the history of AI applications in code processing – from n-gram models to RNNs, and lately to Transformers – which mirrors the historical course of NLP. We provide a unified view of NLP and software engineering, observing that advanced topics from NLP have recently been introduced into code processing, including instruction tuning, reinforcement learning, data engineering, and architectural improvements, while downstream tasks in software engineering are in turn posing challenges for LLMs and driving them forward into production. It remains a key challenge to seamlessly integrate code-specific features, such as abstract syntax trees, control flow graphs, data flow graphs, and intermediate representations, into LLMs.
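As a small illustration of one such code-specific feature, the sketch below uses Python's standard `ast` module to parse a snippet into an abstract syntax tree and flatten it into a bracketed string of node types. The helper name `serialize` and the S-expression output format are assumptions made for this example; they are not a method taken from the survey, and real systems typically use richer encodings.

```python
import ast


def serialize(node: ast.AST) -> str:
    """Pre-order, S-expression-style string of AST node type names.

    Hypothetical helper: one simple way to turn a tree into a flat
    sequence that a sequence model could consume.
    """
    name = type(node).__name__
    # iter_child_nodes yields only child AST nodes, skipping plain
    # values such as identifier strings or numeric constants.
    children = [serialize(child) for child in ast.iter_child_nodes(node)]
    if not children:
        return name
    return "(" + " ".join([name] + children) + ")"


def node_types(source: str) -> set:
    """Set of AST node type names that occur anywhere in the source."""
    return {type(n).__name__ for n in ast.walk(ast.parse(source))}


if __name__ == "__main__":
    # Parse a single expression and print its linearized tree.
    print(serialize(ast.parse("a + b", mode="eval")))
    # Inspect which syntactic constructs a small function contains.
    print(sorted(node_types("def add(a, b):\n    return a + b\n")))
```

The linearized form makes the operator and operand structure explicit, whereas a flat token stream leaves it for the model to infer.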

https://github.com/codefuse-ai/Awesome-Code-LLM

https://simg.baai.ac.cn/paperfile/e3d3c624-dc64-4fa1-9bdd-f67f2964781b.pdf


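Speaker: Ziyin Zhang (张子殷) completed undergraduate studies in the Department of Computer Science at Shanghai Jiao Tong University and is currently a master's student there. Zhang's main research area is natural language processing, and Zhang is currently a research intern at Ant Group.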

 
