1. [RO] Scaling Robot Learning with Semantically Imagined Experience
2. [LG] Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
3. [CV] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
4. [CL] How Does In-Context Learning Help Prompt Tuning?
5. [LG] Modular Deep Learning
[CL] Guiding Large Language Models via Directional Stimulus Prompting
[LG] Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
[CL] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
[LG] Optical Transformers
Summary: scaling robot learning with semantically imagined experience; compositional generation with energy-based diffusion models and MCMC; open-domain visual entity recognition; how in-context learning helps prompt tuning; modular deep learning; guiding large language models via directional stimulus prompting; over-parameterization exponentially slows down gradient descent for learning a single neuron; linguistic invariances for uncertainty estimation in natural language generation; optical Transformers.
1. [RO] Scaling Robot Learning with Semantically Imagined Experience
T Yu, T Xiao, A Stone, J Tompson, A Brohan, S Wang, J Singh, C Tan, D M, J Peralta, B Ichter, K Hausman, F Xia
[Google]
Key points:
- ROSIE uses text-to-image models to augment robot learning datasets without requiring additional real-world data;
- The augmentations include semantically meaningful backgrounds, novel tasks, and distractors;
- Policies trained on ROSIE-augmented data can solve unseen tasks and are more robust to distractors and new backgrounds;
- ROSIE also improves the robustness of success detection in robot learning, especially out of distribution.
One-sentence summary:
The ROSIE system leverages text-to-image models to scale up robot learning and improve robustness without requiring additional real-world data.
Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which are challenging to scale. To mitigate this issue, we propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing to obtain meaningful data for robot learning without requiring additional robot data. We term our method Robot Learning with Semantically Imagined Experience (ROSIE). Specifically, we make use of state-of-the-art text-to-image diffusion models and perform aggressive data augmentation on top of our existing robotic manipulation datasets via inpainting various unseen objects for manipulation, backgrounds, and distractors with text guidance. Through extensive real-world experiments, we show that manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors. In addition, we find that we can improve the robustness and generalization of high-level robot learning tasks such as success detection through training with the diffusion-based data augmentation. The project’s website and videos can be found at this http URL
https://arxiv.org/abs/2302.11550
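
The augmentation described above is text-guided inpainting on frames from existing robot datasets. Below is a minimal sketch of that idea using the open-source Stable Diffusion inpainting pipeline; the checkpoint, file paths, mask source, and prompt are stand-in assumptions, not the authors' setup.

```python
# Sketch: text-guided inpainting as data augmentation for robot frames.
# Assumes a frame from an existing trajectory and a binary mask over the
# region to replace (e.g., a tabletop area for a new object or distractor).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("episode_0042/frame_000.png").convert("RGB").resize((512, 512))
mask = Image.open("episode_0042/mask_000.png").convert("L").resize((512, 512))

# The text prompt describes the imagined semantic change; the action labels
# of the original episode are reused unchanged.
prompt = "a blue ceramic mug on a wooden table, cluttered kitchen background"
augmented = pipe(prompt=prompt, image=frame, mask_image=mask).images[0]
augmented.save("episode_0042/frame_000_aug.png")
```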




2. [LG] Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
Y Du, C Durkan, R Strudel, J B. Tenenbaum, S Dieleman, R Fergus…
[MIT & Deepmind & Google Brain & INRIA]
Key points:
- Diffusion models are popular for generative modeling but can fail on compositional generation tasks;
- MCMC-inspired sampling and an energy-based parameterization improve compositional generation by enabling new ways to compose diffusion models with more powerful samplers;
- The approach yields clear improvements across a range of domains, scales, and compositional operators;
- Its effectiveness is demonstrated in settings ranging from 2D data to high-resolution text-to-image generation.
One-sentence summary:
Proposes new ways to compose and reuse diffusion models for compositional generation and guidance, improving performance via MCMC-inspired sampling and an energy-based parameterization.
Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
https://arxiv.org/abs/2302.11552
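
The key idea, composing models by summing their scores (or energies) and then relying on an MCMC sampler rather than the standard reverse-diffusion sampler alone, can be illustrated on a toy product-of-experts problem. A minimal sketch with two hand-written 2D Gaussian scores and an unadjusted Langevin sampler; the paper's annealed, Metropolis-corrected samplers and learned diffusion models are replaced by these toy stand-ins.

```python
# Toy compositional sampling: draw from p1(x) * p2(x) by summing scores
# and running unadjusted Langevin MCMC.
import numpy as np

def score_gaussian(x, mu, sigma):
    """Score (gradient of log density) of an isotropic Gaussian."""
    return -(x - mu) / sigma**2

def composed_score(x):
    # Conjunction (product of experts): the scores simply add.
    return (score_gaussian(x, mu=np.array([2.0, 0.0]), sigma=1.0)
            + score_gaussian(x, mu=np.array([-2.0, 0.0]), sigma=1.0))

def langevin_sample(score_fn, n_steps=1000, step=0.05, dim=2, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.normal(size=dim) * 3.0
    for _ in range(n_steps):
        noise = rng.normal(size=dim)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

samples = np.stack([langevin_sample(composed_score) for _ in range(500)])
print("composed mean ~", samples.mean(axis=0))  # concentrates near [0, 0]
```

In the paper the same composition is applied inside each diffusion noise level (annealed MCMC), and the energy-based parameterization makes Metropolis corrections possible.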




3. [CV] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
H Hu, Y Luan, Y Chen, U Khandelwal, M Joshi, K Lee, K Toutanova, M Chang
[Google Research]
Key points:
- Large-scale pre-trained models can perform well on open-domain visual recognition tasks;
- The OVEN-Wiki dataset challenges models to recognize among six million possible Wikipedia entities, making it a general-purpose visual recognition benchmark;
- A PaLI-based autoregressive visual recognition model performs surprisingly well, even on Wikipedia entities never seen during fine-tuning;
- CLIP-based models are better at recognizing tail entities, while PaLI-based models achieve higher overall performance.
One-sentence summary:
Large-scale pre-trained models can recognize open-domain visual concepts well, as shown by the OVEN-Wiki dataset, which challenges models to identify the correct entity among six million possible Wikipedia entities.
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find existing pretrained models yield different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
https://arxiv.org/abs/2302.11154
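
The CLIP-based baselines mentioned in the abstract treat recognition as retrieval: embed the image and rank candidate entity names by similarity. A minimal sketch of that flavor with the Hugging Face CLIP API; the checkpoint, the toy three-entity label space, and the image path are illustrative assumptions, not the paper's setup.

```python
# Sketch: open-domain entity recognition as image-to-text retrieval with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# In OVEN the label space is ~6M Wikipedia entities; a toy subset here.
entities = ["Golden Gate Bridge", "Tower Bridge", "Brooklyn Bridge"]
image = Image.open("query.jpg").convert("RGB")

inputs = processor(text=entities, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)[0]   # similarity over entity names
print(entities[int(probs.argmax())], probs.tolist())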




4. [CL] How Does In-Context Learning Help Prompt Tuning?
S Sun, Y Liu, D Iter, C Zhu, M Iyyer
[University of Massachusetts Amherst & Microsoft Research]
Key points:
- On language generation tasks, both prompt tuning and instruction prompt tuning outperform in-context learning;
- Instruction prompt tuning is more stable than prompt tuning and shows lower variance across hyperparameter changes;
- Instruction prompt tuning outperforms prompt tuning when the in-context demonstration is highly similar to the test input;
- Prompt tuning exhibits high variance, but combining prompt tuning with in-context learning reduces variance and lessens the dependence on the number of prompt embeddings.
One-sentence summary:
In-context learning can improve prompt tuning for language generation tasks, but the effectiveness of each method depends on the task and the experimental configuration.
Fine-tuning large language models is becoming ever more impractical due to their rapidly-growing scale. This motivates the use of parameter-efficient adaptation methods such as prompt tuning (PT), which adds a small number of tunable embeddings to an otherwise frozen model, and in-context learning (ICL), in which demonstrations of the task are provided to the model in natural language without any additional training. Recently, Singhal et al. (2022) propose “instruction prompt tuning” (IPT), which combines PT with ICL by concatenating a natural language demonstration with learned prompt embeddings. While all of these methods have proven effective on different tasks, how they interact with each other remains unexplored. In this paper, we empirically study when and how in-context examples improve prompt tuning by measuring the effectiveness of ICL, PT, and IPT on five text generation tasks with multiple base language models. We observe that (1) IPT does not always outperform PT, and in fact requires the in-context demonstration to be semantically similar to the test input to yield improvements; (2) PT is unstable and exhibits high variance, but combining PT and ICL (into IPT) consistently reduces variance across all five tasks; and (3) prompts learned for a specific source task via PT exhibit positive transfer when paired with in-context examples of a different target task. Our results offer actionable insights on choosing a suitable parameter-efficient adaptation method for a given task.
https://arxiv.org/abs/2302.11521
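
Prompt tuning prepends learned soft prompt embeddings to a frozen model's input; instruction prompt tuning additionally prepends a natural-language demonstration. A minimal sketch of one IPT forward/backward pass with a frozen T5; the model size, prompt length, demonstration, and example strings are illustrative assumptions, not the paper's configuration.

```python
# Sketch: instruction prompt tuning = [soft prompt ; demonstration ; input] -> frozen LM.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
for p in model.parameters():
    p.requires_grad = False            # the base LM stays frozen

n_prompt, d_model = 20, model.config.d_model
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)  # only trainable params

demo = "summarize: The cat sat on the mat. => A cat sits on a mat."
text = "summarize: Researchers propose a new parameter-efficient tuning method."
target = "A new parameter-efficient tuning method is proposed."

enc = tok(demo + " " + text, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids

tok_embeds = model.get_input_embeddings()(enc.input_ids)                # (1, L, d)
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)
attn = torch.cat([torch.ones(1, n_prompt, dtype=enc.attention_mask.dtype),
                  enc.attention_mask], dim=1)

loss = model(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
loss.backward()                        # gradients flow only into soft_prompt
print(float(loss), soft_prompt.grad.abs().sum().item())
```

Dropping the demonstration string recovers plain PT; in both cases only soft_prompt receives gradients.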




5. [LG] Modular Deep Learning
J Pfeiffer, S Ruder, I Vulić, E M Ponti
[Google Research & University of Cambridge]
Key points:
- Modular deep learning enables positive transfer and systematic generalization by separating computation from routing and by updating modules locally;
- The framework consists of autonomous, parameter-efficient modules; information is conditionally routed to a subset of modules and then aggregated;
- Modularity has further uses, including scaling language models, causal inference, program induction, and planning in reinforcement learning;
- Modular deep learning has been successfully deployed in concrete applications such as cross-lingual and cross-modal knowledge transfer.
One-sentence summary:
Modular deep learning offers a promising way to build models that specialize to multiple tasks without negative interference and generalize systematically to non-identically distributed tasks.
Transfer learning has recently become the dominant paradigm of machine learning. Pre-trained models fine-tuned for downstream tasks achieve better performance with fewer labelled examples. Nonetheless, it remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference and that generalise systematically to non-identically distributed tasks. Modular deep learning has emerged as a promising solution to these challenges. In this framework, units of computation are often implemented as autonomous parameter-efficient modules. Information is conditionally routed to a subset of modules and subsequently aggregated. These properties enable positive transfer and systematic generalisation by separating computation from routing and updating modules locally. We offer a survey of modular architectures, providing a unified view over several threads of research that evolved independently in the scientific literature. Moreover, we explore various additional purposes of modularity, including scaling language models, causal inference, programme induction, and planning in reinforcement learning. Finally, we report various concrete applications where modularity has been successfully deployed such as cross-lingual and cross-modal knowledge transfer. Talks and projects related to this survey are available at this https URL.
https://arxiv.org/abs/2302.11529
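
The computation/routing/aggregation decomposition described above can be made concrete with a small mixture-of-adapters layer: a learned router softly selects a subset of bottleneck modules whose outputs are aggregated back into the residual stream. A minimal PyTorch sketch under assumed dimensions; it is one possible instantiation of the framework, not a reference implementation from the survey.

```python
# Sketch: conditional routing over parameter-efficient adapter modules.
import torch
import torch.nn as nn

class RoutedAdapters(nn.Module):
    def __init__(self, d_model=512, d_bottleneck=64, n_modules=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_modules)        # routing function
        self.adapters = nn.ModuleList([                     # computation functions
            nn.Sequential(nn.Linear(d_model, d_bottleneck), nn.ReLU(),
                          nn.Linear(d_bottleneck, d_model))
            for _ in range(n_modules)
        ])
        self.top_k = top_k

    def forward(self, h):                                   # h: (batch, seq, d_model)
        scores = self.router(h)                             # (batch, seq, n_modules)
        topk = scores.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(scores).scatter(-1, topk.indices,
                                                   topk.values.softmax(dim=-1))
        outs = torch.stack([m(h) for m in self.adapters], dim=-1)  # (..., d_model, n_modules)
        routed = (outs * weights.unsqueeze(-2)).sum(dim=-1)        # aggregation
        return h + routed                                   # residual connection

layer = RoutedAdapters()
x = torch.randn(2, 10, 512)
print(layer(x).shape)  # torch.Size([2, 10, 512])
```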




Other papers worth noting:
[CL] Guiding Large Language Models via Directional Stimulus Prompting
Z Li, B Peng, P He, M Galley, J Gao, X Yan
[Microsoft & University of California, Santa Barbara]
Key points:
- The framework uses a policy language model to generate discrete tokens that serve as a directional stimulus for guiding the large language model;
- The generated stimulus is combined with the original input and fed into the LLM to steer its generation toward the desired target;
- The policy LM can be trained with supervised learning and reinforcement learning to explore directional stimuli that better align LLMs with human preferences;
- The framework is applicable to a variety of LMs and tasks and significantly improves performance with only a small collection of training data.
One-sentence summary:
Directional Stimulus Prompting is a new framework that uses a tunable language model to guide a black-box frozen large language model on downstream tasks, improving performance with only a small training set.
We introduce a new framework, Directional Stimulus Prompting, that uses a tuneable language model (LM) to provide guidance for the black-box frozen large language model (LLM) on downstream tasks. Unlike prior work that manually or automatically finds the optimal prompt for each task, we train a policy LM to generate discrete tokens as “directional stimulus” of each input, which is a hint/cue such as keywords of an article for summarization. The directional stimulus is then combined with the original input and fed into the LLM to guide its generation toward the desired target. The policy LM can be trained through 1) supervised learning from annotated data and 2) reinforcement learning from offline and online rewards to explore directional stimulus that better aligns LLMs with human preferences. This framework is flexibly applicable to various LMs and tasks. To verify its effectiveness, we apply our framework to summarization and dialogue response generation tasks. Experimental results demonstrate that it can significantly improve LLMs’ performance with a small collection of training data: a T5 (780M) trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex (175B)’s performance by 7.2% in ROUGE-Avg scores; 500 dialogues boost the combined score by 52.5%, achieving comparable or even better performance than fully trained models on the MultiWOZ dataset.
https://arxiv.org/abs/2302.11520
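
The pipeline is: a small tunable policy LM produces a directional stimulus (e.g., keywords) for each input, and the stimulus is spliced into the prompt of a frozen black-box LLM. A minimal sketch with two off-the-shelf Hugging Face models standing in for the (here untrained) policy LM and the frozen LLM; the checkpoints, prompt template, and example article are assumptions, and the paper additionally trains the policy LM with supervised and RL objectives and guides Codex.

```python
# Sketch: directional stimulus prompting with stand-in models.
from transformers import pipeline

policy_lm = pipeline("text2text-generation", model="t5-small")   # tunable hint generator
frozen_llm = pipeline("text-generation", model="gpt2")           # stand-in for the black-box LLM

article = ("The city council approved a new budget on Tuesday, increasing "
           "funding for public transit and delaying the road expansion project.")

# 1) The policy LM proposes keywords (the "directional stimulus") for this input.
hint = policy_lm("extract keywords: " + article, max_new_tokens=16)[0]["generated_text"]

# 2) The stimulus is combined with the original input in the frozen LLM's prompt.
prompt = f"Article: {article}\nKeywords: {hint}\nSummary:"
summary = frozen_llm(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
print(summary)
```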




[LG] Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
W Xu, S S. Du
[Tsinghua University & University of Washington]
Key points:
- Revisits the over-parameterized setting for learning a single ReLU neuron under Gaussian input, where the student network has n ≥ 2 neurons;
- Randomly initialized gradient descent converges globally at an O(T^{-3}) rate, the first global convergence result for this problem beyond the exact-parameterization setting;
- An Ω(T^{-3}) lower bound for randomly initialized gradient flow in the over-parameterized setting shows, for the first time, that over-parameterization can exponentially slow down the convergence rate;
- To prove the lower bound on the convergence rate, a novel potential function is constructed that characterizes the pairwise distances between student neurons, which is not possible in the exact-parameterization case.
One-sentence summary:
For learning a single ReLU neuron under Gaussian input and square loss, over-parameterization exponentially slows down gradient descent.
We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has n ≥ 2 neurons. We prove the global convergence of randomly initialized gradient descent with an O(T^{-3}) rate. This is the first global convergence result for this problem beyond the exact-parameterization setting (n=1), in which gradient descent enjoys an exp(−Ω(T)) rate. Perhaps surprisingly, we further present an Ω(T^{-3}) lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD’s dynamics. Along the way, we prove gradient descent automatically balances student neurons, and use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.
https://arxiv.org/abs/2302.10034
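
For concreteness, the setting in the abstract can be written out in standard notation (not copied from the paper): a teacher ReLU neuron v is approximated by an over-parameterized student with n ≥ 2 ReLU units under Gaussian inputs and square loss.

```latex
% Population square loss for learning a single ReLU neuron with an
% over-parameterized student (n >= 2 units), x ~ N(0, I_d):
L(w_1,\dots,w_n) \;=\; \frac{1}{2}\,
  \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}
  \Bigl[\Bigl(\sum_{i=1}^{n} \mathrm{ReLU}(w_i^{\top} x) - \mathrm{ReLU}(v^{\top} x)\Bigr)^{2}\Bigr]
% Rates discussed in the abstract:
%   exact parameterization (n = 1): loss decays as \exp(-\Omega(T));
%   over-parameterization (n >= 2): gradient descent achieves O(T^{-3}),
%   matched by an \Omega(T^{-3}) lower bound for gradient flow.
```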


[CL] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
L Kuhn, Y Gal, S Farquhar
[University of Oxford]
Key points:
- Because of "semantic equivalence", measuring uncertainty in natural language is challenging;
- Semantic entropy is introduced as an entropy that incorporates the linguistic invariances created by shared meanings;
- The method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models;
- In comprehensive ablation studies, semantic entropy is shown to be more predictive of model accuracy on question-answering datasets than comparable baselines.
One-sentence summary:
Proposes semantic entropy, a new way to measure the uncertainty of natural language generation models.
We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of “semantic equivalence” — different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy — an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.
https://arxiv.org/abs/2302.09664
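
The recipe behind semantic entropy is: sample several answers, cluster them by bidirectional entailment (answers that entail each other share a meaning), and compute entropy over the meaning clusters rather than over individual strings. A minimal sketch with an off-the-shelf NLI model and uniform weights over samples; the paper instead weights clusters by the generator's sequence probabilities and uses its own NLI/clustering setup, so this is only an approximation.

```python
# Sketch: semantic entropy over sampled answers via bidirectional-entailment clustering.
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_tok = AutoTokenizer.from_pretrained("microsoft/deberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-large-mnli")

def entails(a: str, b: str) -> bool:
    inputs = nli_tok(a, b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        label_id = nli(**inputs).logits.argmax(-1).item()
    return nli.config.id2label[label_id].upper() == "ENTAILMENT"

def semantic_entropy(samples):
    clusters = []                                 # lists of semantically equivalent answers
    for s in samples:
        for c in clusters:
            if entails(s, c[0]) and entails(c[0], s):   # bidirectional entailment
                c.append(s)
                break
        else:
            clusters.append([s])
    probs = [len(c) / len(samples) for c in clusters]   # uniform weight per sample
    return -sum(p * math.log(p) for p in probs)

answers = ["Paris", "The capital of France is Paris.", "It is Lyon."]
print(semantic_entropy(answers))
```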




[LG] Optical Transformers
M G. Anderson, S Ma, T Wang, L G. Wright, P L. McMahon
[Cornell University]
Key points:
- Optical matrix-vector multipliers are best suited to computations with very large operands;
- Transformer operations can run on optical hardware despite noise and errors;
- Optical energy per MAC scales as 1/d, where d is the Transformer width, an asymptotic advantage over digital systems;
- With well-engineered, large-scale optical hardware, it may be possible to achieve a more than 8,000× energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC.
One-sentence summary:
Large Transformer models could obtain significant energy-efficiency advantages by running on well-engineered, large-scale optical hardware.
The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as 1/d, where d is the Transformer width, an asymptotic advantage over digital systems. We conclude that with well-engineered, large-scale optical hardware, it may be possible to achieve a 100× energy-efficiency advantage for running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a >8,000× energy-efficiency advantage over state-of-the-art digital-electronic processors that achieve 300 fJ/MAC. We analyzed how these results motivate and inform the construction of future optical accelerators along with optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5× cheaper memory access, double the digital–analog conversion efficiency, and 4-bit precision), we estimated that optical computers’ advantage against current 300-fJ/MAC digital processors could grow to >100,000×.
https://arxiv.org/abs/2302.10360
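
To see why per-MAC optical energy scaling as 1/d is an asymptotic win, a back-of-the-envelope comparison helps: a Transformer layer performs on the order of 12·d² MACs per token, so at constant energy per MAC the digital cost grows like d², while the optical cost grows like d. The constants below are illustrative only (the 300 fJ/MAC digital figure is quoted from the abstract; the optical constant E0 is a purely hypothetical placeholder), so the printed ratios show the scaling trend, not the paper's measured numbers.

```python
# Back-of-the-envelope scaling: energy per token of one Transformer layer.
# Digital: constant energy per MAC. Optical: per-MAC energy assumed ~ E0 / d.
DIGITAL_J_PER_MAC = 300e-15   # 300 fJ/MAC (figure quoted in the abstract)
OPTICAL_E0 = 300e-15          # hypothetical constant; real value depends on the hardware

def macs_per_token(d):
    # Rough count: 4*d^2 for attention projections + 8*d^2 for the MLP block.
    return 12 * d * d

for d in (1_024, 8_192, 65_536):
    macs = macs_per_token(d)
    digital = macs * DIGITAL_J_PER_MAC      # grows like d^2
    optical = macs * (OPTICAL_E0 / d)       # grows like d
    print(f"d={d:>6}: digital {digital:.2e} J, optical {optical:.2e} J, "
          f"ratio {digital / optical:.0f}x")
```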



