LG – Machine Learning  CV – Computer Vision  CL – Computation and Language  RO – Robotics

1、[LG] LaMPP: Language Models as Probabilistic Priors for Perception and Action
2、[LG] Identifiability of latent-variable and structural-equation models: from linear to nonlinear
3、[LG] Efficient Online Reinforcement Learning with Offline Data
4、[RO] Zero-Shot Robot Manipulation from Passive Human Videos
5、[LG] How Many and Which Training Points Would Need to be Removed to Flip this Prediction?
[LG] ResMem: Learn what you can and memorize the rest
[CL] Data Selection for Language Models via Importance Resampling
[LG] Differentiable Programming of Chemical Reaction Networks
[CL] What Matters In The Structured Pruning of Generative Language Models?

Summary: language models as probabilistic priors for perception and action; identifiability of latent-variable and structural-equation models; efficient online reinforcement learning with offline data; zero-shot robot manipulation from passive human videos; how many and which training points would need to be removed to flip a prediction; learn what you can and memorize the rest; data selection for language models via importance resampling; differentiable programming of chemical reaction networks; what matters in the structured pruning of generative language models.

1、[LG] LaMPP: Language Models as Probabilistic Priors for Perception and Action

B Z. Li, W Chen, P Sharma, J Andreas
[MIT CSAIL]

Key points:

  1. LaMPP (Language Models as Probabilistic Priors) uses language models to provide probabilistic background knowledge for non-linguistic decision-making tasks;
  2. LaMPP combines uncertain background knowledge from language with structured uncertainty over non-linguistic variables, improving performance on tasks such as semantic image segmentation, robot navigation, and video action recognition;
  3. LaMPP offers a flexible, general technique for integrating linguistic supervision, and makes it possible to combine uncertain perception with noisy common-sense and domain priors.

One-sentence summary:
Proposes LaMPP (Language Models as Probabilistic Priors), a technique for integrating linguistic background knowledge into decision problems by extracting probabilistic priors from a language model; it improves generalization across diverse tasks and enables principled combination of uncertain perception with noisy common-sense and domain priors.
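The core idea, reading labels and decisions off a probabilistic graphical model whose prior comes from an LM, reduces in the simplest case to a Bayes update. A minimal illustrative sketch (the label set, prior, and likelihood values here are made up, not taken from the paper):

```python
import numpy as np

# Hypothetical 3-label example: combine a language-model prior over labels
# with a perception model's likelihoods via a Bayes update.
lm_prior = np.array([0.70, 0.25, 0.05])        # P(label) elicited from an LM
perception_lik = np.array([0.20, 0.50, 0.30])  # P(observation | label)

posterior = lm_prior * perception_lik
posterior /= posterior.sum()
# The perception model alone favors label 1, but the LM prior is strong
# enough to flip the MAP decision to label 0.
```

This is why the method helps most on rare or out-of-distribution inputs: when the perception likelihood is noisy, the LM prior dominates the posterior.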

Abstract:

Language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. This information plays a crucial role in current approaches to language processing tasks like question answering and instruction generation. We describe how to leverage language models for non-linguistic perception and control tasks. Our approach casts labeling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way. Applied to semantic segmentation, household navigation, and activity recognition tasks, this approach improves predictions on rare, out-of-distribution, and structurally novel inputs.

https://arxiv.org/abs/2302.02801

2、[LG] Identifiability of latent-variable and structural-equation models: from linear to nonlinear

A Hyvärinen, I Khemakhem, R Monti
[University of Helsinki & UCL]

Key points:

  1. Reviews the identifiability problem of linear Gaussian models in multivariate statistics, and how non-Gaussianity resolves it;
  2. Non-Gaussianity of the latent variables has been shown to provide identifiability for linear models such as factor analysis and linear regression;
  3. Even nonparametric nonlinear versions of such models can be estimated, provided there is a time series or the distributions are suitably modulated by observed auxiliary variables;
  4. Reviews identifiability theory for linear and nonlinear models, covering factor analysis and structural equation models, and discusses the role of nonlinear ICA in enabling identifiability and estimation of nonlinear SEMs.

One-sentence summary:
Reviews the identifiability problem of linear Gaussian models, shows how non-Gaussianity can provide identifiability for such models, explores how nonparametric nonlinear versions can be estimated, and discusses the role of nonlinear ICA in achieving identifiability and estimation of nonlinear structural equation models.
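The role of non-Gaussianity can be seen in a small numerical experiment: rotating independent Gaussian sources leaves every marginal Gaussian, so the rotation is unidentifiable, while rotating non-Gaussian sources changes a measurable statistic such as kurtosis, which is what ICA exploits. A sketch for illustration only (not the paper's estimators):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

gauss = rng.normal(size=(2, n))      # independent Gaussian sources
laplace = rng.laplace(size=(2, n))   # independent non-Gaussian sources

theta = np.pi / 4                    # an arbitrary rotation of the factors
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

g_mixed = R @ gauss    # marginals stay Gaussian (kurtosis ~ 0): no footprint
l_mixed = R @ laplace  # mixing pulls kurtosis toward 0, so the rotation
                       # leaves a detectable statistical footprint
```

A Laplace source has excess kurtosis 3; the 45-degree mixture lands near 1.5, which is the kind of signal ICA uses to undo the rotation.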

Abstract:

An old problem in multivariate statistics is that linear Gaussian models are often unidentifiable, i.e. some parameters cannot be uniquely estimated. In factor analysis, an orthogonal rotation of the factors is unidentifiable, while in linear regression, the direction of effect cannot be identified. For such linear models, non-Gaussianity of the (latent) variables has been shown to provide identifiability. In the case of factor analysis, this leads to independent component analysis, while in the case of the direction of effect, non-Gaussian versions of structural equation modelling solve the problem. More recently, we have shown how even general nonparametric nonlinear versions of such models can be estimated. Non-Gaussianity is not enough in this case, but assuming we have time series, or that the distributions are suitably modulated by some observed auxiliary variables, the models are identifiable. This paper reviews the identifiability theory for the linear and nonlinear cases, considering both factor analytic models and structural equation models.

https://arxiv.org/abs/2302.02672

3、[LG] Efficient Online Reinforcement Learning with Offline Data

P J. Ball, L Smith, I Kostrikov, S Levine
[University of Oxford & UC Berkeley]

Key points:

  1. Proposes a way to use offline data in online reinforcement learning (RL) to address the challenges of sample efficiency and exploration;
  2. The approach, RLPD (Reinforcement Learning with Prior Data), applies existing off-policy methods with minimal modifications to achieve reliable performance, yielding up to a 2.5× improvement over prior methods on a range of competitive benchmarks;
  3. Identifies the key design choices for successfully using offline data in online RL, including symmetric sampling, LayerNorm as a value-extrapolation regularizer, and sample-efficient learning.

One-sentence summary:
Proposes RLPD (Reinforcement Learning with Prior Data), a set of minimal but important changes to off-policy RL algorithms that delivers a 2.5× performance improvement over existing approaches across a diverse set of benchmarks, with no additional computational overhead.
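Of the design choices above, symmetric sampling is the easiest to show concretely: each training batch is drawn half from the online replay buffer and half from the offline dataset. A minimal sketch (the buffer representation and function name are our own, not from the paper's code):

```python
import random

def symmetric_sample(online_buffer, offline_buffer, batch_size):
    # RLPD-style symmetric sampling: half the batch from online replay,
    # half from the offline dataset, then shuffle the combined batch.
    half = batch_size // 2
    batch = random.sample(online_buffer, half) \
          + random.sample(offline_buffer, batch_size - half)
    random.shuffle(batch)
    return batch
```

Every gradient step thus sees a fixed 50/50 mix regardless of how large either buffer is, which is what keeps offline data useful without any explicit constraint or pessimism term.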

Abstract:

Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful approach that can be applied to address these issues is the inclusion of offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to ensure the effective use of this data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms are required to achieve reliable performance. We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We see that correct application of these simple recommendations can provide a 2.5× improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead.

https://arxiv.org/abs/2302.02948

4、[RO] Zero-Shot Robot Manipulation from Passive Human Videos

H Bharadhwaj, A Gupta, S Tulsiani, V Kumar
[CMU & Meta AI]

基于被动人工视频的零样本机器人操作

要点:

  1. 提出一种框架,从人工视频中提取与智能体无关的动作表征;
  2. 将提取的动作表征与机器人零样本操作任务中的智能体的具身进行映射;
  3. 在无法获得域内机器人操纵轨迹的情况下,成功执行了粗略操纵任务,如开合抽屉、推和使用工具等。

一句话总结:
用从人工视频中提取的与智能体无关的动作表示并将其映射到智能体具身的零样本机器人操作框架,在无法访问域内机器人操作轨迹的情况下成功执行粗略操作任务,该模型的目标条件版本可以生成动作轨迹来达到指定目标。
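The embodiment mapping amounts to transforming predicted hand waypoints into the robot end-effector frame. A deliberately simplified sketch using only a similarity transform (the actual transformation in the paper is robot- and task-specific; the names here are ours):

```python
import numpy as np

def retarget_trajectory(hand_xy, scale, offset):
    # Map predicted 2-D hand waypoints into the robot end-effector frame
    # with a similarity transform (uniform scale + translation).
    return np.asarray(hand_xy) * scale + np.asarray(offset)
```

The point of the design is that the hand-trajectory predictor never sees robot data; only this light-weight retargeting step is embodiment-specific.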

Abstract:

Can we learn robot manipulation for everyday tasks, only by watching videos of humans doing arbitrary tasks in different unstructured settings? Unlike widely adopted strategies of learning task-specific behaviors or direct imitation of a human video, we develop a framework for extracting agent-agnostic action representations from human videos, and then map them to the agent's embodiment during deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy the trained model zero-shot for physical robot manipulation tasks, after appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks like opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for action prediction information that can be acquired from diverse arbitrary videos of human activities, and be useful for zero-shot robotic manipulation in unseen scenes.

https://arxiv.org/abs/2302.02011

5、[LG] How Many and Which Training Points Would Need to be Removed to Flip this Prediction?

J Yang, S Jain, B C. Wallace
[The University of Hong Kong & AWS AI Labs & Northeastern University]

Key points:

  1. Proposes a method for identifying the minimal subset of training data whose removal would flip the prediction on a given test point, which can be useful for measuring model robustness and for contesting model predictions;
  2. Presents two efficient influence-function-based algorithms for identifying these minimal training sets, and demonstrates their effectiveness on binary text classification;
  3. Provides empirical evidence that the method captures uncertainty in a way that complements predicted probabilities, and may help individuals audit and contest the instances in the identified minimal training sets.

One-sentence summary:
The proposed minimal-training-subset method can help assess model robustness and offers a new mechanism for contesting model predictions.
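To make the question concrete, here is a toy version with a trivially simple "model" (majority vote over ±1 training labels), where the minimal flipping subset can be found greedily; the paper's influence-function algorithms address the same question for real classifiers, where brute force is intractable:

```python
def minimal_flip_subset(labels):
    # Toy stand-in: the "model" predicts the majority sign of the ±1
    # training labels; greedily remove points that support the current
    # prediction until it flips.
    pred = 1 if sum(labels) > 0 else -1
    remaining = list(labels)
    removed = []
    while remaining and (1 if sum(remaining) > 0 else -1) == pred:
        remaining.remove(pred)   # drop one supporting training point
        removed.append(pred)
    return removed
```

The size of the returned set is the robustness measure the paper studies: a prediction that flips after removing two points out of thousands deserves less confidence than one that needs hundreds removed.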

Abstract:

We consider the problem of identifying a minimal subset of training data t such that if the instances comprising t had been removed prior to training, the categorization of a given test point xt would have been different. Identifying such a set may be of interest for a few reasons. First, the cardinality of t provides a measure of robustness (if |t| is small for xt, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities. Second, interrogation of t may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in t are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying t via brute-force is intractable. We propose comparatively fast approximation methods to find t based on influence functions, and find that — for simple convex text classification models — these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the prediction.

https://arxiv.org/abs/2302.02169


Other papers worth noting:

[LG] ResMem: Learn what you can and memorize the rest

Z Yang, M Lukasik, V Nagarajan, Z Li, A S Rawat, M Zaheer, A K Menon, S Kumar
[Stanford University & Google Research]

Key points:

  1. Proposes ResMem, a new algorithm that improves model generalization by explicitly memorizing training labels with a k-nearest-neighbor-based regressor;
  2. ResMem consistently improves the test performance of neural networks, especially when the training set is very large; theoretical analysis shows that ResMem improves upon the base prediction model;
  3. ResMem augments a base predictor with a relatively simple k-nearest-neighbor component, which can be more effective than scaling up the model.

One-sentence summary:
ResMem uses a k-nearest-neighbor-based regressor to improve the test performance of neural networks, especially with very large training sets, and achieves a more favorable test risk than the base predictor in a stylized linear regression problem.
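The mechanism can be sketched in a few lines for 1-D features: store the base model's training residuals, then at prediction time add the mean residual of the k nearest training points to the base prediction (the function name and 1-D distance are our simplifications of the paper's method):

```python
import numpy as np

def resmem_predict(x_query, base_model, X_train, y_train, k=3):
    # ResMem-style sketch: final prediction = base model output plus a
    # k-NN regression over the base model's training residuals.
    residuals = y_train - base_model(X_train)
    nearest = np.argsort(np.abs(X_train - x_query))[:k]
    return base_model(x_query) + residuals[nearest].mean()
```

By construction the combined predictor reproduces the training labels exactly at the training points, which is the "memorize the rest" half of the title.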

The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model’s residuals with a k-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.

https://arxiv.org/abs/2302.01576

[CL] Data Selection for Language Models via Importance Resampling

S M Xie, S Santurkar, T Ma, P Liang
[Stanford University]

Key points:

  1. DSIR is an efficient, scalable algorithm that improves the downstream performance of general-domain models, and performs comparably to expert-curated data for domain-specific continued pretraining;
  2. KL reduction, a data metric measuring how close the selected data is to the target in a feature space, correlates strongly with downstream accuracy and enables new data-centric workflows;
  3. Choosing a suitable feature space and parameterization of the importance-weight estimator is important for DSIR; future work can explore alternatives.

One-sentence summary:
Proposes Data Selection with Importance Resampling (DSIR), an efficient, scalable data selection framework that improves the downstream performance of language models (LMs) by selecting a subset of raw text data matching a desired target distribution, estimating importance weights in a reduced feature space and resampling data according to those weights.
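A bare-bones sketch of the idea with smoothed unigram features: score each raw document by the target-to-raw probability ratio, then resample in proportion to those weights. Note that the real DSIR uses hashed n-gram features and resamples without replacement; all names here are our own:

```python
import collections, math, random

def unigram_logprob(text, counts, total, vocab_size):
    # Laplace-smoothed unigram log-probability of a whitespace-tokenized text.
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in text.split())

def dsir_select(raw_docs, target_docs, k, seed=0):
    def stats(docs):
        c = collections.Counter(w for d in docs for w in d.split())
        return c, sum(c.values())
    tgt_counts, tgt_total = stats(target_docs)
    raw_counts, raw_total = stats(raw_docs)
    vocab = len(set(tgt_counts) | set(raw_counts))
    # Importance weight of each raw document: p_target(x) / p_raw(x).
    weights = [math.exp(unigram_logprob(d, tgt_counts, tgt_total, vocab)
                        - unigram_logprob(d, raw_counts, raw_total, vocab))
               for d in raw_docs]
    # Resample raw documents in proportion to their importance weights.
    return random.Random(seed).choices(raw_docs, weights=weights, k=k)
```

Working in such a reduced feature space is what makes importance-weight estimation tractable over raw text at scale.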

Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, has high correlation with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data with importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2–2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curated data across 8 target distributions.

https://arxiv.org/abs/2302.03169

[LG] Differentiable Programming of Chemical Reaction Networks

A Mordvintsev, E Randazzo, E Niklasson
[Google Research]

Key points:

  1. Chemical reaction networks are one of the most fundamental computational substrates used by nature;
  2. The proposed approach uses differentiable optimization to design compact, sparse chemical reaction networks that solve a variety of computational problems;
  3. This differs from previous work, which focused on building chemical counterparts of conventional computational primitives and hand-assembling them into circuits;
  4. Differentiable programming of CRNs may enable efficient design of reaction circuits that can be realized on a variety of physical substrates.

One-sentence summary:
Proposes a differentiable formulation of chemical reaction networks that can be optimized to solve computational tasks; differentiable optimization combined with proper regularization can discover non-trivial sparse reaction networks implementing various oscillators and other chemical computing devices.
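Under mass-action kinetics a reaction network is just a system of ODEs; the differentiable formulation rolls such dynamics out in an autodiff framework and optimizes the rate constants. A forward-simulation sketch for the single reaction A + B → C (plain explicit Euler, no autodiff, names ours):

```python
def simulate_crn(conc, rate, dt=0.001, steps=5000):
    # Mass-action kinetics for the single reaction A + B -> C:
    # d[A]/dt = d[B]/dt = -k[A][B],  d[C]/dt = +k[A][B].
    a, b, c = conc
    for _ in range(steps):
        flux = rate * a * b * dt
        a -= flux
        b -= flux
        c += flux
    return a, b, c
```

Because every step is a smooth function of the rate constant, wrapping this loop in an autodiff framework would let gradient descent tune `rate` (and, with many candidate reactions plus a sparsity penalty, discover which reactions to keep), which is the paper's core move.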

We present a differentiable formulation of abstract chemical reaction networks (CRNs) that can be trained to solve a variety of computational tasks. Chemical reaction networks are one of the most fundamental computational substrates used by nature. We study well-mixed single-chamber systems, as well as systems with multiple chambers separated by membranes, under mass-action kinetics. We demonstrate that differentiable optimisation, combined with proper regularisation, can discover non-trivial sparse reaction networks that can implement various sorts of oscillators and other chemical computing devices.

https://arxiv.org/abs/2302.02714

[CL] What Matters In The Structured Pruning of Generative Language Models?

M Santacroce, Z Wen, Y Shen, Y Li
[Microsoft & CMU]

Key points:

  1. Random pruning is unexpectedly effective at reducing the resource usage of generative language models, performing comparably to established pruning methods;
  2. Existing structured pruning methods do not account for the distinctiveness of neurons, leaving behind excess redundancy;
  3. Proposes a framework for analyzing pruning methods in terms of uniqueness and saliency, two important criteria for preserving model quality and diversity;
  4. Proposes a new pruning method, GUM, which achieves competitive compression rates and outperforms existing methods on several natural language generation tasks.

One-sentence summary:
A comprehensive evaluation of structured pruning methods (magnitude, random, and movement pruning) on the feed-forward layers of GPT-type generative language models, and a new pruning method, Globally Unique Movement (GUM), which prunes network components based on global movement and local uniqueness scores to maximize both sensitivity and uniqueness.
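The magnitude-vs-random comparison at the heart of the evaluation is easy to sketch for a single feed-forward weight matrix, treating each row as one prunable neuron (GUM itself, which also needs movement and uniqueness scores from training, is not reproduced here):

```python
import numpy as np

def prune_neurons(W, frac, method="magnitude", seed=0):
    # Structured pruning of a feed-forward weight matrix W (rows = neurons):
    # zero out a fraction of whole rows, chosen by smallest L2 norm
    # ("magnitude") or uniformly at random ("random").
    n_prune = int(round(frac * W.shape[0]))
    if method == "magnitude":
        idx = np.argsort(np.linalg.norm(W, axis=1))[:n_prune]
    else:
        idx = np.random.default_rng(seed).choice(W.shape[0], size=n_prune,
                                                 replace=False)
    pruned = W.copy()
    pruned[idx] = 0.0
    return pruned
```

The paper's surprising finding is that for generative models the `"random"` branch often matches the `"magnitude"` branch, because magnitude scores ignore whether the surviving neurons are distinct from one another.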

Auto-regressive large language models such as GPT-3 require enormous computational resources to use. Traditionally, structured pruning methods are employed to reduce resource usage. However, their application to and efficacy for generative language models is heavily under-explored. In this paper we conduct a comprehensive evaluation of common structured pruning methods, including magnitude, random, and movement pruning on the feed-forward layers in GPT-type models. Unexpectedly, random pruning results in performance that is comparable to the best established methods, across multiple natural language generation tasks. To understand these results, we provide a framework for measuring neuron-level redundancy of models pruned by different methods, and discover that established structured pruning methods do not take into account the distinctiveness of neurons, leaving behind excess redundancies. In view of this, we introduce Globally Unique Movement (GUM) to improve the uniqueness of neurons in pruned models. We then discuss the effects of our techniques on different redundancy metrics to explain the improved performance.

https://arxiv.org/abs/2302.03773
