Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi

https://twitter.com/ZhuWanrong/status/1648021932410048512

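In-context vision and language models like Flamingo accept arbitrarily interleaved sequences of images and text as input. This format enables few-shot learning by interleaving independent supervised (image, text) examples, and it also supports more complex prompts that involve interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining is carried out on web corpora that similarly contain interleaved images and text.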

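To date, however, large-scale data in this form has not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm that places images into longer bodies of text using CLIP features, a process that we find outperforms the alternatives.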
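The image placement step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that image and sentence embeddings (for example, CLIP features) have already been computed, that similarity is measured by cosine similarity, and that each image is matched to a distinct sentence. The function names and toy vectors are hypothetical. The brute-force search is only practical for tiny inputs; a realistic version would likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_images_to_sentences(image_feats, sentence_feats):
    """Give each image a distinct sentence so that total similarity is maximal.

    Brute force over all one-to-one assignments (assumption: at least as
    many sentences as images). Returns, per image, the chosen sentence index.
    """
    sim = [[cosine(img, sent) for sent in sentence_feats] for img in image_feats]
    n_images, n_sentences = len(image_feats), len(sentence_feats)
    best_score, best_choice = float("-inf"), ()
    for choice in permutations(range(n_sentences), n_images):
        score = sum(sim[i][j] for i, j in enumerate(choice))
        if score > best_score:
            best_score, best_choice = score, choice
    return list(best_choice)


# Toy 2-D vectors standing in for real embeddings (made up for illustration).
image_feats = [[1.0, 0.0], [0.9, 0.1]]
sentence_feats = [[1.0, 0.05], [0.0, 1.0], [0.8, 0.3]]
assignment = assign_images_to_sentences(image_feats, sentence_feats)
print(assignment)  # [0, 2]
```

In this toy case both images are individually most similar to sentence 0, so a greedy per-image choice would stack them on the same sentence. The joint assignment instead places the second image next to sentence 2, which is the kind of trade-off a linear assignment formulation resolves.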

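mmc4 spans everyday topics such as cooking, travel, and technology. A manual inspection of a random sample of documents shows that the vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences that are specifically well aligned with each image (78%). After filtering out NSFW images, ads, and similar content, the corpus contains 103M documents with 585M images interleaved with 43B English tokens.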

Github: https://github.com/allenai/mmc4

Arxiv: https://arxiv.org/abs/2304.06939
