CLIP paper reading notes

Reading the CLIP paper:
Learning Transferable Visual Models From Natural Language Supervision
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.

CNNs trained to predict words in image captions learn useful image representations

learn image representations from text

I'm curious: how is this evaluated on OCR tasks?
How were CLIP's training samples prepared? How do you assemble a dataset at the scale of 400 million (image, text) pairs?

The paper claims that this kind of pre-training lets zero-shot CLIP rival models built with supervised learning. I'd put a question mark on that: it doesn't seem sufficient on domain-specific business data.

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”

MS-COCO and Visual Genome are high quality crowd-labeled datasets, but they are small by modern standards with approximately 100,000 training photos each

YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality
Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet.

we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet

We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText
Note the emphasis on class balance in the training samples.
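
A rough sketch of that per-query cap (the actual WIT pipeline is not public, so the pair structure below is hypothetical):

```python
import random
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # cap reported in the paper

def balance_by_query(pairs):
    """pairs: iterable of (query, image_url, text) tuples; cap each query's share."""
    by_query = defaultdict(list)
    for query, image_url, text in pairs:
        by_query[query].append((image_url, text))
    balanced = []
    for items in by_query.values():
        random.shuffle(items)                       # keep an arbitrary subset
        balanced.extend(items[:MAX_PAIRS_PER_QUERY])
    return balanced
```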

we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric
Yes: training on a dataset of this scale takes a daunting amount of time, so improving training efficiency is key.

Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective
This finding is interesting: it means we don't need to predict each image's exact text caption, which would be far too hard.
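
The objective CLIP lands on is a symmetric contrastive (InfoNCE) loss over a batch of aligned pairs. A minimal PyTorch sketch, following the numpy-style pseudocode in the paper:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N aligned (image, text) pairs.

    image_features, text_features: [N, d] encoder outputs.
    logit_scale: learned scalar; its value is 1/temperature.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] pairwise similarity matrix, scaled by temperature
    logits = logit_scale * image_features @ text_features.t()

    # The i-th image matches the i-th text: targets are the diagonal
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```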

Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance
Generative models come up again here. For representation learning we now have supervised CNNs, contrastive learning (CLIP), and generative models (Stable Diffusion).

We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
On a dataset this large, even larger than ImageNet, initializing from pre-trained ImageNet weights or a pre-trained text encoder is indeed unnecessary.

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling
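
A minimal sketch of that zero-shot prediction layer, assuming an open_clip-style interface (`encode_image`, `encode_text`, and a learned `logit_scale` whose exp is 1/τ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names, tokenizer):
    """Score a batch of images against a set of class names."""
    texts = tokenizer([f"A photo of a {c}." for c in class_names])

    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)

    # Cosine similarities scaled by 1/τ, then softmax over the classes
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    return logits.softmax(dim=-1)  # [n_images, n_classes]
```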

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found that using the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text
This problem arises because the model doesn't really understand language. GPT-4 can do that now, so perhaps there's a breakthrough to be had here?
Similar to the “prompt engineering” discussion around GPT3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.
This kind of prompt engineering reliably improves performance.
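
These task-specific prompts are just different format strings around the label. A hypothetical sketch (the full per-dataset template lists used by OpenAI are published in the CLIP repo; the strings below paraphrase the paper's examples):

```python
# Illustrative templates, one per task family mentioned in the paper
TEMPLATES = {
    "pets":      "A photo of a {label}, a type of pet.",
    "food":      "A photo of {label}, a type of food.",
    "aircraft":  "A photo of a {label}, a type of aircraft.",
    "ocr":       'A photo of the word "{label}".',
    "satellite": "a satellite photo of a {label}.",
    "default":   "A photo of a {label}.",
}

def build_prompts(labels, task="default"):
    template = TEMPLATES.get(task, TEMPLATES["default"])
    return [template.format(label=label) for label in labels]
```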

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as “A photo of a big {label}” and “A photo of a small {label}”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions
Here, ensembling multiple prompts is used to squeeze out more performance.
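
A sketch of that embedding-space ensemble, reusing the open_clip-style interface assumed above: average the normalized text embeddings of every prompt for a class and re-normalize, so one cached vector per class makes the ensemble as cheap as a single classifier at inference time:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_text_embedding(model, tokenizer, label, templates):
    """Average normalized prompt embeddings for one class, in embedding space."""
    texts = tokenizer([t.format(label=label) for t in templates])
    emb = F.normalize(model.encode_text(texts), dim=-1)  # [n_templates, d]
    return F.normalize(emb.mean(dim=0), dim=-1)          # one cached vector per class
```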

Put simply, where it beats supervised learning:
1. Far more training data
2. Far more task variety
3. Learning semantic information from the paired text, not just visual/spatial structure

we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), and self-driving related tasks such as German traffic sign recognition (GTSRB) and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks.
This seems a bit like GPT-4 too? The generality is impressive, but it's not all-powerful, especially on complex tasks. I suspect this is mainly a data problem, though it could also be that current architectures can't handle complex tasks directly and need them decomposed into simpler subtasks. Undeniably, it can still accelerate the labeling of business data.

First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee
Right: the hard part of few-shot learning is that you don't know which features the model has tied to the final label, so you need more samples before the model reliably finds the correct mapping.
Zero-shot here rests on large-scale pre-training, so it's a different setting from few-shot.
I'm optimistic about large models pre-trained on big data.


Actually, isn't this evaluation a bit questionable? How do you judge robustness? These are results only on the constructed test sets, not large-scale evaluations, and certainly not analyses on real business data. I think having more data per class is no bad thing: it strengthens the feature-to-label connection, and in particular helps the model latch onto the right features.

If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve
In terms of fitting capacity, the ceiling a fully supervised classifier can reach is also the ceiling zero-shot transfer can hope to achieve.
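
The fully supervised reference point here is a linear probe: fit a logistic regression on frozen CLIP image features. A minimal scikit-learn sketch (the paper sweeps the regularization strength; C=1.0 below is a placeholder):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fully supervised linear classifier on frozen image features."""
    clf = LogisticRegression(C=1.0, max_iter=1000)  # C would be tuned in practice
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```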

Learning general-purpose representations from big data and doing supervised training on a specific dataset are not in conflict.

Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size
The point here is that the performance of deep learning systems can be predicted as a function of quantities such as training compute and dataset size (scaling laws); that's genuinely interesting.
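
These scaling laws typically take a power-law form, error ≈ a · C^(-b) in training compute C, which is a straight line in log-log space. A small sketch of such a fit:

```python
import numpy as np

def fit_power_law(compute, error):
    """Fit error ~= a * compute**(-b) via linear regression in log-log space."""
    slope, log_a = np.polyfit(np.log(compute), np.log(error), deg=1)
    return np.exp(log_a), -slope  # (a, b)
```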

11月2日,以“数智未来 因你而来”为主题的鲲鹏应用创新大赛2023全国总决赛在四川成都顺利举办。经过长达6个月的层层筛选与激烈角逐,最终从三大赛事、5大赛道中评选出了13个金奖、16个银奖、19个铜奖。 鲲鹏应用创新大赛是面向全球开发者的顶级赛事&…