Contents
- Resources
- SwAV
- Problem
- Method
- Novelty of the method
- Why it works
- What is worth borrowing
- Clustering
- Multi-crop
- Code
- PCL
- Code
- Feature Alignment and Uniformity for Test Time Adaptation
- Code
- SimSiam
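Resources
Survey of deep clustering algorithms (very good; it summarizes the field from both the clustering-method side and the deep-learning side. This post may end up as a precursor to a survey of my own, which I will write up if I manage to sort things out): https://www.cnblogs.com/kailugaji/p/15574267.html
Contrastive learning code base: https://github.com/HobbitLong/PyContrast?tab=readme-ov-file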
SwAV
Paper: https://arxiv.org/pdf/2006.09882
Code: https://github.com/facebookresearch/swav
Video (a must-watch for beginners like me): https://www.bilibili.com/video/BV1754y187Ki/?spm_id_from=333.337.search-card.all.click
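Problem
Existing contrastive methods usually work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging.
Method
The paper proposes SwAV, which does not need pairwise comparisons. It clusters the data while enforcing consistency between the cluster assignments of different augmentations of the same image, instead of comparing features directly as contrastive learning does.
It also proposes a new data augmentation, multi-crop, which replaces the two full-resolution views with a mix of views at different resolutions, without a large increase in memory or compute.
Novelty of the method
No pairwise comparison is needed.
The method is memory efficient: it needs neither a large memory bank nor a special momentum network.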
Why it works
$$L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t)$$
The codes $q_t$ and $q_s$ are computed by matching a set of features to a set of $K$ prototypes.
Assumption: if two features carry the same information, the code of one can be predicted from the other feature.
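What is worth borrowing
Clustering
Before clustering, the features are projected onto the unit sphere to obtain $z_{nt}$.
Then $z_{nt}$ is mapped onto $K$ trainable prototype vectors to obtain the code.
How the code is computed
SwAV's loss sets up a swapped prediction problem: predict the code $q_t$ from the feature $z_s$, and the code $q_s$ from $z_t$. Each term is a cross-entropy loss between the code and the probability obtained by taking a softmax over the dot products between $z_i$ and all prototypes in $C$.
$$\ell(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}$$
where $p_t^{(k)}$ is the softmax over the dot products of the feature $z_t$ with each cluster centroid $c_k$, scaled by $\frac{1}{\tau}$:
$$p_t^{(k)} = \frac{\exp\left(\frac{1}{\tau} z_t^\top c_k\right)}{\sum_{k'} \exp\left(\frac{1}{\tau} z_t^\top c_{k'}\right)}$$
Substituting the expanded $p_t^{(k)}$ into the loss and splitting the logarithm gives the expression below (using $\sum_k q^{(k)} = 1$). The first two terms come from the numerator of the softmax and the last two from its denominator.
$$-\frac{1}{N}\sum_{n=1}^{N}\sum_{s,t}\left[\frac{1}{\tau} z_{nt}^\top C q_{ns} + \frac{1}{\tau} z_{ns}^\top C q_{nt} - \log\sum_{k}\exp\left(\frac{z_{nt}^\top c_k}{\tau}\right) - \log\sum_{k}\exp\left(\frac{z_{ns}^\top c_k}{\tau}\right)\right]$$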
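The swapped prediction loss above can be sketched in a few lines. This is a minimal NumPy stand-in for the PyTorch code, not the authors' implementation: the helper names `l2_normalize`, `log_softmax` and `swapped_loss` are hypothetical, the codes are set to a uniform distribution purely for illustration, and the loss is averaged over the batch.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(x):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def swapped_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t), averaged over the batch.

    z_t, z_s: (B, D) unit-norm features; q_t, q_s: (B, K) codes;
    C: (D, K) prototype matrix whose columns are the prototypes c_k.
    """
    def ell(z, q):
        log_p = log_softmax(z @ C / tau)        # log p^(k), shape (B, K)
        return -(q * log_p).sum(axis=1).mean()  # cross-entropy with the code
    return ell(z_t, q_s) + ell(z_s, q_t)

rng = np.random.default_rng(0)
B, D, K = 4, 8, 3
C = l2_normalize(rng.normal(size=(K, D))).T     # (D, K)
z_t = l2_normalize(rng.normal(size=(B, D)))
z_s = l2_normalize(rng.normal(size=(B, D)))
q_uniform = np.full((B, K), 1.0 / K)            # illustrative codes only
loss = swapped_loss(z_t, z_s, q_uniform, q_uniform, C)
```

In real training the codes come from the Sinkhorn-Knopp step rather than being uniform, and they are treated as constants when the loss is differentiated.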
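The prototypes $C$ are shared across batches, and SwAV clusters multiple instances onto them. The samples in a batch are required to be partitioned equally among the prototypes. This forces different images to get different codes and avoids the trivial solution in which every image has the same code.
A matrix $Q$ maps the features to the prototypes, and $Q$ is optimized to maximize the similarity between features and prototypes.
$$\max_{Q \in \mathcal{Q}} \operatorname{Tr}\left(Q^\top C^\top Z\right) + \varepsilon H(Q)$$
$H(\cdot)$ is the entropy, $H(Q) = -\sum_{ij} Q_{ij} \log Q_{ij}$, and it controls how smooth the mapping is. Strong entropy regularization leads to a trivial solution in which the model collapses, so $\varepsilon$ is kept small.
The matrix $Q$ is constrained as follows.
$$\mathcal{Q} = \left\{ Q \in \mathbb{R}_{+}^{K \times B} \;\middle|\; Q \mathbf{1}_B = \frac{1}{K} \mathbf{1}_K,\; Q^\top \mathbf{1}_K = \frac{1}{B} \mathbf{1}_B \right\}$$
Here $\mathbf{1}_K$ denotes the $K$-dimensional vector of ones. These constraints require that, on average, each prototype is selected at least $\frac{B}{K}$ times in the batch. The continuous solution $Q^*$ is used without discretization, because the rounding needed to obtain discrete codes is a more aggressive optimization step than a gradient update: it makes the model converge quickly but leads to a worse solution.
Over the set $\mathcal{Q}$, the solution takes the form of a normalized exponential matrix.
$$Q^* = \operatorname{Diag}(u) \exp\left(\frac{C^\top Z}{\varepsilon}\right) \operatorname{Diag}(v)$$
where $u$ and $v$ are renormalization vectors in $\mathbb{R}^K$ and $\mathbb{R}^B$ respectively. They are computed with a small number of matrix multiplications using the iterative Sinkhorn-Knopp algorithm. It is worth thinking about how $u$ and $v$ are actually obtained; as far as I remember, the code contains a Sinkhorn-Knopp routine.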
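One way to see where $u$ and $v$ come from is to run the alternating normalization directly: each row rescaling contributes to $\operatorname{Diag}(u)$ and each column rescaling to $\operatorname{Diag}(v)$. The sketch below is a single-process NumPy version under my own assumptions: the function name `sinkhorn` and its defaults are hypothetical, there is no distributed all-reduce, and the scores are shifted by their maximum before exponentiation (this only rescales the matrix by a constant, which the first normalization removes).

```python
import numpy as np

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn prototype scores (B, K) into soft codes (B, K).

    Q is kept as K-by-B, matching the notation in the notes above.
    """
    Q = np.exp((scores - scores.max()) / epsilon).T  # (K, B)
    K, B = Q.shape
    Q /= Q.sum()                                     # whole matrix sums to 1
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)            # row step (u): each prototype
        Q /= K                                       # gets total weight 1/K
        Q /= Q.sum(axis=0, keepdims=True)            # column step (v): each sample
        Q /= B                                       # gets total weight 1/B
    Q *= B                                           # each sample's code sums to 1
    return Q.T

rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=(6, 3))         # 6 samples, 3 prototypes
codes = sinkhorn(scores)
```

Because the loop ends on a column step, every row of `codes` sums to 1 exactly, while the prototype marginals are only approximately balanced after a few iterations.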
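Small batches can be used. If the batch size is too small, features from previous batches are used to enlarge $Z$ in the optimization problem above. Only the codes of the current batch features are used in the training loss.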
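The queue idea can be sketched as a first-in-first-out buffer of past features. This is an illustrative NumPy sketch only; `update_queue` and `queue_is_full` are hypothetical helper names, and the identity prototype matrix is a toy value.

```python
import numpy as np

def update_queue(queue, new_feats):
    """Shift old features back by one batch and put the newest batch in front."""
    bs = new_feats.shape[0]
    queue[bs:] = queue[:-bs].copy()
    queue[:bs] = new_feats
    return queue

def queue_is_full(queue):
    """The queue starts as zeros, so a non-zero last row means it has filled up."""
    return not np.all(queue[-1] == 0)

queue = np.zeros((4, 2))                       # room for two batches of size 2
update_queue(queue, np.array([[1.0, 1.0], [2.0, 2.0]]))
full_after_first = queue_is_full(queue)
update_queue(queue, np.array([[3.0, 3.0], [4.0, 4.0]]))
full_after_second = queue_is_full(queue)

C = np.eye(2)                                  # toy prototypes, shape (D, K)
batch_scores = np.array([[0.5, 0.1], [0.2, 0.9]])
# Queue scores are stacked in front of the current batch scores. The codes
# would be computed on all rows, and only the last `bs` rows (the current
# batch) would be kept for the loss.
scores_all = np.concatenate([queue @ C, batch_scores])
current_rows = scores_all[-batch_scores.shape[0]:]
```

In the training loop further down, the concatenation happens before the current batch is pushed into the queue, so the queue only ever contributes features from earlier iterations.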
Multi-crop
As prior work has pointed out, comparing random crops of an image plays a central role, because it captures information about the relations between parts of a scene or an object.
To keep the memory footprint manageable, each image yields two high-resolution crops and $V$ low-resolution (small) crops. Codes are computed only from the two high-resolution crops. For the loss, the features of all the other views are used to predict the codes of the two high-resolution crops.
$$L(z_{t_1}, z_{t_2}, \ldots, z_{t_{V+2}}) = \sum_{i \in \{1,2\}} \sum_{v=1}^{V+2} \mathbf{1}_{v \neq i}\, \ell(z_{t_v}, q_{t_i})$$
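The double sum can be written as two nested loops. The sketch below is a NumPy illustration under stated assumptions: `multicrop_loss` is a hypothetical name, the inputs are prototype scores $z^\top C$ per view with the two high-resolution views first, and the averaging over views follows the training code in these notes (the equation itself is an unnormalized sum).

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def multicrop_loss(scores, codes, tau=0.1):
    """scores: list of V+2 arrays (B, K), one per view, high-res views first.
    codes: list of 2 arrays (B, K), codes of the two high-res views."""
    n_views = len(scores)
    total = 0.0
    for i, q in enumerate(codes):               # i runs over the two code views
        sub = 0.0
        for v in range(n_views):
            if v == i:                          # the indicator 1_{v != i}
                continue
            log_p = log_softmax(scores[v] / tau)
            sub -= (q * log_p).sum(axis=1).mean()
        total += sub / (n_views - 1)
    return total / len(codes)

B, K, V = 4, 5, 3
flat_scores = [np.zeros((B, K)) for _ in range(V + 2)]   # uniform predictions
uniform_codes = [np.full((B, K), 1.0 / K) for _ in range(2)]
loss = multicrop_loss(flat_scores, uniform_codes)
```

With uniform predictions and uniform codes every term equals $\log K$, which makes the normalization easy to check by hand.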
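Code
Below is the training code. For each crop id listed in crops_for_assign (the high-resolution crops), it computes the cluster assignment q (the Q of the paper) from that crop's prototype scores (the variable out). For every other crop, it takes the prototype scores from output, divides them by the temperature and applies a softmax to get the prediction p. The loss is the consistency (cross-entropy) between q and these predictions.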
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# args, logger, AverageMeter and distributed_sinkhorn are defined elsewhere
# in the SwAV repository; apex is only needed when args.use_fp16 is set.


def train(train_loader, model, optimizer, epoch, lr_schedule, queue):
    # meters for batch time, data loading time and loss
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()

    model.train()  # training mode
    use_the_queue = False  # becomes True once the queue has been filled

    end = time.time()
    for it, inputs in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        # update learning rate (lr_schedule holds one value per iteration)
        iteration = epoch * len(train_loader) + it
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr_schedule[iteration]

        # normalize the prototypes (L2-normalize each prototype vector)
        with torch.no_grad():
            w = model.module.prototypes.weight.data.clone()
            w = nn.functional.normalize(w, dim=1, p=2)
            model.module.prototypes.weight.copy_(w)

        # ============ multi-res forward passes ... ============
        # embedding: projected features, output: prototype scores
        embedding, output = model(inputs)
        # the embedding is only used below to fill the queue, so it is detached
        embedding = embedding.detach()
        bs = inputs[0].size(0)

        # ============ swav loss ... ============
        loss = 0
        # loop over the crops used to compute codes (the high-resolution ones)
        for i, crop_id in enumerate(args.crops_for_assign):
            with torch.no_grad():
                # scores of crop `crop_id`: a block of bs rows in output
                out = output[bs * crop_id: bs * (crop_id + 1)].detach()

                # time to use the queue
                # The queue enlarges the set of features seen by Sinkhorn when
                # the batch is small. Once it is full, the queued embeddings
                # are scored against the prototypes and stacked before `out`.
                if queue is not None:
                    if use_the_queue or not torch.all(queue[i, -1, :] == 0):
                        use_the_queue = True
                        out = torch.cat((torch.mm(
                            queue[i],  # stored embeddings
                            model.module.prototypes.weight.t()  # prototypes C
                        ), out))
                    # fill the queue: shift old entries, insert the new ones
                    queue[i, bs:] = queue[i, :-bs].clone()
                    queue[i, :bs] = embedding[crop_id * bs: (crop_id + 1) * bs]

                # get assignments: keep only the codes of the current batch
                q = distributed_sinkhorn(out)[-bs:]

            # cluster assignment prediction: cross-entropy between q and the
            # softmax prediction of every other crop
            subloss = 0
            for v in np.delete(np.arange(np.sum(args.nmb_crops)), crop_id):
                x = output[bs * v: bs * (v + 1)] / args.temperature
                subloss -= torch.mean(torch.sum(q * F.log_softmax(x, dim=1), dim=1))
            loss += subloss / (np.sum(args.nmb_crops) - 1)
        loss /= len(args.crops_for_assign)

        # ============ backward and optim step ... ============
        optimizer.zero_grad()
        if args.use_fp16:
            with apex.amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            loss.backward()
        # cancel gradients for the prototypes
        if iteration < args.freeze_prototypes_niters:
            for name, p in model.named_parameters():
                if "prototypes" in name:
                    p.grad = None
        optimizer.step()

        # ============ misc ... ============
        losses.update(loss.item(), inputs[0].size(0))
        batch_time.update(time.time() - end)
        end = time.time()
        if args.rank == 0 and it % 50 == 0:
            logger.info(
                "Epoch: [{0}][{1}]\t"
                "Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
                "Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
                "Loss {loss.val:.4f} ({loss.avg:.4f})\t"
                "Lr: {lr:.4f}".format(
                    epoch,
                    it,
                    batch_time=batch_time,
                    data_time=data_time,
                    loss=losses,
                    lr=optimizer.optim.param_groups[0]["lr"],
                )
            )
    return (epoch, losses.avg), queue
Below is the code that computes the cluster assignments.
import torch
import torch.distributed as dist


# Computes the optimal-transport assignment Q that maps features to prototypes C.
@torch.no_grad()
def distributed_sinkhorn(out):
    # exponentiate the scores and transpose to get a K x B matrix
    Q = torch.exp(out / args.epsilon).t()  # Q is K-by-B for consistency with notations from our paper
    B = Q.shape[1] * args.world_size  # number of samples to assign (over all processes)
    K = Q.shape[0]  # how many prototypes

    # make the matrix sums to 1
    sum_Q = torch.sum(Q)
    dist.all_reduce(sum_Q)
    Q /= sum_Q

    for it in range(args.sinkhorn_iterations):
        # normalize each row: total weight per prototype must be 1/K
        sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
        dist.all_reduce(sum_of_rows)
        Q /= sum_of_rows
        Q /= K

        # normalize each column: total weight per sample must be 1/B
        Q /= torch.sum(Q, dim=0, keepdim=True)
        Q /= B

    Q *= B  # the columns must sum to 1 so that Q is an assignment
    return Q.t()
The ResNet code in the repository adds a projection head and a prototype layer on top of the backbone.
PCL
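Paper: https://arxiv.org/pdf/2005.04966
Code: https://github.com/salesforce/PCL
A prototype is defined as "a representative embedding for a group of semantically similar instances". This definition is the essence of the whole paper and the center everything else revolves around.
This paper does contrastive learning with prototypes.
For K-means with the best value of K, the resulting classes should minimize the intra-class distance and maximize the inter-class distance.
The task of unsupervised visual representation learning is to learn an embedding function $f_\theta$ that maps $X = \{x_1, x_2, \ldots, x_n\}$ to $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i = f_\theta(x_i)$.
The paper uses prototypes $c$ in place of $v'$, and a per-prototype concentration estimate $\phi$ in place of the fixed temperature $\tau$, which gives a prototypical contrastive loss (ProtoNCE). This is one of the novel points.
The concentration is measured with the L2 distances from the features in a cluster to its prototype. For a cluster with $Z$ features $v'_z$, prototype $c$ and smoothing constant $\alpha$:
$$\phi = \frac{\sum_{z=1}^{Z} \lVert v'_z - c \rVert_2}{Z \log(Z + \alpha)}$$
The effect is that for points around a loose cluster (large $\phi$), the similarity is scaled down, so the embedding has to be pulled closer to the prototype.
Conversely, for points around a tight cluster (small $\phi$), the similarity is scaled up, so the features are less encouraged to move toward that prototype.
ProtoNCE therefore produces more balanced clusters with similar concentration.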
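The effect of the concentration estimate can be checked numerically. This is a minimal NumPy sketch, assuming the mean-distance form that the run_kmeans code in these notes uses (with the constant 10 inside the logarithm); `concentration` is a hypothetical helper name and the data are synthetic.

```python
import numpy as np

def concentration(features, centroid, alpha=10.0):
    """phi = mean L2 distance to the centroid, divided by log(Z + alpha)."""
    dists = np.linalg.norm(features - centroid, axis=1)
    return dists.mean() / np.log(len(features) + alpha)

rng = np.random.default_rng(0)
centroid = np.zeros(2)
tight = rng.normal(scale=0.1, size=(50, 2))   # points close to the centroid
loose = tight * 10.0                          # same shape, ten times more spread

phi_tight = concentration(tight, centroid)
phi_loose = concentration(loose, centroid)

# The same feature-prototype similarity, scaled by each cluster's phi.
v = np.array([0.6, 0.8])
proto = np.array([1.0, 0.0])
logit_tight = (v @ proto) / phi_tight
logit_loose = (v @ proto) / phi_loose
```

The loose cluster gets the larger phi and therefore the smaller logit for the same raw similarity, which is the scaling described above.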
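The rest of the paper fits this method into the EM algorithm, and the derivation is almost identical to the standard EM derivation. If you are interested, see the derivation of EM: https://zhuanlan.zhihu.com/p/36331115
The last part is a mutual-information analysis. It explains, from the mutual-information point of view, why ProtoNCE is better than InfoNCE.
The main advantages are:
1. ProtoNCE ignores instance-level noise and can learn higher-level semantic features.
2. Compared with InfoNCE, the mutual information between prototypes and labels is larger. This benefit comes from effective clustering.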
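Code
Clustering is done mainly with the Clustering object (clus) of the faiss library. The code is a modification of the MoCo code.
The first part of the model code is the same as MoCo. The main change is at the end, where the cluster information passed in as input is used together with the features to compute the prototypical contrast terms.
Understanding the following piece of code is the main thing.
During training, the features are first clustered with faiss, and for each image the prototype indexed by its cluster assignment is used as the positive sample. proto_logits is the product between the current features and the selected prototypes, which acts as a similarity. What I did not understand at first was why the labels are generated with linspace. Think of the logit matrix with samples along the rows and the selected prototypes along the columns. pos_prototypes is gathered with each sample's cluster index, so row i of the selected prototypes is the positive prototype of sample i, and the target of sample i is simply column i. Labels generated by linspace (0, 1, ..., N-1) are therefore correct. The other point is the density. This "density" is obtained from the L2 distance between the features and the cluster centroid, so a larger value means a looser cluster and a smaller value a tighter one. It is used as the denominator, so the larger the value, the smaller the logit, the larger the gap to the label, and the higher loss forces the model to pull the feature closer. This is my guess.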
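The label construction can be reproduced on toy data. This is a NumPy sketch with made-up sizes and cluster ids; none of the values come from the PCL repository, and `np.arange` plays the role of the linspace call.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, n_clusters, r = 4, 8, 6, 2                  # toy sizes (assumptions)
prototypes = rng.normal(size=(n_clusters, D))
q = rng.normal(size=(N, D))                       # features of the current batch
pos_proto_id = np.array([3, 0, 3, 5])             # cluster id of each sample

pos_prototypes = prototypes[pos_proto_id]         # (N, D): row i is sample i's positive
neg_candidates = sorted(set(range(n_clusters)) - set(pos_proto_id.tolist()))
neg_proto_id = rng.choice(neg_candidates, size=r, replace=False)
neg_prototypes = prototypes[neg_proto_id]         # (r, D)

proto_selected = np.concatenate([pos_prototypes, neg_prototypes])  # (N + r, D)
logits_proto = q @ proto_selected.T               # (N, N + r)
labels_proto = np.arange(N)                       # sample i -> column i
```

Because the positives are stacked in batch order, the target column of sample i is i. In this toy batch samples 0 and 2 share cluster 3, so columns 0 and 2 hold the same prototype, yet each row still has a single target column.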
        # excerpt from MoCo.forward (the full class follows below)
        if cluster_result is not None:  # run only when a clustering result is provided
            # lists of prototype labels and logits, one entry per clustering
            proto_labels = []
            proto_logits = []
            for n, (im2cluster, prototypes, density) in enumerate(zip(cluster_result['im2cluster'], cluster_result['centroids'], cluster_result['density'])):
                # get positive prototypes: cluster id of each sample, then its centroid
                pos_proto_id = im2cluster[index]
                pos_prototypes = prototypes[pos_proto_id]

                # sample negative prototypes: every cluster that is not a positive
                all_proto_id = [i for i in range(im2cluster.max() + 1)]
                neg_proto_id = set(all_proto_id) - set(pos_proto_id.tolist())
                # list(...) because random.sample no longer accepts a set in recent Python
                neg_proto_id = sample(list(neg_proto_id), self.r)  # sample r negative prototypes
                neg_prototypes = prototypes[neg_proto_id]

                # selected prototypes: positives first, then negatives
                proto_selected = torch.cat([pos_prototypes, neg_prototypes], dim=0)

                # compute prototypical logits (q holds the current features)
                logits_proto = torch.mm(q, proto_selected.t())

                # targets for prototype assignment: sample i -> column i
                labels_proto = torch.linspace(0, q.size(0) - 1, steps=q.size(0)).long().cuda()

                # scaling temperatures for the selected prototypes
                temp_proto = density[torch.cat([pos_proto_id, torch.LongTensor(neg_proto_id).cuda()], dim=0)]
                logits_proto /= temp_proto

                proto_labels.append(labels_proto)
                proto_logits.append(logits_proto)
            return logits, labels, proto_logits, proto_labels
        else:
            return logits, labels, None, None
Below is the complete model code.
import torch
import torch.nn as nn
from random import sample

# concat_all_gather is a helper defined elsewhere in the PCL repository.


class MoCo(nn.Module):
    """
    Build a MoCo model with: a query encoder, a key encoder, and a queue
    https://arxiv.org/abs/1911.05722
    """
    def __init__(self, base_encoder, dim=128, r=16384, m=0.999, T=0.1, mlp=False):
        """
        dim: feature dimension (default: 128)
        r: queue size; number of negative samples/prototypes (default: 16384)
        m: momentum for updating key encoder (default: 0.999)
        T: softmax temperature
        mlp: whether to use mlp projection
        """
        super(MoCo, self).__init__()

        self.r = r
        self.m = m
        self.T = T

        # create the encoders
        # num_classes is the output fc dimension
        self.encoder_q = base_encoder(num_classes=dim)
        self.encoder_k = base_encoder(num_classes=dim)

        if mlp:  # hack: brute-force replacement
            dim_mlp = self.encoder_q.fc.weight.shape[1]
            self.encoder_q.fc = nn.Sequential(nn.Linear(dim_mlp, dim_mlp), nn.ReLU(), self.encoder_q.fc)
            self.encoder_k.fc = nn.Sequential(nn.Linear(dim_mlp, dim_mlp), nn.ReLU(), self.encoder_k.fc)

        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            param_k.data.copy_(param_q.data)  # initialize
            param_k.requires_grad = False  # not update by gradient

        # create the queue; it stores key features, one dim-dimensional column per key
        self.register_buffer("queue", torch.randn(dim, r))
        self.queue = nn.functional.normalize(self.queue, dim=0)

        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update_key_encoder(self):
        """
        Momentum update of the key encoder
        """
        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        # gather keys before updating queue
        keys = concat_all_gather(keys)

        batch_size = keys.shape[0]

        ptr = int(self.queue_ptr)
        assert self.r % batch_size == 0  # for simplicity

        # replace the keys at ptr (dequeue and enqueue)
        self.queue[:, ptr:ptr + batch_size] = keys.T
        ptr = (ptr + batch_size) % self.r  # move pointer

        self.queue_ptr[0] = ptr

    @torch.no_grad()
    def _batch_shuffle_ddp(self, x):
        """
        Batch shuffle, for making use of BatchNorm.
        *** Only support DistributedDataParallel (DDP) model. ***
        """
        # gather from all gpus
        batch_size_this = x.shape[0]
        x_gather = concat_all_gather(x)
        batch_size_all = x_gather.shape[0]

        num_gpus = batch_size_all // batch_size_this

        # random shuffle index
        idx_shuffle = torch.randperm(batch_size_all).cuda()

        # broadcast to all gpus
        torch.distributed.broadcast(idx_shuffle, src=0)

        # index for restoring
        idx_unshuffle = torch.argsort(idx_shuffle)

        # shuffled index for this gpu
        gpu_idx = torch.distributed.get_rank()
        idx_this = idx_shuffle.view(num_gpus, -1)[gpu_idx]

        return x_gather[idx_this], idx_unshuffle

    @torch.no_grad()
    def _batch_unshuffle_ddp(self, x, idx_unshuffle):
        """
        Undo batch shuffle.
        *** Only support DistributedDataParallel (DDP) model. ***
        """
        # gather from all gpus
        batch_size_this = x.shape[0]
        x_gather = concat_all_gather(x)
        batch_size_all = x_gather.shape[0]

        num_gpus = batch_size_all // batch_size_this

        # restored index for this gpu
        gpu_idx = torch.distributed.get_rank()
        idx_this = idx_unshuffle.view(num_gpus, -1)[gpu_idx]

        return x_gather[idx_this]

    def forward(self, im_q, im_k=None, is_eval=False, cluster_result=None, index=None):
        """
        Input:
            im_q: a batch of query images
            im_k: a batch of key images
            is_eval: return momentum embeddings (used for clustering)
            cluster_result: cluster assignments, centroids, and density
            index: indices for training samples
        Output:
            logits, targets, proto_logits, proto_targets
        """
        if is_eval:
            k = self.encoder_k(im_q)
            k = nn.functional.normalize(k, dim=1)
            return k

        # compute key features
        with torch.no_grad():  # no gradient to keys
            self._momentum_update_key_encoder()  # update the key encoder

            # shuffle for making use of BN
            im_k, idx_unshuffle = self._batch_shuffle_ddp(im_k)

            k = self.encoder_k(im_k)  # keys: NxC
            k = nn.functional.normalize(k, dim=1)

            # undo shuffle
            k = self._batch_unshuffle_ddp(k, idx_unshuffle)

        # compute query features
        q = self.encoder_q(im_q)  # queries: NxC
        q = nn.functional.normalize(q, dim=1)

        # compute logits
        # Einstein sum is more intuitive
        # positive logits: Nx1
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)
        # negative logits: Nxr (r queued keys, each of dimension c)
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])

        # logits: Nx(1+r)
        logits = torch.cat([l_pos, l_neg], dim=1)

        # apply temperature
        logits /= self.T

        # labels: positive key indicators
        labels = torch.zeros(logits.shape[0], dtype=torch.long).cuda()

        # dequeue and enqueue
        self._dequeue_and_enqueue(k)

        # prototypical contrast: the part that PCL adds on top of MoCo
        if cluster_result is not None:  # run only when a clustering result is provided
            proto_labels = []
            proto_logits = []
            for n, (im2cluster, prototypes, density) in enumerate(zip(cluster_result['im2cluster'], cluster_result['centroids'], cluster_result['density'])):
                # get positive prototypes: cluster id of each sample, then its centroid
                pos_proto_id = im2cluster[index]
                pos_prototypes = prototypes[pos_proto_id]

                # sample negative prototypes: every cluster that is not a positive
                all_proto_id = [i for i in range(im2cluster.max() + 1)]
                neg_proto_id = set(all_proto_id) - set(pos_proto_id.tolist())
                # list(...) because random.sample no longer accepts a set in recent Python
                neg_proto_id = sample(list(neg_proto_id), self.r)  # sample r negative prototypes
                neg_prototypes = prototypes[neg_proto_id]

                # selected prototypes: positives first, then negatives
                proto_selected = torch.cat([pos_prototypes, neg_prototypes], dim=0)

                # compute prototypical logits (q holds the current features)
                logits_proto = torch.mm(q, proto_selected.t())

                # targets for prototype assignment: sample i -> column i
                labels_proto = torch.linspace(0, q.size(0) - 1, steps=q.size(0)).long().cuda()

                # scaling temperatures for the selected prototypes
                temp_proto = density[torch.cat([pos_proto_id, torch.LongTensor(neg_proto_id).cuda()], dim=0)]
                logits_proto /= temp_proto

                proto_labels.append(labels_proto)
                proto_logits.append(logits_proto)
            return logits, labels, proto_logits, proto_labels
        else:
            return logits, labels, None, None
The part I find most worth learning from is the run_kmeans clustering function, which calls the faiss library.
import faiss
import numpy as np
import torch
import torch.nn as nn


def run_kmeans(x, args):
    """
    Cluster the features with faiss.
    Args:
        x: data to be clustered (2-D array, one row per sample)
    """
    print('performing kmeans clustering')
    # dictionary that stores the clustering results
    results = {'im2cluster': [], 'centroids': [], 'density': []}

    # one clustering per requested number of clusters
    for seed, num_cluster in enumerate(args.num_cluster):
        # intialize faiss clustering parameters
        d = x.shape[1]  # feature dimension
        k = int(num_cluster)
        clus = faiss.Clustering(d, k)  # clustering object for dimension d and k clusters
        clus.verbose = True
        clus.niter = 20  # number of iterations
        clus.nredo = 5  # number of clustering restarts
        clus.seed = seed
        clus.max_points_per_centroid = 1000
        clus.min_points_per_centroid = 10

        res = faiss.StandardGpuResources()  # GPU resources
        cfg = faiss.GpuIndexFlatConfig()  # GPU index configuration
        cfg.useFloat16 = False
        cfg.device = args.gpu
        index = faiss.GpuIndexFlatL2(res, d, cfg)  # flat L2 index

        clus.train(x, index)  # train the clustering on x using this index

        D, I = index.search(x, 1)  # for each sample, find cluster distance and assignments
        im2cluster = [int(n[0]) for n in I]  # cluster id of each sample

        # get cluster centroids (as a numpy array of shape k x d)
        centroids = faiss.vector_to_array(clus.centroids).reshape(k, d)

        # sample-to-centroid distances for each cluster
        Dcluster = [[] for c in range(k)]
        for im, i in enumerate(im2cluster):
            Dcluster[i].append(D[im][0])

        # concentration estimation (phi): mean of the square-rooted distances,
        # scaled by log(cluster size + 10)
        density = np.zeros(k)
        for i, dist in enumerate(Dcluster):
            if len(dist) > 1:
                d = (np.asarray(dist) ** 0.5).mean() / np.log(len(dist) + 10)
                density[i] = d

        # if cluster only has one point, use the max to estimate its concentration
        dmax = density.max()
        for i, dist in enumerate(Dcluster):
            if len(dist) <= 1:
                density[i] = dmax

        # clamp extreme values for stability (10th to 90th percentile)
        density = density.clip(np.percentile(density, 10), np.percentile(density, 90))
        density = args.temperature * density / density.mean()  # scale the mean to temperature

        # convert to cuda Tensors for broadcast
        centroids = torch.Tensor(centroids).cuda()
        centroids = nn.functional.normalize(centroids, p=2, dim=1)

        im2cluster = torch.LongTensor(im2cluster).cuda()
        density = torch.Tensor(density).cuda()

        # append the results of this clustering
        results['centroids'].append(centroids)
        results['density'].append(density)
        results['im2cluster'].append(im2cluster)

    return results
Feature Alignment and Uniformity for Test Time Adaptation
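This is the first clustering-related paper I have come across that I understand at least a little.
It clusters the features directly.
It is the first work to treat TTA as a feature revision problem.
Code
The main code is shown after the two notes below.
The method has two main tasks:
One is the consistency task, which computes a consistency loss between two outputs. The main code is in the function def prototype_loss(self,z,p,labels=None,use_hard=False,tau=1). Here z is the feature and p comes from weights = $supports^T$ labels, a matrix that relates the support features to their labels. dist = $z p^T$ means the features of the current samples are multiplied by these weights (p in the code) to obtain a B×K matrix, and dist serves as the output of the clustering task. Applying argmax to labels gives the hard labels.
The other is alignment, which updates the cluster centers. In the function topk_cluster, z.detach().clone() is the feature, supports are the support features saved in memory, self.scores are the stored scores, p is the output of the classification head, and k is the number of nearest neighbors.
The features are normalized first, then a similarity matrix is computed from them. For the k most similar supports, the squared difference between their scores and p is used as the distance,
loss = -sim_matrix*diff_scores. The idea is that the closer two features are, the more similar their final outputs should be.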
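The arithmetic of the two tasks can be traced with small arrays. This is a NumPy sketch that mirrors the formulas as they are written in these notes, not a verified reimplementation of TSD. The helper names are hypothetical and the data are random. One detail I am assuming from the PyTorch signature: `F.normalize(x, 1)` passes 1 as `p`, so it is an L1 normalization along dim 1, and the sketch follows that.

```python
import numpy as np

def l1_normalize(x, eps=1e-12):
    return x / np.maximum(np.abs(x).sum(axis=1, keepdims=True), eps)

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def prototype_logits(z, supports, labels, tau=1.0):
    """weights = supports^T @ labels (D, K); dist = z @ p^T with p = weights^T."""
    weights = l2_normalize(supports).T @ labels       # label-weighted feature sums
    p = weights.T                                     # (K, D): one prototype per class
    return l1_normalize(z) @ l1_normalize(p).T / tau  # (B, K)

def topk_cluster_loss(feature, supports, scores, p, k=3):
    feature, supports = l1_normalize(feature), l1_normalize(supports)
    sim = feature @ supports.T                        # (B, M) similarity matrix
    idx = np.argsort(-sim, axis=1)[:, :k]             # k most similar supports
    topk_sim = np.take_along_axis(sim, idx, axis=1)   # (B, k)
    scores_near = scores[idx]                         # (B, k, num_class)
    diff = ((p[:, None, :] - scores_near) ** 2).sum(-1)
    return (-topk_sim * diff).mean()

rng = np.random.default_rng(0)
B, M, D, n_cls = 2, 5, 4, 3
z = rng.uniform(0.1, 1.0, size=(B, D))
supports = rng.uniform(0.1, 1.0, size=(M, D))
soft = rng.uniform(size=(M, n_cls))
soft /= soft.sum(axis=1, keepdims=True)               # stored support scores
onehot = np.eye(n_cls)[soft.argmax(axis=1)]           # stored support labels
dist = prototype_logits(z, supports, onehot)
hard_labels = dist.argmax(axis=1)
p_out = rng.uniform(size=(B, n_cls))                  # stand-in for classifier output
loss_local = topk_cluster_loss(z, supports, soft, p_out, k=3)
```

When the classifier output equals the scores of all its nearest supports, the squared difference vanishes and the local term is zero.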
import torch
import torch.nn as nn
import torch.nn.functional as F

# softmax_entropy is defined elsewhere in the TSD repository;
# topk_cluster and softmax_kl_loss are listed further below.


class TSD(nn.Module):
    """
    Test-time Self-Distillation (TSD)
    CVPR 2023
    """
    def __init__(self, model, optimizer, lam=0, filter_K=100, steps=1, episodic=False):
        super().__init__()
        self.model = model
        self.featurizer = model.featurizer  # feature extractor
        self.classifier = model.classifier  # classification head
        self.optimizer = optimizer
        self.steps = steps
        assert steps > 0, "requires >= 1 step(s) to forward and update"
        self.episodic = episodic
        self.filter_K = filter_K  # number of supports kept per class

        # initialize the memory from the classifier weights
        warmup_supports = self.classifier.fc.weight.data.detach()
        self.num_classes = warmup_supports.size()[0]
        self.warmup_supports = warmup_supports
        warmup_prob = self.classifier(self.warmup_supports)  # classifier output

        self.warmup_ent = softmax_entropy(warmup_prob)  # entropy
        # predicted labels, one-hot encoded
        self.warmup_labels = F.one_hot(warmup_prob.argmax(1), num_classes=self.num_classes).float()
        self.warmup_scores = F.softmax(warmup_prob, 1)

        self.supports = self.warmup_supports.data
        self.labels = self.warmup_labels.data
        self.ent = self.warmup_ent.data
        self.scores = self.warmup_scores.data
        self.lam = lam

    def forward(self, x):
        z = self.featurizer(x)
        p = self.classifier(z)  # model prediction

        yhat = F.one_hot(p.argmax(1), num_classes=self.num_classes).float()  # predicted labels
        ent = softmax_entropy(p)  # entropy, used later to filter supports
        scores = F.softmax(p, 1)  # class probabilities

        with torch.no_grad():
            # move the memory to the current device
            self.supports = self.supports.to(z.device)
            self.labels = self.labels.to(z.device)
            self.ent = self.ent.to(z.device)
            self.scores = self.scores.to(z.device)
            # append the current batch to the memory ...
            self.supports = torch.cat([self.supports, z])
            self.labels = torch.cat([self.labels, yhat])
            self.ent = torch.cat([self.ent, ent])
            self.scores = torch.cat([self.scores, scores])

            # ... then keep only the selected supports, which updates the memory
            supports, labels = self.select_supports()
            supports = F.normalize(supports, dim=1)
            # label-weighted sum of support features: one prototype per class
            weights = (supports.T @ (labels))

        dist, loss = self.prototype_loss(z, weights.T, scores, use_hard=False)

        loss_local = topk_cluster(z.detach().clone(), supports, self.scores, p, k=3)
        loss += self.lam * loss_local  # add the local term as a regularizer

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return p

    def select_supports(self):
        ent_s = self.ent  # entropies used for the selection
        y_hat = self.labels.argmax(dim=1).long()  # class index of each stored label
        filter_K = self.filter_K
        if filter_K == -1:
            # note: this assignment is overwritten just below
            indices = torch.LongTensor(list(range(len(ent_s))))

        indices = []
        indices1 = torch.LongTensor(list(range(len(ent_s)))).cuda()
        for i in range(self.num_classes):
            # sort the entropies of class i and keep the filter_K lowest
            _, indices2 = torch.sort(ent_s[y_hat == i])
            indices.append(indices1[y_hat == i][indices2][:filter_K])
        indices = torch.cat(indices)

        # update supports, labels, entropies and scores with the selection
        self.supports = self.supports[indices]
        self.labels = self.labels[indices]
        self.ent = self.ent[indices]
        self.scores = self.scores[indices]

        return self.supports, self.labels

    def prototype_loss(self, z, p, labels=None, use_hard=False, tau=1):
        # z [batch_size,feature_dim]: features of the input batch
        # p [num_class,feature_dim]: class prototypes, here the label-weighted
        #   sums of the normalized support features passed in as weights.T
        # labels [batch_size,]: in forward() the softmax scores are passed in
        # note: the second positional argument of F.normalize is p (the norm
        # order), so these two calls are L1 normalizations along dim 1
        z = F.normalize(z, 1)
        p = F.normalize(p, 1)
        # similarity between each feature and each class prototype
        dist = z @ p.T / tau
        if labels is None:
            _, labels = dist.max(1)  # classify by similarity
        if use_hard:
            """use hard label for supervision """
            #_,labels = dist.max(1)  #for prototype-based pseudo-label
            labels = labels.argmax(1)  # for logits-based pseudo-label
            loss = F.cross_entropy(dist, labels)
        else:
            """use soft label for supervision """
            loss = softmax_kl_loss(labels.detach(), dist).sum(1).mean(0)  # detach is **necessary**
            #loss = softmax_kl_loss(dist,labels.detach()).sum(1).mean(0) achieves comparable results
        return dist, loss
Below are some related utility functions.
import torch
import torch.nn.functional as F


def topk_labels(feature, supports, scores, k=3):
    feature = F.normalize(feature, 1)
    supports = F.normalize(supports, 1)
    sim_matrix = feature @ supports.T  # B,M
    _, idx_near = torch.topk(sim_matrix, k, dim=1)  # batch x K
    scores_near = scores[idx_near]  # batch x K x num_class
    soft_labels = torch.mean(scores_near, 1)  # batch x num_class
    soft_labels = torch.argmax(soft_labels, 1)
    return soft_labels


def topk_cluster(feature, supports, scores, p, k=3):
    # p: outputs of model batch x num_class
    feature = F.normalize(feature, 1)
    supports = F.normalize(supports, 1)
    sim_matrix = feature @ supports.T  # B,M
    topk_sim_matrix, idx_near = torch.topk(sim_matrix, k, dim=1)  # batch x K
    scores_near = scores[idx_near].detach().clone()  # batch x K x num_class
    diff_scores = torch.sum((p.unsqueeze(1) - scores_near) ** 2, -1)

    loss = -1.0 * topk_sim_matrix * diff_scores
    return loss.mean()


def knn_affinity(X, knn):
    # x [N,D]
    N = X.size(0)
    X = F.normalize(X, 1)
    dist = torch.norm(X.unsqueeze(0) - X.unsqueeze(1), dim=-1, p=2)  # [N, N]
    n_neighbors = min(knn + 1, N)
    knn_index = dist.topk(n_neighbors, -1, largest=False).indices[:, 1:]  # [N, knn]
    W = torch.zeros(N, N, device=X.device)
    W.scatter_(dim=-1, index=knn_index, value=1.0)
    return W


def softmax_mse_loss(input_logits, target_logits):
    """Takes softmax on both sides and returns MSE loss
    Note:
    - Returns the sum over all examples. Divide by the batch size afterwards
      if you want the mean.
    - Sends gradients to inputs but not the targets.
    """
    assert input_logits.size() == target_logits.size()
    input_softmax = F.softmax(input_logits, dim=1)
    target_softmax = F.softmax(target_logits, dim=1)

    mse_loss = (input_softmax - target_softmax) ** 2
    return mse_loss


def softmax_kl_loss(input_logits, target_logits):
    """Takes softmax on both sides and returns KL divergence
    F.log_softmax turns input_logits into log-probabilities, F.softmax turns
    target_logits into probabilities, and F.kl_div computes the KL divergence
    between the two. reduction='none' keeps the element-wise values instead
    of summing them.
    Note:
    - Returns the sum over all examples. Divide by the batch size afterwards
      if you want the mean.
    - Sends gradients to inputs but not the targets.
    """
    assert input_logits.size() == target_logits.size()
    input_log_softmax = F.log_softmax(input_logits, dim=1)
    target_softmax = F.softmax(target_logits, dim=1)

    kl_div = F.kl_div(input_log_softmax, target_softmax, reduction='none')
    return kl_div


def get_distances(X, Y, dist_type="cosine"):
    """
    Compute pairwise distances.
    Args:
        X: (N, D) tensor
        Y: (M, D) tensor
    """
    if dist_type == "euclidean":
        distances = torch.cdist(X, Y)
    elif dist_type == "cosine":
        distances = 1 - torch.matmul(F.normalize(X, dim=1), F.normalize(Y, dim=1).T)
    else:
        raise NotImplementedError(f"{dist_type} distance not implemented.")

    return distances


@torch.no_grad()
def soft_k_nearest_neighbors(features, features_bank, probs_bank):
    pred_probs = []
    K = 4
    for feats in features.split(64):
        distances = get_distances(feats, features_bank, "cosine")
        _, idxs = distances.sort()
        idxs = idxs[:, : K]
        # (64, num_nbrs, num_classes), average over dim=1
        probs = probs_bank[idxs, :].mean(1)
        pred_probs.append(probs)
    pred_probs = torch.cat(pred_probs)
    _, pred_labels = pred_probs.max(dim=1)

    return pred_labels, pred_probs
SimSiam
Paper: https://arxiv.org/abs/2011.10566
Code: https://github.com/facebookresearch/simsiam
The paper is very well written, so just go and read it. The code is simple and is almost the same as the pseudocode.