news 2025/7/5 1:06:00


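Transfer learning

1. fastText overview

fastText is a toolkit widely used in NLP engineering. It serves two main purposes: text classification and training word vectors.

As its name suggests, its main strength is fast training and prediction while keeping accuracy relatively high. The fastText model bundled with the toolkit has a very simple network structure. When training word vectors, the model uses a hierarchical softmax to stay efficient when the number of output classes is very large. Because the model is too simple to capture word order by itself, it also extracts n-gram features to compensate for this weakness and improve accuracy.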
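To illustrate why n-gram features recover some word-order information, here is a minimal sketch in plain Python. The function name `word_ngrams` is hypothetical and not part of the fasttext package; the real toolkit also hashes n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length 1..n as strings."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


a = word_ngrams(["dog", "bites", "man"], 2)
b = word_ngrams(["man", "bites", "dog"], 2)
print(a)
# The two sentences share the same unigrams but not the same bigrams,
# so bigram features let a bag-of-features model tell them apart.
print(set(a) == set(b))
```

With unigrams only (n=1) the two sentences would produce identical feature sets.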

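2. fastText model architecture

The fastText architecture closely resembles the CBOW model in Word2Vec. The difference is that fastText predicts a label, whereas CBOW predicts the center word.

The fastText model has three layers:

Input layer: the embedding vectors of the document's features, including n-gram features
Hidden layer: the average of the input vectors
Output layer: the label of the document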
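The three layers can be sketched in a few lines of plain Python. This is an illustrative toy, not the toolkit's implementation: the name `fasttext_forward`, the tiny embedding table and the output weights are all made up for the example, and a plain softmax stands in for the hierarchical softmax discussed next.

```python
import math


def fasttext_forward(feature_ids, embeddings, output_weights):
    """Average the feature embeddings, apply a linear layer, then softmax."""
    dim = len(embeddings[0])
    # Hidden layer: element-wise mean of the input embeddings.
    hidden = [0.0] * dim
    for fid in feature_ids:
        for d in range(dim):
            hidden[d] += embeddings[fid][d]
    hidden = [h / len(feature_ids) for h in hidden]
    # Output layer: one score per label, normalized with a stable softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy parameters: 3 features (words or n-grams), 2 dimensions, 2 labels.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_weights = [[1.0, 0.0], [0.0, 1.0]]
probs = fasttext_forward([0, 2], embeddings, output_weights)
print(probs)
```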

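(I) Hierarchical softmax

For efficiency, fastText does not compute label probabilities with the conventional softmax over all classes. Instead it arranges the labels in a Huffman tree and computes the probabilities with a hierarchical softmax.

(1) Huffman tree

When a binary tree is built from n weighted nodes, the tree with the smallest weighted path length is called the optimal binary tree, also known as the Huffman tree.

In such a tree, nodes with larger weights lie closer to the root.

(2) Building a Huffman tree

Given n weights w1, w2, …, wn, the resulting Huffman tree has n leaf nodes and is built as follows:

Step 1: Treat w1, w2, …, wn as a forest of n trees, each consisting of a single node;
Step 2: Select the two trees whose roots have the smallest weights and merge them as the left and right subtrees of a new tree whose root weight is the sum of the two root weights;
Step 3: Remove the two selected trees from the forest and add the new tree;
Step 4: Repeat steps 2 and 3 until only one tree remains in the forest; that tree is the Huffman tree.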

(3) Huffman coding

By convention, a left branch in the Huffman tree is labeled 0 and a right branch 1. The sequence of 0s and 1s along the path from the root to a leaf is the code of the character stored at that leaf. Such a code is called a Huffman code.
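The construction steps and the 0/1 coding rule can be sketched together with Python's `heapq`. The function name `huffman_codes` is illustrative, and the sketch assumes that labels are not themselves tuples, because tuples are used to represent internal nodes.

```python
import heapq
import itertools


def huffman_codes(weights):
    """Build a Huffman tree from {label: weight} and return {label: bit string}."""
    counter = itertools.count()  # unique tie-breaker so nodes are never compared
    heap = [(w, next(counter), label) for label, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two trees with the smallest root weights.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")  # left branch is 0
            walk(node[1], prefix + "1")  # right branch is 1
        else:                            # leaf: record its code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1})
print(codes)
```

The heaviest label ends up with the shortest code, which is why frequent labels are cheap to score in a hierarchical softmax.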

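(II) Negative sampling

(1) Strategy

Reduce the number of tokens that enter the softmax computation.

Noise-word selection: fix the number K of noise words to draw; each token t_i is drawn as a noise word with probability
$P(t_i)=\frac{f(t_i)^{0.75}}{\sum_j f(t_j)^{0.75}}$
where f(t) is the frequency of token t.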
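The sampling rule can be sketched as follows. The function names, the toy frequencies and the fixed seed are assumptions made for illustration; real implementations also avoid drawing the true target word, which is omitted here.

```python
import random


def noise_distribution(freqs, power=0.75):
    """Map raw token frequencies to sampling probabilities f(t)^power / sum."""
    weights = {token: f ** power for token, f in freqs.items()}
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def sample_noise_words(freqs, k, seed=0):
    """Draw k noise words (with replacement) from the smoothed distribution."""
    dist = noise_distribution(freqs)
    tokens = list(dist)
    rng = random.Random(seed)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=k)


toy_freqs = {"the": 16, "cat": 1}
print(noise_distribution(toy_freqs))  # "the" drops from 16/17 to about 8/9
print(sample_noise_words(toy_freqs, k=5))
```

Raising frequencies to the power 0.75 flattens the distribution, so rare tokens are sampled somewhat more often than their raw frequency would suggest.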

(2) Advantages

Faster training: the loss is computed over only part of the data, so the computation is simpler
Better results: the added negative samples mimic the noise found in real data, which makes the model more robust and helps it generalize

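3. Text classification with fastText

Model training

import fasttext

# Text classification (supervised)
'''
input: path to the training text file
lr: learning rate
epoch: number of training epochs
wordNgrams: maximum length of the word n-gram features
dim: word-vector dimension
loss: loss function; the default is 'softmax'; 'hs' selects hierarchical softmax; 'ova' (one vs all) trains multiple binary classifiers on the same corpus at the same time
'''
model = fasttext.train_supervised(input='path to the training file')
# Training word vectors (unsupervised)
fasttext.train_unsupervised(input='path to the corpus file')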

Prediction

model.predict('text to classify')
# Returns a tuple:
# the first item holds the predicted labels, the second the corresponding probabilities

Testing

model.test('path to the validation/test file')
# Returns a tuple:
# the number of samples, the precision and the recall
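To make the three returned numbers concrete, the sketch below computes sample count, precision and recall at k from gold labels and ranked predictions. It is an illustration under assumptions, not the toolkit's code: the function name and the toy data are made up, and precision is taken as correct predictions over all predictions while recall is correct predictions over all gold labels.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    hits = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return len(gold), hits / n_predicted, hits / n_gold


gold = [{"sports"}, {"tech", "finance"}, {"food"}]
predicted = [["sports"], ["tech"], ["tech"]]
print(precision_recall_at_k(gold, predicted, k=1))
```

When every sample has exactly one label and k=1, precision and recall coincide with plain accuracy.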

Saving the model

model.save_model('path to save the model')

Reloading the model

model = fasttext.load_model('path to the saved model')

4. Training word vectors

(I) Steps for training word vectors:

Obtain the data
Train the word vectors
Set the model hyperparameters
Evaluate the model
Save and reload the model

(II) API

Get the vector of a given word

model.get_word_vector(word='target word')

Find nearest neighbors

model.get_nearest_neighbors(word='target word')
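Conceptually, a nearest-neighbor lookup ranks the other words by cosine similarity to the query vector. The sketch below shows that idea on a made-up three-dimensional vocabulary; the vectors and the helper names are assumptions for illustration, not values from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(word, vectors, k=2):
    """Return the k most similar words as (score, word) pairs, best first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other) for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy vectors; a real model has hundreds of dimensions.
toy_vectors = {
    "music": [0.9, 0.1, 0.0],
    "concert": [0.8, 0.2, 0.1],
    "song": [0.7, 0.3, 0.0],
    "seagull": [0.0, 0.1, 0.9],
}
print(nearest_neighbors("music", toy_vectors))
```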

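5. Word-vector transfer

Word-vector models that have already been trained on large corpora can be used directly or adapted.

Download the compressed bin.gz file of the word-vector model
Decompress the bin.gz file into a bin file
Load the bin file to obtain the word vectors
Check the quality with nearest neighbors

# Shell: decompress with gunzip to obtain cc.zh.300.bin
gunzip cc.zh.300.bin.gz
# Python: load the model
model = fasttext.load_model("cc.zh.300.bin")
# Get the word vector of the noun '海鸥' (seagull)
model.get_word_vector("海鸥")
# The returned neighbors should be related to the query word; for '音乐' (music), for example, they are words such as 乐曲, 音乐会 and 声乐
model.get_nearest_neighbors("海鸥")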

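6. Transfer learning

(I) Overview

(1) Pretrained models

A pretrained model is usually a large model: it has a complex network structure and a large number of parameters, and it has been trained on a sufficiently large dataset.

In NLP, pretrained models are usually language models. Language-model training is unsupervised, so large-scale corpora are available, and language models are the foundation of many typical NLP tasks such as machine translation, text generation and reading comprehension.

Common pretrained models include BERT, GPT, RoBERTa and Transformer-XL.

(2) Fine-tuning

Starting from a given pretrained model, change some of its parameters or add new output layers, then train on a small dataset so that the whole model fits a specific task better.

(3) Two ways of transferring

Use the pretrained model directly for the same task, without adjusting its parameters or structure; such models work out of the box. This usually suits only general-purpose tasks, for example the pretrained word-vector models provided with the fastText toolkit. In addition, to make models work out of the box, many developers save parts of the model structure as separate pretrained models and provide matching loading methods for specific goals.
The more mainstream way is to exploit the pretrained model's ability to extract abstract features and then fine-tune it: training updates a small part of the parameters to adapt the model to different tasks. This way requires a small amount of labeled data for supervised learning.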

(II) Common pretrained models in NLP

(1) Common pretrained models

BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT, ALBERT, T5, XLM-RoBERTa

(2) BERT and its variants

bert-base-uncased: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained on lower-cased English text.
bert-large-uncased: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters; trained on lower-cased English text.
bert-base-cased: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained on cased (case-sensitive) English text.
bert-large-cased: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters; trained on cased (case-sensitive) English text.
bert-base-multilingual-uncased: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained on lower-cased text in 102 languages.
bert-large-multilingual-uncased: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters; trained on lower-cased text in 102 languages.
bert-base-chinese: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained on Simplified and Traditional Chinese text.

(3) GPT

openai-gpt: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained by OpenAI on an English corpus.

(4) GPT-2 and its variants

gpt2: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 117M parameters; trained on the OpenAI GPT-2 English corpus.
gpt2-xl: 48 hidden layers, 1600-dimensional output, 25 self-attention heads, 1558M parameters; trained on the large OpenAI GPT-2 English corpus.

(5) Transformer-XL

transfo-xl-wt103: 18 hidden layers, 1024-dimensional output, 16 self-attention heads, 257M parameters; trained on the wikitext-103 English corpus.

(6) XLNet and its variants

xlnet-base-cased: 12 hidden layers, 768-dimensional output, 12 self-attention heads, 110M parameters; trained on an English corpus.
xlnet-large-cased: 24 hidden layers, 1024-dimensional output, 16 self-attention heads, 340M parameters; trained on an English corpus.

(7) XLM

xlm-mlm-en-2048: 12 hidden layers, 2048-dimensional output, 16 self-attention heads; trained on English text.

(8) RoBERTa and its variants

roberta-base: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 125M parameters; trained on English text.
roberta-large: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, 355M parameters; trained on English text.

(9) DistilBERT and its variants

distilbert-base-uncased: distilled (compressed) from bert-base-uncased; 6 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 66M parameters.
distilbert-base-multilingual-cased: distilled (compressed) from bert-base-multilingual-cased; 6 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 134M parameters.

(10) ALBERT

albert-base-v1: 12 encoder hidden layers with shared weights, 768-dimensional output, 12 self-attention heads, 11M parameters; trained on English text.
albert-base-v2: 12 encoder hidden layers with shared weights, 768-dimensional output, 12 self-attention heads, 11M parameters; trained on English text with more data and a longer training time than v1.

(11) T5 and its variants

t5-small: 6 encoder hidden layers, 512-dimensional output, 8 self-attention heads, 60M parameters; trained on the C4 corpus.
t5-base: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, 220M parameters; trained on the C4 corpus.
t5-large: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, 770M parameters; trained on the C4 corpus.

(12) XLM-RoBERTa and its variants

xlm-roberta-base: 12 encoder hidden layers, 768-dimensional output, 12 self-attention heads, about 270M parameters; trained on 2.5TB of text in 100 languages.
xlm-roberta-large: 24 encoder hidden layers, 1024-dimensional output, 16 self-attention heads, about 550M parameters; trained on 2.5TB of text in 100 languages.

(III) Using the Transformers library

(1) Three levels of use in the Transformers library

Pipeline: a highly integrated, minimal interface; a few lines of code are enough to run an NLP task.
AutoModel: loads and runs the models of the BERTology family.
Specific model: the concrete model class must be named explicitly and called with the parameters specific to that BERTology model; this is more complex but more flexible.

(2) Encoding and decoding functions

<I> Encoding

tokenizer.encode()
tokenizer.tokenize()
tokenizer.encode_plus()
tokenizer.batch_encode_plus()
tokenizer.convert_tokens_to_ids()

<1> tokenizer.encode()

# 1. tokenizer.encode()
# Tokenizes and converts tokens to ids: encode = tokenize + convert_tokens_to_ids
# Assumes `tokenizer` is an already loaded tokenizer and `sents` is a list of sentences
# A single sentence: encoded and padded on its own, one vector per sentence
# A sentence pair: both sentences are encoded into one sequence and padded together, separated by [SEP] (id 102)

# Single sentence; only input_ids is returned by default
out = tokenizer.encode(
    text=sents[0],
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out)
print(out.shape)

# Sentence pair: the two sentences are encoded together; only input_ids is returned by default
out = tokenizer.encode(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out)
print(out.shape)

<2> tokenizer.tokenize()

# 2. tokenizer.tokenize()
# Only splits text into tokens; it takes one string and does no padding or id conversion
out = tokenizer.tokenize(sents[0])
print(out)

<3> tokenizer.encode_plus()

# 3. tokenizer.encode_plus()
# Builds on encode and returns input_ids, token_type_ids and attention_mask
# Single sentence; input_ids, token_type_ids and attention_mask are returned by default
out = tokenizer.encode_plus(
    text=sents[0],
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out)

# Sentence pair, encoded together
# Note token_type_ids: 0 for the first sentence, 1 for the second
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out)
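What the three returned fields mean for a sentence pair can be shown with a toy encoder in plain Python. This is a sketch under assumptions, not the library's tokenizer: `toy_encode_plus` and `TOY_VOCAB` are hypothetical, only the special-token ids follow the usual BERT convention, and truncation is a naive cut at `max_length`.

```python
# Hypothetical toy vocabulary; the ordinary token ids are made up.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "你": 1, "好": 2, "吗": 3, "很": 4}


def toy_encode_plus(tokens_a, tokens_b=None, max_length=10):
    """Build input_ids, token_type_ids and attention_mask for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)            # first sentence: segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second sentence: segment 1
    tokens = tokens[:max_length]                  # naive truncation
    token_type_ids = token_type_ids[:max_length]
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [TOY_VOCAB["[PAD]"]] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,  # 0 marks padding
    }


print(toy_encode_plus(["你", "好", "吗"]))
print(toy_encode_plus(["你", "好"], ["很", "好"]))
```

In the pair case the second sentence and its closing [SEP] carry segment id 1, and padded positions are masked out with 0 in `attention_mask`.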

<4> tokenizer.batch_encode_plus()

# 4. tokenizer.batch_encode_plus()
# Builds on encode_plus and processes texts in batches
# Batch encoding
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=sents,
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out['input_ids'].shape)

# Batch encoding of sentence pairs
out = tokenizer.batch_encode_plus(
    # Each pair is encoded as one sequence and padded as a unit;
    # the items of the list are encoded and padded separately
    batch_text_or_text_pairs=[(sents[0], sents[1]), (sents[2], sents[3])],
    truncation=True,
    padding='max_length',
    max_length=20,
    return_tensors='pt',
)
print(out['input_ids'].shape)

<5> tokenizer.convert_tokens_to_ids()

# 5. tokenizer.convert_tokens_to_ids()
# convert_tokens_to_ids maps tokens to ids and is applied after tokenization.
# convert_ids_to_tokens maps ids back to tokens, typically to inspect model predictions.
tokens = tokenizer.tokenize(sents[0])
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

<II> Decoding

tokenizer.decode()
tokenizer.convert_ids_to_tokens()

<1> tokenizer.decode()

# `out` is the result of batch_encode_plus from the encoding section
res1 = tokenizer.decode(out['input_ids'][0])
print(res1)

<2> tokenizer.convert_ids_to_tokens()

res2 = [tokenizer.convert_ids_to_tokens(i.item()) for i in out['input_ids'][0]]
print(res2)

(3) Running NLP tasks with pipelines

<I> Text classification

Text classification assigns a category to a text based on its content. It is a sentence-level classification.

# Import packages
import torch
from transformers import pipeline
import numpy as np

# Sentiment analysis
def text_classify():
    # 1. Define the model
    model = pipeline(task='sentiment-analysis', model='./model/chinese_sentiment')
    # 2. Predict directly
    res = model('我爱北京天安门，天安门上太阳升。')
    print(res)

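<II> Feature extraction

A feature-extraction task only returns the features of the processed text, which is what a pretrained model provides by itself. Its output has to be combined with another model to solve a task.

# Feature extraction: obtain token vectors for a downstream task
def feature_extraction():
    # 1. Create the pipeline
    model = pipeline(task='feature-extraction', model='./model/bert-base-chinese')
    # 2. Run the model
    res = model('去码头整点薯条')
    print(res)
    print(type(res))
    print(np.array(res).shape)

# Output
# output---> <class 'list'> (1, 9, 768)
# The 7 characters become 9 tokens: [CLS] 去 码 头 整 点 薯 条 [SEP]

Output without a task head: feature extraction returns the model output without a task head; with bert-base-chinese, each of the 9 tokens has a 768-dimensional feature vector
Output with a task head: tasks of a specific type, such as text classification or fill-mask, return output with a task head, and the result differs by task type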

<III> Fill-mask

The fill-mask (cloze) task is also called masked language modeling. It is one of the sub-tasks used when training BERT, and it is a classification task.

# Fill-mask
def fill_mask():
    model = pipeline(task='fill-mask', model='./model/chinese-bert-wwm')
    res = model('我想明天去[MASK]家吃饭。')
    print(res)

# Output
# output--->
# [{'score': 0.34331339597702026, 'token': 1961, 'token_str': '她', 'sequence': '我 想 明 天 去 她 家 吃 饭.'},
# {'score': 0.2533259987831116, 'token': 872, 'token_str': '你', 'sequence': '我 想 明 天 去 你 家 吃 饭.'},
# {'score': 0.1874391734600067, 'token': 800, 'token_str': '他', 'sequence': '我 想 明 天 去 他 家 吃 饭.'},
# {'score': 0.1273055076599121, 'token': 2769, 'token_str': '我', 'sequence': '我 想 明 天 去 我 家 吃 饭.'},
# {'score': 0.02162978984415531, 'token': 2644, 'token_str': '您', 'sequence': '我 想 明 天 去 您 家 吃 饭.'}]

<IV> Reading comprehension

Reading comprehension is also called extractive question answering: given a passage and a question, the model returns the answer as a span of the passage.

# Reading comprehension
def qa():
    context = '我叫张三，我是一个程序员，我的喜好是打篮球。'
    questions = ['我是谁？', '我是做什么的？', '我的爱好是什么？']
    model = pipeline(task='question-answering', model='./model/chinese_pretrain_mrc_roberta_wwm_ext_large')
    res = model(context=context, question=questions)
    print(res)

# Output
'''
[{'score': 1.2071758523357623e-12, 'start': 2, 'end': 4, 'answer': '张三'},
 {'score': 2.60890374192968e-06, 'start': 9, 'end': 12, 'answer': '程序员'},
 {'score': 4.1686924134864967e-08, 'start': 18, 'end': 21, 'answer': '打篮球'}]
'''

<V> Text summarization

A summarization task takes a passage as input and outputs a short text that condenses it.

# 5. Text summarization
def summary():
    model = pipeline(task='summarization', model='./model/distilbart-cnn-12-6')
    text = (
        "BERT is a transformers model pretrained on a large corpus of English data "
        "in a self-supervised fashion. This means it was pretrained on the raw texts "
        "only, with no humans labelling them in any way (which is why it can use lots "
        "of publicly available data) with an automatic process to generate inputs and "
        "labels from those texts. More precisely, it was pretrained with two objectives:Masked "
        "language modeling (MLM): taking a sentence, the model randomly masks 15% of the "
        "words in the input then run the entire masked sentence through the model and has "
        "to predict the masked words. This is different from traditional recurrent neural "
        "networks (RNNs) that usually see the words one after the other, or from autoregressive "
        "models like GPT which internally mask the future tokens. It allows the model to learn "
        "a bidirectional representation of the sentence.Next sentence prediction (NSP): the models"
        " concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to "
        "sentences that were next to each other in the original text, sometimes not. The model then "
        "has to predict if the two sentences were following each other or not."
    )
    res = model(text)
    print(res)

# Output
# output---> [{'summary_text': ' BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion . It was pretrained with two objectives: Masked language modeling (MLM) and next sentence prediction (NSP) This allows the model to learn a bidirectional representation of the sentence .'}]

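<VI> NER

Named entity recognition (NER) is a basic NLP task. It identifies person names (PER), locations (LOC), organizations (ORG) and other entities (MISC) in text. For example: (王 B-PER) (小 I-PER) (明 I-PER) (在 O) (北 B-LOC) (京 I-LOC). Here O marks a non-entity token, B marks the beginning of an entity and I marks the inside of an entity.

NER is essentially a classification task (also called sequence labeling, i.e. token-level classification). It is the basis of syntactic parsing, which in turn is central to NLP tasks.

NER (named entity recognition, entity extraction) has two stages: boundary detection and entity classification
Common named entities: person names, place names, organization names, times, dates, currencies, percentages
The key information in a sentence is usually carried by named entities; applications include intent recognition, keyword extraction and knowledge graphs
Information extraction: entity extraction, relation extraction, event extraction (attribute extraction)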
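The B/I/O scheme described above can be turned back into entity spans with a short loop. The function name `bio_to_spans` is hypothetical; as a simplifying assumption, an I- tag that does not continue an open entity of the same type is treated like O.

```python
def bio_to_spans(tokens, tags):
    """Group character tokens into (entity_text, entity_type) spans from BIO tags."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            spans.append(("".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()                                   # close any open entity
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)              # continue the open entity
        else:
            flush()                                   # O tag or stray I- tag
            current_tokens, current_type = [], None
    flush()
    return spans


tokens = ["王", "小", "明", "在", "北", "京"]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
```

This covers both stages named in the list: the B/I boundaries delimit the entity, and the suffix after the dash gives its class.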

def ner():
    model = pipeline(task='ner', model='./model/roberta-base-finetuned-cluener2020-chinese')
    res = model('特朗普第二次担任了美国总统')
    print(res)

(4) Running NLP tasks with Auto model classes

AutoTokenizer and AutoModelForSequenceClassification can download pretrained models automatically from the official hub, or load local pretrained models
AutoModelForSequenceClassification handles classification tasks and selects the concrete model according to its arguments
AutoTokenizer.encode() returns a tensor with return_tensors='pt' and a plain list without it
AutoTokenizer.encode() with padding='max_length' pads the sequence up to the maximum length
Calling the model's forward function with return_dict=False returns a plain tuple instead of an output object

<I> Text classification

# Import packages
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModelForMaskedLM, AutoModelForQuestionAnswering
# AutoModelForSeq2SeqLM: text summarization
# AutoModelForTokenClassification: NER
from transformers import AutoModelForSeq2SeqLM, AutoModelForTokenClassification

# Text classification
def text_classify():
    # chinese_sentiment is a 5-class model
    # 1. Load the tokenizer: tokenization + token-to-id mapping
    my_tokenizer = AutoTokenizer.from_pretrained('./model/chinese_sentiment')
    # 2. Load the model
    # SequenceClassification: sentence-level classification
    # TokenClassification: token-level classification
    my_model = AutoModelForSequenceClassification.from_pretrained('./model/chinese_sentiment')
    # 3. Prepare a sample sentence
    # message = '人生该如何起头'
    # message = '我的人生很灰暗'
    # message = '我的人生很辉煌'
    message = '我不同意你的看法'
    # message = '我对你的看法表示中立'
    # message = '我很同意你的看法'
    # message = '你的看法太棒了，我非常同意'
    # message = ['艾海两只黄鹂鸣翠柳', '一行白鹭上青天']
    # 4. Encode the sentence
    output1 = my_tokenizer.encode(
        message,
        return_tensors='pt',  # 'pt' (torch tensor), 'tf' (tensorflow) or None (list)
        truncation=True,  # truncate anything beyond max_length
        padding='max_length',  # True pads to the longest sentence; 'max_length' pads to max_length
        max_length=20,  # maximum sentence length
    )
    print(output1)
    print(output1.shape)
    # Without return_tensors='pt' a list is returned
    output2 = my_tokenizer.encode(
        message,
        # return_tensors='pt',
        truncation=True,
        padding=True,
        max_length=20,
    )
    print(output2)
    # 5. Predict with the model
    my_model.eval()  # switch the model to evaluation mode
    result = my_model(output1)
    print(result)
    result2 = my_model(output1, return_dict=False)
    print(result2)
    # Interpret the result
    topv, topi = torch.topk(result.logits, k=1, dim=-1)
    print('star', topi.item())

<II> Feature extraction

# Feature extraction: obtain token vectors and a sentence vector
# Note token_type_ids and attention_mask
def feature_extraction():
    # 1. Load the tokenizer
    my_tokenizer = AutoTokenizer.from_pretrained('./model/bert-base-chinese')
    print(my_tokenizer)
    # 2. Load the model
    my_model = AutoModel.from_pretrained('./model/bert-base-chinese')
    # 3. Prepare the samples
    message = ['你是谁', '人生该如何起头']
    # 4. Encode the samples
    output = my_tokenizer.batch_encode_plus(
        message,  # a list with several sentences, hence batch_encode_plus
        return_tensors='pt',
        truncation=True,
        padding='max_length',  # pad with 0 when the sentence is too short
        max_length=20,
    )
    # output usually holds three key-value pairs
    # input_ids: the token ids, with [CLS] and [SEP] added at both ends
    # 'input_ids': tensor([
    #     [101, 872, 3221, 6443, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    #     [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    # ])
    # attention_mask: 0 marks padding positions, 1 marks meaningful tokens
    # 'attention_mask': tensor([
    #     [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    #     [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    # ])
    # token_type_ids: within one sample, the first sentence is marked 0 and the second 1
    # 'token_type_ids': tensor([
    #     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # only one sentence here, so all 0
    #     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    # ])
    print(output)
    # 5. Feed the data to the model
    my_model.eval()
    res = my_model(**output)
    print(res)
    print(res.last_hidden_state.shape)  # token vectors, e.g. for NER (token-level classification)
    print(res.pooler_output.shape)  # sentence vector, e.g. for text classification

<III> Fill-mask

# Fill-mask
def fill_mask():
    # 1. Load the tokenizer
    my_tokenizer = AutoTokenizer.from_pretrained('./model/chinese-bert-wwm')
    print(my_tokenizer)
    # 2. Load the model
    my_model = AutoModelForMaskedLM.from_pretrained('./model/chinese-bert-wwm')
    # 3. Prepare the sample
    message = '我想明天去[MASK]家吃饭'
    # 4. Encode the sample
    output = my_tokenizer.encode_plus(message, return_tensors='pt')
    print(output)
    # 5. Feed the data to the model
    my_model.eval()
    res = my_model(**output).logits  # res.shape = (1, 11, vocab_size)
    print(res.shape)
    # res[0][6] is the output vector at the [MASK] position; its length is vocab_size
    # torch.argmax(res[0][6]) gives the index of the largest value
    index = torch.argmax(res[0][6]).item()
    print(index)
    # 6. Get the token predicted for [MASK]
    token = my_tokenizer.convert_ids_to_tokens(index)
    print(token)

<IV> Reading comprehension

def qa():
    # 1. Load the tokenizer
    my_tokenizer = AutoTokenizer.from_pretrained('./model/chinese_pretrain_mrc_roberta_wwm_ext_large')
    # 2. Load the model
    my_model = AutoModelForQuestionAnswering.from_pretrained('./model/chinese_pretrain_mrc_roberta_wwm_ext_large')
    # 3. Prepare the text
    context = '我叫张三 我是一个程序员 我的喜好是打篮球'
    questions = ['我是谁？', '我是做什么的？', '我的爱好是什么？']
    # 4. Feed the data to the model
    my_model.eval()
    for question in questions:
        print(question)
        # Encode the question and the context together as a pair
        inputs = my_tokenizer.encode_plus(question, context, return_tensors='pt')
        outputs = my_model(**inputs)
        print(outputs)
        # Take the start and end logits and get their indices with argmax
        start_index = torch.argmax(outputs.start_logits, dim=-1).item()
        end_index = torch.argmax(outputs.end_logits, dim=-1).item()
        # Slice the input ids; a is a sequence of ids
        a = inputs['input_ids'][0][start_index: end_index + 1]
        # Decode a
        res = my_tokenizer.convert_ids_to_tokens(a)
        print(res)
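Taking the two argmax values independently can occasionally give an end index that lies before the start index, which yields an empty slice. A common alternative, sketched here in plain Python with made-up logits, is to pick the pair with the highest combined score subject to start <= end. The function name `best_span` and the `max_answer_len` limit are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_start, best_end = score, s, e
    return best_start, best_end


tokens = ["[CLS]", "我", "是", "谁", "[SEP]", "我", "叫", "张", "三", "[SEP]"]
# Hypothetical logits: the end argmax (position 3) lies before the start argmax (position 7).
start_logits = [0, 0, 0, 0, 0, 0, 0, 5, 1, 0]
end_logits = [0, 0, 0, 4, 0, 0, 0, 1, 3, 0]
start, end = best_span(start_logits, end_logits)
print(tokens[start:end + 1])
```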

<V> Text summarization

def summary():
    # 1. Load the tokenizer
    my_tokenizer = AutoTokenizer.from_pretrained('./model/distilbart-cnn-12-6')
    # 2. Load the model
    my_model = AutoModelForSeq2SeqLM.from_pretrained('./model/distilbart-cnn-12-6')
    # 3. Prepare the text
    # The long BERT description from the pipeline summarization example can be used here instead
    text = 'I have a dream.'
    # 4. Convert the text to tensors
    inputs = my_tokenizer.encode_plus(text, return_tensors='pt')
    # 5. Feed the data to the model and generate
    my_model.eval()
    outputs = my_model.generate(inputs['input_ids'])
    print(outputs)
    # skip_special_tokens=True drops special tokens (for BERT these are CLS, SEP, PAD, UNK, MASK)
    # clean_up_tokenization_spaces controls whether extra spaces left by tokenization are cleaned up
    res = my_tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(res)

<VI> NER

Named entity recognition, entity extraction

def ner():
    # 1. Load the tokenizer, the model and the config
    my_tokenizer = AutoTokenizer.from_pretrained('./model/roberta-base-finetuned-cluener2020-chinese')
    my_model = AutoModelForTokenClassification.from_pretrained('./model/roberta-base-finetuned-cluener2020-chinese')
    my_config = AutoConfig.from_pretrained('./model/roberta-base-finetuned-cluener2020-chinese')
    # 2. Prepare the data and convert it to tensors
    text = '我爱北京天安门，天安门上太阳升'
    inputs = my_tokenizer.encode_plus(text, return_tensors='pt')
    print('inputs: ', inputs)
    # 3. Feed the tensors to the model
    my_model.eval()
    # logits is the main tensor returned by the model, used for token classification
    outputs = my_model(**inputs).logits
    # inputs already contains the special tokens, and they receive labels too,
    # so the ids are converted back to tokens to keep tokens and logits aligned
    tokens = my_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    print('tokens:', tokens)
    # 4. Collect the predictions
    output_list = []
    # Iterate over each token and its logit vector (a 1-D tensor, before argmax)
    for token, logit in zip(tokens, outputs[0]):
        # Skip special tokens
        if token in my_tokenizer.all_special_tokens:
            continue
        index = torch.argmax(logit, dim=-1).item()
        # Look up the label for the predicted id
        label = my_config.id2label[index]
        # Store the token with its label
        output_list.append((token, label))
    print(output_list)

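(IV) Transfer learning by fine-tuning

(1) Two types of transfer learning

Load the pretrained model directly to obtain feature representations of the input text, then attach a custom network and fine-tune it to produce the output
Fine-tune the pretrained model with a fine-tuning script for a given task type, followed by a predefined network with an output head that produces the output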

(2) Chinese text classification

<I> Imports

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # the AdamW formerly exported by transformers is deprecated
from datasets import load_dataset  # used to load the data
from transformers import BertModel, BertTokenizer
import time
import shutup

<II> Loading the tokenizer and the model

shutup.please()  # suppress noisy warnings

# Select the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the tokenizer
my_tokenizer = BertTokenizer.from_pretrained('./model/bert-base-chinese')
# Load the pretrained model; the downstream task model (my_model) is defined separately
bert_model = BertModel.from_pretrained('./model/bert-base-chinese')

<III> Loading the data

# 3. Load the data with load_dataset
def file2dataset():
    '''
    Three ways to call load_dataset
    Case 1: data_files is a dict that maps each split name to its data file
    Case 2: data_files is a single file path; the data is then always exposed under split='train', so the split name carries no meaning
    Case 3: use data_dir instead of data_files; a dataset dict is returned and the data is looked up by key, e.g. dataset['train']
    '''
    my_files = {
        'train': './data/train.csv',
        'test': './data/test.csv',
        'valid': './data/validation.csv',
    }
    # Training set; load_dataset takes three arguments: file format, file paths, split
    train_dataset = load_dataset('csv', data_files=my_files, split='train')
    print(train_dataset[0])
    # Test set
    test_dataset = load_dataset('csv', data_files=my_files, split='test')
    # Validation set
    valid_dataset = load_dataset('csv', data_files=my_files, split='valid')
    return train_dataset, test_dataset, valid_dataset

<IV> Normalizing the samples of a batch

# 4. Custom collate function that normalizes one batch of data
def collate_fn(data):
    # data = [{text: xxx, label: 1}, {...}, ...]
    # Main purpose: bring all sentences to the same standard length
    sents = [i['text'] for i in data]
    labels = [i['label'] for i in data]
    inputs = my_tokenizer.batch_encode_plus(
        sents,
        truncation=True,  # truncate long sentences
        padding='max_length',  # pad short sentences
        max_length=500,
        return_tensors='pt',  # return tensors; the default is lists
        return_length=True,  # also return the lengths
    )
    input_ids = inputs['input_ids']
    token_type_ids = inputs['token_type_ids']
    attention_mask = inputs['attention_mask']
    labels = torch.LongTensor(labels)
    return input_ids, token_type_ids, attention_mask, labels

<V> Getting the dataloader

# 5. Build the dataloader and check one batch
def get_dataloader():
    train_dataset = load_dataset('csv', data_files='./data/train.csv', split='train')
    my_dataloader = DataLoader(
        train_dataset,
        batch_size=8,  # 8 samples per batch
        shuffle=True,  # shuffle the data
        collate_fn=collate_fn,  # collate function that unifies sentence length
        drop_last=True,  # drop the last batch if it is incomplete
    )
    # Fetch one batch with next(iter(...))
    input_ids, token_type_ids, attention_mask, labels = next(iter(my_dataloader))
    # print(input_ids)
    # print(token_type_ids)
    # print(labels)
    return my_dataloader

<VI> Custom downstream model

# 6. Custom downstream model
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single linear layer adapts the model to the downstream task
        self.linear = nn.Linear(768, 2)

    def forward(self, input_ids, token_type_ids, attention_mask):  # all arguments come from the encoding step
        # Do not update the parameters of the pretrained model
        with torch.no_grad():
            # The output holds two values: token vectors and a sentence vector
            bert_output = bert_model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )
        # No softmax here; CrossEntropyLoss applies it during training
        output = self.linear(bert_output.pooler_output)
        return output

<VII> Training the model

# 7. Train the model
def train_model():
    # 1. Prepare the ingredients: model, loss function, optimizer, data
    my_model = MyModel().to(device)
    # The frozen BERT model must live on the same device as the inputs
    bert_model.to(device)
    # Freeze the BERT parameters
    for param in bert_model.parameters():
        param.requires_grad_(False)
    # CrossEntropyLoss applies softmax internally
    my_loss_fn = nn.CrossEntropyLoss(reduction='mean')
    my_adamw = AdamW(my_model.parameters(), lr=3e-4)
    # Get the data
    my_dataloader = get_dataloader()
    # 2. Start training
    my_model.train()
    epochs = 3
    for epoch_idx in range(epochs):
        # Record the start time
        start_time = time.time()
        for i, (input_ids, token_type_ids, attention_mask, labels) in enumerate(my_dataloader, start=1):
            input_ids = input_ids.to(device)
            token_type_ids = token_type_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)
            # Five steps: forward pass, loss, backward pass, parameter update, gradient reset
            output = my_model(input_ids, token_type_ids, attention_mask)
            loss = my_loss_fn(output, labels)
            loss.backward()
            my_adamw.step()
            my_adamw.zero_grad()
            # Log every few steps
            if i % 2 == 0:
                # Predicted class index via argmax
                tem = torch.argmax(output, dim=-1)
                # Accuracy: correct predictions / total
                acc = (tem == labels).sum().item() / len(labels)
                use_time = time.time() - start_time
                print('epoch %d, step %d, loss %.2f, accuracy %.2f, time %d' % (
                    epoch_idx + 1, i, loss.item(), acc, use_time))
        # Save the model after every epoch (the ./save directory must exist)
        torch.save(my_model.state_dict(), './save/classify_%d.bin' % (epoch_idx + 1))

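<VIII> Evaluating the model

# 8. Evaluate the model
def ceshi_model():
    # 1. Prepare the ingredients: the model and the data are required, a loss is optional
    test_dataset = load_dataset('csv', data_files='./data/test.csv', split='train')
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=8,
        shuffle=False,  # order does not matter for evaluation
        collate_fn=collate_fn,
        drop_last=False,  # keep the last batch so that every sample is counted
    )
    my_model = MyModel().to(device)
    bert_model.to(device)  # keep the frozen BERT model on the same device as the inputs
    my_model.load_state_dict(torch.load('./save/classify_3.bin', map_location=device))
    # 2. Start testing
    correct = 0  # number of correctly predicted samples
    total = 0  # total number of samples
    # Switch the model to evaluation mode
    my_model.eval()
    # One pass over the data is enough
    for i, (input_ids, token_type_ids, attention_mask, labels) in enumerate(test_dataloader, start=1):
        input_ids = input_ids.to(device)
        token_type_ids = token_type_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        # Run inference without tracking gradients
        with torch.no_grad():
            output = my_model(input_ids, token_type_ids, attention_mask)
        temp = torch.argmax(output, dim=-1)
        # Add the correct and total counts of this batch to correct and total
        correct += (temp == labels).sum().item()
        total += len(labels)
        # Log
        if i % 2 == 0:
            print('running acc:', correct / total)
            text = my_tokenizer.decode(input_ids[0], skip_special_tokens=True)
            print('first text of the current batch:', text)
            print('model prediction:', temp[0])
            print('true label:', labels[0])
    print('overall acc:', correct / total)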
