Eedi LLM Distillation 02: Filling Missing Misconception Explanations with GPT-4o
Table of Contents
- 0. Competition Roundup for This Column
- 1. Purpose of This Article
- 2. AI Engineering Architecture
- 3. Data Preprocessing Module
- 3.1 Configure the Data Path and Processing Parameters
- 3.2 Configure the API Parameters
- 3.3 Configure the Output Paths
- 4. AI Parallel Processing Module
- 4.1 Define the LLM Client Class
- 4.2 Define the Row Processing Function
- 4.3 Define the JSON Saving Function
- 4.4 Define the Range Slicing Function
- 4.5 Define the Slice Pairing Function
- 4.6 Define the Filename Sorting Function
- 5. Data Integration Module
- 5.1 Load the Data and Generate Slices
- 5.2 Initialize the LLM Client and Test It
- 5.3 Generate Data with Parallel Processing
- 5.4 Merge the Processed Results
- 5.5 Save the Final Results
0. Competition Roundup for This Column
Kaggle Competition Roundup
1. Purpose of This Article
- In plain terms: the data exploration in the previous article showed that some training rows are missing their incorrect-answer explanations (misconceptions). Here we use GPT-4o with a persona system prompt to generate those missing explanations directly for the training set.
- Skills you can pick up from this article: calling an AI model through its API, a persona prompt-engineering case study, and non-trivial data processing with result caching.
- Previous article: Eedi LLM Distillation 01 – Competition Overview and Data Understanding
2. AI Engineering Architecture
3. Data Preprocessing Module
3.1 Configure the Data Path and Processing Parameters
```python
data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"

index_start = 0       # first row to process
index_end = len(df)   # last row to process; df is loaded in section 5.1
step = 100            # rows per batch (one cache file per batch)
max_workers = 2       # number of parallel worker processes
```
3.2 Configure the API Parameters

```python
model_config = dict(
    openai_api_base="https://testshellapi.kimi.asia/v1",
    api_key="****",  # redacted
    model="gpt-4o",
    default_system_prompt="""## Task
You are a Mathematics teacher. Your task is to reason and identify the ConstructName and SubjectName and then the misconception behind the user input Incorrect Answers with the Question.
ConstructName is the most granular level of knowledge related to the question; it describes the specific mathematical method or procedure used to solve it, i.e. the technique or approach needed to reach the answer.
SubjectName is more general context than the construct; it represents the broader mathematical topic or category that the question belongs to.
Misconceptions are mistakes in conceptual understanding, and they relate to all the applications of those concepts. For example, a single misconception about the connections among proportional relationships (part/whole, part/part, whole/part) can cause problems in identifying those patterns in drawings, and can be the cause of failing to realize all parts must be of equal size, therefore associating the denominator of the fraction with the total number of parts regardless of their size.
Answer concisely what misconception leads to the incorrect answer.
Do not use "The misconception is" to start your answers.
Do not mention the concrete details of the question or answers.

## User input
Question: The question text
A: multiple choice answer A text
B: multiple choice answer B text
C: multiple choice answer C text
D: multiple choice answer D text
Correct Answer: The correct answer text

## You should answer in the following JSON format
{"ConstructName": "here writes the ConstructName",
 "SubjectName": "here writes the SubjectName",
 "MisconceptionAName": "here writes the answer A's misconception.",
 "MisconceptionBName": "here writes the answer B's misconception.",
 "MisconceptionCName": "here writes the answer C's misconception.",
 "MisconceptionDName": "here writes the answer D's misconception."}""",
    default_temperature=0.5,
    max_tokens=256,
)
```
3.3 Configure the Output Paths

```python
import os

# Per-batch results are cached here so an interrupted run can resume.
cache_folder = f"./cache_{model_config['model']}_model_misconceptions_result"
os.makedirs(cache_folder, exist_ok=True)

# The final CSV name combines the input file name and the model name.
output_data_path = (
    f"misconception_data_{os.path.splitext(os.path.basename(data_path))[0]}"
    f"_{model_config['model']}.csv"
)
```
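As a quick sanity check of this naming scheme, the standalone snippet below shows what `os.path.splitext(os.path.basename(...))` produces for the configured path (the model name is hard-coded here for illustration):

```python
import os

data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
model = "gpt-4o"

# basename strips the directories, splitext separates the extension
stem = os.path.splitext(os.path.basename(data_path))[0]
output_data_path = f"misconception_data_{stem}_{model}.csv"

print(stem)              # MalAlgoQA_format
print(output_data_path)  # misconception_data_MalAlgoQA_format_gpt-4o.csv
```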
4. AI Parallel Processing Module
4.1 Define the LLM Client Class

```python
from openai import OpenAI

class LLMChat:
    def __init__(self, openai_api_base, api_key, model,
                 default_temperature, default_system_prompt, max_tokens=512):
        self.client = OpenAI(api_key=api_key, base_url=openai_api_base)
        self.model = model
        self.default_temperature = default_temperature
        self.default_system_prompt = default_system_prompt
        self.max_tokens = max_tokens

    def chat(self, user_prompt, system_prompt=None, temperature=None):
        if not system_prompt:
            system_prompt = self.default_system_prompt
        if temperature is None:  # `is None`, so temperature=0 is respected
            temperature = self.default_temperature
        chat_response = self.client.chat.completions.create(
            model=self.model,
            temperature=temperature,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=self.max_tokens,
            response_format={"type": "json_object"},  # request strict JSON output
        )
        return chat_response.choices[0].message.content
```
4.2 Define the Row Processing Function

```python
def process_row(args, debug=False):
    """Build the user prompt for one DataFrame row and query the model.

    Relies on the global LLM client `vc` (created in section 5.2) and on
    `model_config`. Returns the model's JSON string on success, or a dict
    with an 'error' key if the request failed.
    """
    user_prompt = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}"""
    index, row = args
    ca = row["CorrectAnswer"]
    correct_answer = row[f"Answer{ca}Text"]
    input_user_prompt = user_prompt.format(
        question=row["QuestionText"],
        answer_a=row["AnswerAText"],
        answer_b=row["AnswerBText"],
        answer_c=row["AnswerCText"],
        answer_d=row["AnswerDText"],
        correct_answer=correct_answer,
    )
    ret_data = {}
    try:
        ret_data = vc.chat(input_user_prompt)  # JSON string on success
        if debug:
            print(ret_data + "\n")
    except Exception as e:
        print(f"An exception occurred: {e}")
        ret_data["error"] = str(e)
    if debug:
        print("system: ", model_config["default_system_prompt"])
        print(">" * 50)
        print("user_input: ", input_user_prompt)
        print(">" * 50)
        print("assistant: ", ret_data)
    return ret_data
```
4.3 Define the JSON Saving Function

```python
import json

def save_json(fn, obj):
    with open(fn, "w") as f:
        json.dump(obj, f, ensure_ascii=False, indent=4)
    print(f"saved file to {fn}")
```
4.4 Define the Range Slicing Function

```python
def slice_range(start, end, step):
    """Return boundary points [start, start+step, ...], always ending at `end`."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    if result[-1] < end:
        result.append(end)  # make sure the final partial batch is kept
    return result
```
4.5 Define the Slice Pairing Function

```python
def process_pairs(sliced_range):
    """Turn boundary points into [start, end) index pairs for df.iloc slicing."""
    slices = []
    for first, second in zip(sliced_range, sliced_range[1:]):
        slices.append([first, second])
    return slices
```
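A quick standalone check of how these two helpers split, say, 250 rows into batches of 100 (the functions are re-declared here so the snippet runs on its own):

```python
def slice_range(start, end, step):
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    if result[-1] < end:
        result.append(end)
    return result

def process_pairs(sliced_range):
    return [[first, second] for first, second in zip(sliced_range, sliced_range[1:])]

boundaries = slice_range(0, 250, 100)
pairs = process_pairs(boundaries)
print(boundaries)  # [0, 100, 200, 250]
print(pairs)       # [[0, 100], [100, 200], [200, 250]]
```

Note the final pair `[200, 250]` is shorter than `step`, so the last partial batch is not dropped.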
4.6 Define the Filename Sorting Function

```python
import re

def natural_sort_key(filename):
    """Sort cache files by the integers embedded in their names."""
    parts = re.findall(r"\d+", filename)
    return tuple(map(int, parts))
```
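Plain string sorting would put `cache_res_1000.json` before `cache_res_200.json`, which would scramble the row order when the batches are concatenated; the numeric key fixes that. A standalone check with illustrative filenames:

```python
import re

def natural_sort_key(filename):
    return tuple(map(int, re.findall(r"\d+", filename)))

names = ["cache_res_1000.json", "cache_res_0.json", "cache_res_200.json"]

lexicographic = sorted(names)
numeric = sorted(names, key=natural_sort_key)
print(lexicographic)  # ['cache_res_0.json', 'cache_res_1000.json', 'cache_res_200.json']
print(numeric)        # ['cache_res_0.json', 'cache_res_200.json', 'cache_res_1000.json']
```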
5. Data Integration Module
5.1 Load the Data and Generate Slices

```python
import pandas as pd

df = pd.read_csv(data_path)
df.head()

sliced_range = process_pairs(slice_range(index_start, index_end, step))
```
A quick look at `df`:
5.2 Initialize the LLM Client and Test It

```python
vc = LLMChat(**model_config)

# Smoke-test on a single row before launching the full run.
r = process_row((7, df.iloc[7]), debug=True)
```
5.3 Generate Data with Parallel Processing

```python
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

for slices in tqdm(sliced_range, total=len(sliced_range)):
    output_filepath = f"{cache_folder}/cache_res_{slices[0]}.json"
    # Resume support: skip any batch that already has a cache file.
    if os.path.exists(output_filepath):
        print(f"cache file exists, skip {output_filepath}")
        continue
    df_tasks = df.iloc[slices[0]:slices[1]]
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(tqdm(executor.map(process_row, df_tasks.iterrows()),
                            total=len(df_tasks)))
    save_json(output_filepath, results)
```
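The batch-plus-cache pattern above is worth isolating: each batch writes one JSON file named after its start index, and a rerun skips finished batches instead of re-paying for API calls. Below is a minimal standalone sketch of that pattern, with a toy `work` function and a `ThreadPoolExecutor` standing in for the real LLM calls and `ProcessPoolExecutor`:

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x  # stand-in for the per-row LLM call

data = list(range(10))
batches = [[0, 4], [4, 8], [8, 10]]
cache_dir = tempfile.mkdtemp()

for _ in range(2):  # the second pass hits the cache and does no work
    for start, end in batches:
        path = os.path.join(cache_dir, f"cache_res_{start}.json")
        if os.path.exists(path):
            print(f"cache file exists, skip {path}")
            continue
        with ThreadPoolExecutor(max_workers=2) as ex:
            results = list(ex.map(work, data[start:end]))
        with open(path, "w") as f:
            json.dump(results, f)

# Merge the cache files, ordered by start index as section 5.4 does.
merged = []
for start, _ in batches:
    with open(os.path.join(cache_dir, f"cache_res_{start}.json")) as f:
        merged.extend(json.load(f))
print(merged)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```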
5.4 Merge the Processed Results

```python
import glob

f_names = glob.glob(f"{cache_folder}/*.json")
f_names = sorted(f_names, key=natural_sort_key)

results = []
for fn in f_names:
    with open(fn, "r") as f:
        batch_results = json.load(f)
    results.extend(batch_results)
l = len(results)

# Successful entries are JSON strings from the model; failed ones are
# already dicts with an 'error' key, so only parse the strings.
results = [json.loads(r) if isinstance(r, str) else r for r in results]
```
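A standalone illustration of that parsing step, with one simulated failure entry (the sample strings are made up):

```python
import json

results = [
    '{"ConstructName": "Adding fractions", "SubjectName": "Fractions"}',
    {"error": "timeout"},  # a row whose API call raised an exception
]

# Strings came back from the model and need json.loads;
# error dicts are passed through untouched.
parsed = [json.loads(r) if isinstance(r, str) else r for r in results]
print(parsed[0]["ConstructName"])  # Adding fractions
print(parsed[1])                   # {'error': 'timeout'}
```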
5.5 Save the Final Results

```python
# Trim df to the number of generated rows, then attach the new
# misconception columns side by side and write the final CSV.
df = df.iloc[:l]
gen_df = pd.DataFrame(results)
df = pd.concat([df, gen_df], axis=1)
df.to_csv(output_data_path, index=False)
```
(To be continued)