
news 2025/7/2 23:43:30


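Contents

📚 Purpose
📚 Platform
📚 Tasks
- 🐇 Writing and debugging the program locally
  - 🥕 Code structure
  - 🥕 Implementation
- 🐇 Submitting and running the job on the cluster
  - 🥕 Compared with the local run, only the paths need to change
  - 🥕 After the changes, export a JAR via Export and check the Main-Class setting
  - 🥕 Enter the following commands in a terminal to submit the job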

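📚 Purpose

An inverted index stores, for full-text search, a mapping from each word to the places where it occurs in a document or a set of documents. Nearly every search engine that supports full-text indexing relies on this data structure. The goals of this lab are to implement an inverted index, to become fluent in submitting and running MapReduce programs on a cluster, and to deepen understanding of the MapReduce programming framework.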
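To make the data structure concrete before any MapReduce code, here is a minimal in-memory sketch in plain Java. It is not part of the lab code: the class name `InvertedIndexSketch`, the document names, and the document contents are made up for illustration, and stop-word removal is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory inverted index: word -> (document name -> term frequency).
// All names and sample data here are hypothetical.
class InvertedIndexSketch {

    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield an empty first token
                }
                // Increment the count of this word for this document.
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hadoop stores data");
        docs.put("b.txt", "hadoop processes data with hadoop jobs");
        build(docs).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

The MapReduce version in the rest of this post computes the same mapping, but spreads the counting over map tasks and the merging over reduce tasks.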

📚 Platform

Operating system: Linux
Hadoop version: 3.2.2
JDK version: 1.8
Java IDE: Eclipse

📚 Tasks

About inverted indexes

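🐇 Writing and debugging the program locally

In Eclipse on the local machine, write an inverted-index program with term frequencies for English documents. The program must remove stop words (such as a, an, the, in, of) and count how often each word occurs in each document. The document data and the stop-word list can be downloaded from the provided link. Write and debug the program in a pseudo-distributed environment.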
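The text-cleaning requirement can be tried out without Hadoop. The following plain-Java sketch mirrors what the mapper later in this post does to each line: lowercase it, turn every character other than a-z and 0-9 into a space, split on whitespace, and drop stop words. The class name `TokenizeSketch` and the sample sentence are assumptions for illustration, and the stop-word set is only the five examples named above rather than the real list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper showing the per-line cleaning steps outside MapReduce.
class TokenizeSketch {

    static List<String> tokens(String line, Set<String> stopWords) {
        // Keep only lowercase letters and digits; everything else becomes a space.
        String cleaned = line.toLowerCase().replaceAll("[^a-z0-9]", " ");
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(cleaned);
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            if (!stopWords.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the", "in", "of"));
        // prints [art, war, 13, chapters]
        System.out.println(tokens("The Art of War, in 13 chapters.", stopWords));
    }
}
```

A `HashSet` is used here because membership tests are constant time; the lab code below keeps the stop words in a `Vector`, which also works but scans the list for every word.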


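🥕 Code structure

- Map(): splits the input Text into words. The Mapper here consists of setup() and map(). setup() runs once per map task, before any call to map(), and loads the stop-word list; map() then skips the stop words, which should not be counted.
- Combine(): sums the values of intermediate Map outputs that share the same key, which reduces the amount of data sent to the Reduce nodes.
- Partition(): to send all key-value pairs of the same word to the same Reduce node, the key is handled specially. The original key (word, filename) is temporarily split so that the Partitioner chooses the Reduce node from the word alone. Partitioning is hash-based.
- Reduce(): relies on the fact that the keys each Reducer receives are sorted by word, and does the final merge. It splits word#filename, joins the filename with its summed count, and appends the result to str. On each call it compares the current word with the previous one: if they are the same, the filename and count are appended to str; otherwise it emits key: word, value: str, and continues with the new word as the key.
- The reduce() above only processes and emits the previous word when it meets a new one, so the last word needs extra handling: override cleanup() to process and emit the last word.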
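The "emit on word change, flush at the end" pattern described for Reduce() and cleanup() can be sketched in plain Java over an already sorted list of keys. This is a simplified stand-in, not the reducer itself: the class name `ReduceMergeSketch` and the sample keys are hypothetical, and it assumes each `word#filename` key appears once with its count already summed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the reduce-side merge over sorted word#filename keys.
class ReduceMergeSketch {

    // keys must be sorted, as the shuffle guarantees; counts[i] belongs to keys[i].
    static List<String> merge(String[] keys, int[] counts) {
        List<String> out = new ArrayList<>();
        String lastWord = null;
        StringBuilder postings = new StringBuilder();
        int total = 0;
        for (int i = 0; i < keys.length; i++) {
            String[] parts = keys[i].split("#"); // parts[0] = word, parts[1] = filename
            if (lastWord != null && !parts[0].equals(lastWord)) {
                // A new word starts: emit the record of the previous word.
                out.add(lastWord + "\t" + postings + "<total," + total + ">.");
                postings.setLength(0);
                total = 0;
            }
            lastWord = parts[0];
            postings.append("<").append(parts[1]).append(",").append(counts[i]).append(">;");
            total += counts[i];
        }
        if (lastWord != null) {
            // Equivalent of cleanup(): the last word has no successor to trigger its output.
            out.add(lastWord + "\t" + postings + "<total," + total + ">.");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"data#a.txt", "data#b.txt", "hadoop#b.txt"};
        int[] counts = {1, 2, 3};
        merge(keys, counts).forEach(System.out::println);
    }
}
```

Without the final flush after the loop, the record for `hadoop` would never be written, which is exactly why the real reducer overrides cleanup().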

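In the inverted-index job, the Map, Combiner and Partitioner stages line up as follows:

Each Map has one Combiner, which performs a first local merge of the Map output.
Each Combiner in turn has one Partitioner, which sends all key-value pairs of the same word to the same Reduce node.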
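Why partitioning on the word alone keeps one word on one Reduce node can be checked with a few lines of plain Java. The arithmetic below is the one Hadoop's `HashPartitioner` uses, `(hashCode & Integer.MAX_VALUE) % numReduceTasks`, but it is applied to `String.hashCode()` as a stand-in for `Text.hashCode()`, so the partition numbers will not match a real job. The class name `WordPartitionSketch` and the sample keys are assumptions.

```java
// Hypothetical illustration of partitioning a composite key by its word part only.
class WordPartitionSketch {

    static int partition(String compositeKey, int numReduceTasks) {
        String word = compositeKey.split("#")[0]; // drop "#filename"
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Both keys share the word "hadoop", so they print the same partition number.
        System.out.println(partition("hadoop#a.txt", reducers));
        System.out.println(partition("hadoop#b.txt", reducers));
    }
}
```

If the full `word#filename` string were hashed instead, the two keys could land on different reducers and the per-word total could not be computed in one place.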

🥕 Implementation

(Pay attention to the local paths.)

package index;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import java.util.Vector;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class index {

    public static class Map extends Mapper<Object, Text, Text, IntWritable> {

        Vector<String> stop_words; // stop-word list

        /**
         * setup(): reads the stop-word list into the vector stop_words.
         * Runs once per map task, before the first call to map().
         */
        protected void setup(Context context) throws IOException {
            stop_words = new Vector<String>();
            Configuration conf = context.getConfiguration();
            // Local (pseudo-distributed) path of the stop-word file.
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    FileSystem.get(conf).open(new Path("hdfs://localhost:9000/user/hadoop/input/stop_words_eng.txt"))));
            String line;
            while ((line = reader.readLine()) != null) { // process line by line
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) { // store every token in the vector
                    stop_words.add(itr.nextToken());
                }
            }
            reader.close();
        }

        /**
         * map(): splits the input Text into words.
         * Input:  key: offset of the current line    value: content of the current line
         * Output: key: word#filename                 value: 1
         */
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String fileName = fileSplit.getPath().getName(); // name of the file this split belongs to
            String line = value.toString().toLowerCase();    // lowercase the whole line
            // Keep only digits and lowercase letters; every other character becomes a space.
            StringBuilder newLine = new StringBuilder();
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z')) {
                    newLine.append(c);
                } else {
                    newLine.append(' ');
                }
            }
            StringTokenizer strToken = new StringTokenizer(newLine.toString().trim()); // split on whitespace
            while (strToken.hasMoreTokens()) {
                String str = strToken.nextToken();
                if (!stop_words.contains(str)) { // emit a key-value pair unless it is a stop word
                    context.write(new Text(str + "#" + fileName), new IntWritable(1));
                }
            }
        }
    }

    public static class Combine extends Reducer<Text, IntWritable, Text, IntWritable> {
        /**
         * Sums the values of intermediate Map outputs with the same key to reduce
         * the amount of data sent to the Reduce nodes.
         * Input:  key: word#filename    value: partial counts
         * Output: key: word#filename    value: sum (term frequency)
         */
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                // Add the value rather than counting elements: a combiner may run
                // more than once, so its input is not guaranteed to be all 1s.
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static class Partition extends HashPartitioner<Text, IntWritable> {
        /**
         * To send all key-value pairs of the same word to the same Reduce node,
         * the key (word, filename) is split and only the word is hashed.
         */
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // numReduceTasks is the number of partitions, i.e. the number of Reducers.
            String term = key.toString().split("#")[0]; // the word in word#filename
            return super.getPartition(new Text(term), value, numReduceTasks);
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, Text> {
        /**
         * Relies on the keys arriving sorted by word and does the final merge.
         * Splits word#filename, joins filename and count, and appends them to str.
         * When the word changes, emits key: word, value: str and starts over.
         * Input:
         *         key                  value
         *    word1#filename1        [num1,num2,...]
         *    word1#filename2        [num1,num2,...]
         *    word2#filename1        [num1,num2,...]
         * Output:
         *    key: word   value: <filename1,count>;<filename2,count>;...<total,total count>.
         */
        private String lastfile = null; // previous filename
        private String lastword = null; // previous word
        private String str = "";        // value being assembled for output
        private int count = 0;
        private int totalcount = 0;

        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            String[] tokens = key.toString().split("#"); // tokens[0] = word, tokens[1] = filename
            if (lastword == null) {
                lastword = tokens[0];
            }
            if (lastfile == null) {
                lastfile = tokens[1];
            }
            if (!tokens[0].equals(lastword)) { // new word: finish and emit the previous one
                str += "<" + lastfile + "," + count + ">;<total," + totalcount + ">.";
                context.write(new Text(lastword), new Text(str));
                lastword = tokens[0];
                lastfile = tokens[1];
                count = 0;
                str = "";
                for (IntWritable val : values) { // count for the new word and file
                    count += val.get();
                }
                totalcount = count;
                return;
            }
            if (!tokens[1].equals(lastfile)) { // same word, new document
                str += "<" + lastfile + "," + count + ">;";
                lastfile = tokens[1];
                count = 0;
                for (IntWritable value : values) {
                    count += value.get();
                }
                totalcount += count;
                return;
            }
            // Remaining case (the very first key): just accumulate.
            for (IntWritable val : values) {
                count += val.get();
                totalcount += val.get();
            }
        }

        /**
         * reduce() only emits the previous word when it meets a new one, so the
         * last word is emitted here instead.
         */
        public void cleanup(Context context) throws IOException, InterruptedException {
            if (lastword != null) { // a reducer that received no keys has nothing to emit
                str += "<" + lastfile + "," + count + ">;<total," + totalcount + ">.";
                context.write(new Text(lastword), new Text(str));
            }
            super.cleanup(context);
        }
    }

    public static void main(String args[]) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        if (args.length != 2) {
            System.err.println("Usage: index <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "InvertedIndex");
        job.setJarByClass(index.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setPartitionerClass(Partition.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);          // Mapper output key type
        job.setMapOutputValueClass(IntWritable.class); // Mapper output value type
        job.setOutputKeyClass(Text.class);             // Reducer output key type
        job.setOutputValueClass(Text.class);           // Reducer output value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1); // true: print job and task progress
    }
}

Note: when you run a newly created package and class, an error may appear (it mainly shows up with the input and output of MapReduce programs).

Fix:

Under "Run As", choose "Run Configurations…".

Then enter input output under "Arguments" and run again.

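🐇 Submitting and running the job on the cluster

The cluster server address is 10.102.0.198, the user home directory is /home/<username>, and the HDFS directory is /user/<username>. The lab documents on the cluster are stored in hdfs://10.102.0.198:9000/input/, and the English stop-word file is at hdfs://10.102.0.198:9000/stop_words/stop_words_eng.txt.

🥕 Compared with the local run, only the paths need to change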
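The two places where the local address appears in the code are `fs.defaultFS` in main() and the stop-word path in setup(). A small plain-Java sketch of the two sets of values, using only the addresses given in this post, is shown below. The class `ClusterPathsSketch` and its `onCluster` flag are hypothetical and only illustrate which strings change; the lab code itself simply has the literals edited by hand.

```java
import java.net.URI;

// Hypothetical helper listing the local and cluster locations side by side.
class ClusterPathsSketch {

    // Value for fs.defaultFS.
    static String defaultFs(boolean onCluster) {
        return onCluster ? "hdfs://10.102.0.198:9000" : "hdfs://localhost:9000";
    }

    // Full location of the stop-word file read in setup().
    static URI stopWords(boolean onCluster) {
        String path = onCluster
                ? "/stop_words/stop_words_eng.txt"
                : "/user/hadoop/input/stop_words_eng.txt";
        return URI.create(defaultFs(onCluster) + path);
    }

    public static void main(String[] args) {
        System.out.println(stopWords(false)); // local, pseudo-distributed run
        System.out.println(stopWords(true));  // cluster run
    }
}
```

Note that the stop-word file sits in a different directory on the cluster, so changing the host alone is not enough.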

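🥕 After the changes, export a JAR via Export and check the Main-Class setting

Right-click index.java and choose Export.
Select JAR file and click Next.
Make sure index and its src folder are selected. Name the JAR after the class for consistency: here the file is index.java, the class is index, so the JAR is index.jar. Then click Next.
On the following page, click Next again.

Next to Main class, click Browse and select index.

Finally click Finish to complete the export; index.jar can then be found in the target folder. Open index.jar and check in META-INF/MANIFEST.MF that Main-Class has been set correctly.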
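The Main-Class check can also be done programmatically with the JDK's `java.util.jar` classes. The sketch below parses manifest text held in a string so that it runs without a JAR on disk; the class name `ManifestCheckSketch` and the sample manifest are assumptions. The expected value `index.index` follows from the code above (package `index`, class `index`).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper that reads the Main-Class attribute from manifest text.
class ManifestCheckSketch {

    static String mainClass(String manifestText) {
        try {
            Manifest mf = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            // Returns null when the manifest has no Main-Class entry.
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample manifest text; a real one is stored at META-INF/MANIFEST.MF inside the JAR.
        String text = "Manifest-Version: 1.0\nMain-Class: index.index\n\n";
        System.out.println(mainClass(text)); // prints index.index
    }
}
```

For an exported file, the same attribute can be read through `new java.util.jar.JarFile("index.jar").getManifest()`. If Main-Class is missing, `hadoop jar` needs the class name as an extra argument after the JAR name.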

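🥕 Enter the following commands in a terminal to submit the job

Use scp InvertedIndex.jar <username>@10.102.0.198:/home/<username> to copy the local program to the Hadoop cluster (the JAR exported in this post is named index.jar).
Use ssh <username>@10.102.0.198 to log in to the Hadoop cluster remotely.
Use hadoop jar InvertedIndex.jar /input /user/<username>/output to run the Hadoop job on the cluster, with the output directory set to output under your own HDFS directory.
Use diff to compare your output with the reference output.

scp index.jar bigdata_STUDENT_ID@10.102.0.198:/home/bigdata_STUDENT_ID
ssh bigdata_STUDENT_ID@10.102.0.198
hadoop jar index.jar /input /user/bigdata_STUDENT_ID/output
diff <(hdfs dfs -cat /output/part-r-00000) <(hdfs dfs -cat /user/bigdata_STUDENT_ID/output/part-r-00000)

Open http://10.102.0.198:8088 in a browser to see the basic execution status of jobs on the cluster.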