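Table of Contents
- Sensitive Word Filtering
- Approach 1: Regular Expressions
- Approach 2: sensitive-word, a DFA-Based Sensitive Word Filtering Framework
- Integrating sensitive-word with Spring Boot
- Step 1: Add the Maven Dependency
- Step 2: Custom Configuration
- Step 3: Custom Sensitive Words and Allowlist
- Step 4: Testing the Core Methods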
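Sensitive Word Filtering
Sensitive word filtering means detecting words in a text that are considered inappropriate, offensive, or in violation of a community's guidelines, and then removing or replacing them. It is commonly used on online platforms, forums, social media, and chat systems to keep the conversation environment healthy and constructive.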
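Approach 1: Regular Expressions
Regular expressions can implement sensitive word filtering, but this approach only suits cases with few sensitive words and small volumes of text, and it cannot handle homophones, misspellings, and similar variants. Example: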
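import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSensitiveWordFilter {

    public static void main(String[] args) {
        String text = "這是一個包含敏感詞匯的文本,例如色情、賭博等。";
        String[] sensitiveWords = {"色情", "賭博"};
        for (String word : sensitiveWords) {
            text = filterSensitiveWords(text, word);
        }
        System.out.println("Filtered text: " + text);
    }

    /**
     * Approach 1: sensitive word filtering with a regular expression.
     * Only suitable for few sensitive words and small volumes of text;
     * cannot handle homophones, misspellings, and similar variants.
     *
     * @param text          the text to filter
     * @param sensitiveWord the word to mask
     * @return the text with every occurrence of the word replaced by "***"
     */
    public static String filterSensitiveWords(String text, String sensitiveWord) {
        // Quote the word so that regex metacharacters in it are matched literally.
        Pattern pattern = Pattern.compile(Pattern.quote(sensitiveWord));
        Matcher matcher = pattern.matcher(text);
        return matcher.replaceAll("***");
    }
}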
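Compiling one pattern per word and rescanning the text for each of them gets wasteful as the list grows. A minimal sketch of one possible refinement, assuming a small fixed word list: join the quoted words into a single alternation, compile it once, and mask each match with as many asterisks as it has characters. The class name `CombinedRegexFilter`, the case-insensitive flags, and the masking rule are illustrative choices, not part of the original example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedRegexFilter {
    private final Pattern pattern;

    CombinedRegexFilter(List<String> words) {
        // Longest words first: regex alternation takes the first alternative
        // that matches, so "badword" must come before its prefix "bad".
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder alternation = new StringBuilder();
        for (String word : sorted) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each word so regex metacharacters are matched literally.
            alternation.append(Pattern.quote(word));
        }
        this.pattern = Pattern.compile(alternation.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Replaces every match with as many '*' characters as the match is long. */
    String filter(String text) {
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            StringBuilder mask = new StringBuilder();
            for (int i = 0; i < matcher.group().length(); i++) {
                mask.append('*');
            }
            // '*' has no special meaning in a replacement string.
            matcher.appendReplacement(out, mask.toString());
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        CombinedRegexFilter filter =
                new CombinedRegexFilter(Arrays.asList("bad", "badword"));
        System.out.println(filter.filter("a BadWord and bad")); // prints "a ******* and ***"
    }
}
```

This still shares the main limitation of the regex approach: it only finds literal spellings, so homophones and deliberate misspellings pass through.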
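Approach 2: sensitive-word, a DFA-Based Sensitive Word Filtering Framework
* A dictionary of 60,000+ words that is continuously refined and updated
* Based on the DFA algorithm, with good performance
* Built on a fluent API, concise and elegant to use
* Supports common operations such as checking for, returning, and masking sensitive words
* Supports full-width/half-width conversion
* Supports English upper/lowercase conversion
* Supports conversion between common number forms
* Supports Traditional/Simplified Chinese conversion
* Supports conversion between common English forms
* Supports user-defined sensitive words and allowlists
* Supports dynamic data updates that take effect in real time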
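To make "based on the DFA algorithm" more concrete, here is a minimal sketch of the general technique as it is usually described for sensitive word filtering: the word list is compiled into a trie whose nodes act as automaton states, and the text is scanned from left to right, following transitions character by character from each start position. This is not the sensitive-word library's implementation; `TrieWordFilter` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrieWordFilter {
    /** One automaton state: outgoing transitions plus an "accepting" flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if some word ends at this state
    }

    private final Node root = new Node();

    TrieWordFilter(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // an empty word would match everywhere
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
    }

    /** Length of the longest word that starts at index start, or 0 if none does. */
    private int matchLength(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break; // no transition: stop following this path
            }
            if (node.terminal) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                found.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return found;
    }

    boolean contains(String text) {
        return !findAll(text).isEmpty();
    }

    String replace(String text, char mask) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                for (int k = 0; k < len; k++) {
                    out.append(mask);
                }
                i += len;
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieWordFilter filter = new TrieWordFilter(Arrays.asList("bad", "badge", "evil"));
        System.out.println(filter.findAll("a badge is bad, not evil"));      // prints [badge, bad, evil]
        System.out.println(filter.replace("a badge is bad, not evil", '*')); // prints "a ***** is ***, not ****"
    }
}
```

Unlike the regex approach, the cost of each lookup depends on the length of the text and of the longest word, not on how many words the dictionary holds, which is why this structure scales to tens of thousands of entries.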
Integrating sensitive-word with Spring Boot
Step 1: Add the Maven Dependency
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.2.0</version>
</dependency>
Step 2: Custom Configuration
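import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.support.allow.WordAllows;
import com.github.houbb.sensitive.word.support.deny.WordDenys;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MySensitiveWordBs {

    @Autowired
    private MyWordAllow myWordAllow;

    @Autowired
    private MyWordDeny myWordDeny;

    @Autowired
    private MyWordReplace myWordReplace;

    /**
     * Initializes the bootstrap class.
     *
     * @return the initialized bootstrap instance
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                // .wordAllow(WordAllows.chains(WordAllows.defaults(), myWordAllow)) // built-in plus custom allowlist
                // .wordDeny(WordDenys.chains(WordDenys.defaults(), myWordDeny))     // built-in plus custom sensitive words
                .wordAllow(WordAllows.chains(myWordAllow)) // custom allowlist only
                .wordDeny(WordDenys.chains(myWordDeny))    // custom sensitive words only
                .wordReplace(myWordReplace)                // custom replacement rule
                .ignoreCase(true)          // ignore letter case
                .ignoreWidth(true)         // ignore half-width/full-width differences
                .ignoreNumStyle(true)      // ignore number styles
                .ignoreChineseStyle(true)  // ignore Chinese writing styles
                .ignoreEnglishStyle(true)  // ignore English writing styles
                .ignoreRepeat(true)        // ignore repeated characters
                .enableNumCheck(true)      // enable number detection; by default 8 consecutive digits count as sensitive
                .enableEmailCheck(true)    // enable email detection
                .enableUrlCheck(true)      // enable URL detection
                .init();
        return sensitiveWordBs;
    }
}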
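Options such as `ignoreCase` and `ignoreWidth` can be thought of as normalizing each character before it is matched against the dictionary. The following is a rough sketch of that idea only, not the library's code, and the library may well handle more cases: full-width ASCII variants (U+FF01 to U+FF5E) are mapped to their half-width forms by subtracting 0xFEE0, the ideographic space U+3000 becomes a regular space, and letters are lowercased. `CharNormalizer` is a hypothetical name.

```java
final class CharNormalizer {
    private CharNormalizer() {
    }

    /** Maps one character to its half-width, lowercase form. */
    static char normalize(char c) {
        if (c == '\u3000') {
            return ' '; // ideographic (full-width) space
        }
        if (c >= '\uFF01' && c <= '\uFF5E') {
            // Full-width ASCII variants sit exactly 0xFEE0 above their ASCII originals.
            c = (char) (c - 0xFEE0);
        }
        return Character.toLowerCase(c);
    }

    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(normalize(text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "HellO", an ideographic space, then full-width "123".
        String input = "\uFF28\uFF45\uFF4C\uFF4C\uFF2F\u3000\uFF11\uFF12\uFF13";
        System.out.println(normalize(input)); // prints "hello 123"
    }
}
```

Applying the same normalization to both the dictionary and the incoming text keeps the two sides comparable, so a word only needs to be listed once to catch its case and width variants.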
Step 3: Custom Sensitive Words and Allowlist
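import com.github.houbb.sensitive.word.api.IWordAllow;
import lombok.extern.slf4j.Slf4j;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Custom allowlist (non-sensitive words).
 * Note: each line holds one allowed word. A line must not consist only of spaces;
 * otherwise the spaces themselves are treated as an allowed word.
 */
@Component
@Slf4j
public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        List<String> allowWords = new ArrayList<>();
        try {
            ClassPathResource resource = new ClassPathResource("myAllowWords.txt");
            // Resolve the classpath resource to a file-system path and read one word per line.
            Path myAllowWordsPath = Paths.get(resource.getURL().toURI());
            allowWords = Files.readAllLines(myAllowWordsPath, StandardCharsets.UTF_8);
        } catch (IOException ioException) {
            log.error("Failed to read the allowlist file", ioException);
        } catch (URISyntaxException e) {
            throw new RuntimeException(e);
        }
        return allowWords;
    }
}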
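import com.github.houbb.sensitive.word.api.IWordDeny;
import lombok.extern.slf4j.Slf4j;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Custom sensitive words, one word per line.
 */
@Component
@Slf4j
public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        List<String> denyWords = new ArrayList<>();
        try {
            ClassPathResource resource = new ClassPathResource("myDenyWords.txt");
            // Resolve the classpath resource to a file-system path and read one word per line.
            Path myDenyWordsPath = Paths.get(resource.getURL().toURI());
            denyWords = Files.readAllLines(myDenyWordsPath, StandardCharsets.UTF_8);
        } catch (IOException ioException) {
            log.error("Failed to read the sensitive word file", ioException);
        } catch (URISyntaxException e) {
            throw new RuntimeException(e);
        }
        return denyWords;
    }
}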
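The note about whitespace-only lines suggests cleaning the word files while loading them. A minimal sketch under that assumption: read the list from an `InputStream`, trim each line, and skip blank ones. Reading from a stream instead of a `Path` should also keep working when the files are packaged inside a jar, where a classpath URL does not map to a regular file-system path. `WordListLoader` is a hypothetical helper, not part of sensitive-word or Spring.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class WordListLoader {
    private WordListLoader() {
    }

    /** Reads one word per line as UTF-8, trimming whitespace and skipping blank lines. */
    static List<String> load(InputStream in) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for a word file: two words, one spaces-only line, one empty line.
        byte[] data = "foo\n   \n bar \n\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(load(new ByteArrayInputStream(data))); // prints [foo, bar]
    }
}
```

In the Spring classes above, the stream could come from `new ClassPathResource("myDenyWords.txt").getInputStream()`, which would replace the `Paths.get(...)` and `Files.readAllLines(...)` pair.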
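import com.github.houbb.sensitive.word.api.IWordContext;
import com.github.houbb.sensitive.word.api.IWordReplace;
import com.github.houbb.sensitive.word.api.IWordResult;
import com.github.houbb.sensitive.word.utils.InnerWordCharUtils;
import org.springframework.stereotype.Component;

/**
 * Custom replacement values for sensitive words.
 * Use case: sometimes different sensitive words should get different replacements,
 * for example replacing "game" with "e-sports" and "unemployment" with "flexible employment".
 */
@Component
public class MyWordReplace implements IWordReplace {

    @Override
    public void replace(StringBuilder stringBuilder, final char[] rawChars,
                        IWordResult wordResult, IWordContext wordContext) {
        String sensitiveWord = InnerWordCharUtils.getString(rawChars, wordResult);
        if ("zhupeng".equals(sensitiveWord)) {
            stringBuilder.append("朱鵬");
        } else {
            // Every other word is masked with one '-' per character.
            int wordLength = wordResult.endIndex() - wordResult.startIndex();
            for (int i = 0; i < wordLength; i++) {
                stringBuilder.append('-');
            }
        }
    }
}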
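As the number of special cases grows, the `if`/`else` chain can be replaced by a lookup table. The sketch below shows only that lookup logic in plain Java, assuming the matched word is already available as a `String`. `ReplacementTable` and `replacementFor` are hypothetical names, and wiring the table into an `IWordReplace` implementation is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class ReplacementTable {
    private final Map<String, String> replacements;
    private final char mask;

    ReplacementTable(Map<String, String> replacements, char mask) {
        this.replacements = new HashMap<>(replacements); // defensive copy
        this.mask = mask;
    }

    /** Returns the configured replacement, or one mask character per character of the word. */
    String replacementFor(String word) {
        String mapped = replacements.get(word);
        if (mapped != null) {
            return mapped;
        }
        StringBuilder out = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            out.append(mask);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("zhupeng", "朱鵬");
        ReplacementTable replacements = new ReplacementTable(table, '-');
        System.out.println(replacements.replacementFor("zhupeng"));
        System.out.println(replacements.replacementFor("abc")); // prints "---"
    }
}
```

A table like this could be filled from a file or a database, so replacement rules can change without touching the code.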
Step 4: Testing the Core Methods
import com.github.houbb.sensitive.word.bs.SensitiveWordBs;
import com.github.houbb.sensitive.word.core.SensitiveWordHelper;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.regex.Pattern;

@RestController
public class SensitiveWordController {

    @Autowired
    private MyWordReplace myWordReplace;

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    private static final String text = "五星紅旗迎風飄揚,毛主席的畫像屹立在天安門前,zhuzhuhzu";

    /** Approach 1: regular expressions. */
    @GetMapping("/pattern")
    public void testSensitiveWord2() {
        String text = "這是一個包含敏感詞匯的文本,例如色情、賭博等。";
        String[] sensitiveWords = {"色情", "賭博"};
        for (String word : sensitiveWords) {
            text = filterSensitiveWords(text, word);
        }
        System.out.println("Filtered text: " + text);
    }

    /**
     * Approach 2: sensitive-word, a DFA-based sensitive word filtering framework:
     * https://github.com/houbb/sensitive-word
     */
    @GetMapping("/filter")
    public void testSensitiveWord() {
        System.out.println("SensitiveWordHelper.contains(text) = " + SensitiveWordHelper.contains(text));
        System.out.println("SensitiveWordHelper.findAll(text) = " + SensitiveWordHelper.findAll(text));
        System.out.println("SensitiveWordHelper.replace(text,myWordReplace) = " + SensitiveWordHelper.replace(text, myWordReplace));
        // With custom sensitive words, call SensitiveWordBs instead of the SensitiveWordHelper methods.
        System.out.println("sensitiveWordBs.contains(text) = " + sensitiveWordBs.contains(text));
        System.out.println("sensitiveWordBs.findAll(text) = " + sensitiveWordBs.findAll(text));
        System.out.println("sensitiveWordBs.replace(text) = " + sensitiveWordBs.replace(text));
    }

    /** Regex-based replacement from Approach 1; the word is quoted so it is matched literally. */
    private static String filterSensitiveWords(String text, String sensitiveWord) {
        return Pattern.compile(Pattern.quote(sensitiveWord)).matcher(text).replaceAll("***");
    }
}