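摘要 (Abstract)

With the development of the Internet and the spread of smart devices, online public opinion has a growing influence, and enterprises and government agencies pay increasing attention to its application and management. The first task in applying and managing online public opinion is to extract key information from public opinion data, that is, topic extraction.

Current topic extraction methods are mainly based on probabilistic topic models, which extract the topics of a text from the probability distributions between topics and terms and between terms and texts. However, probabilistic topic models do not fully consider the semantic relevance between words and topics in a text. This paper uses machine learning to extract topics from online public opinion and defines topic extraction as a multi-label classification problem over text topics (that is, text categories).

For measuring the similarity of text data, this paper proposes a text semantic similarity method based on Baidu Baike annotation information. First, the text is segmented into terms. Then, an improved TF-IDF method computes weights for the words in the Baidu Baike entry corresponding to each term, each entry is converted into a weight vector over its words, and the cosine similarity of these vectors gives the similarity between terms. Finally, the similarity between texts is computed from a similarity matrix built from the term similarity values. Experiments on the Words-240 data set show that the text semantic similarity based on Baidu Baike annotation information is highly correlated with manually labeled results.

For multi-label classification of text data, this paper designs a kernel extreme learning machine multi-label classification method based on label relationships. The method learns positive and negative relationships between labels from their co-occurrence and non-co-occurrence distributions, and then uses the learned relationships to optimize the classification predictions of the kernel extreme learning machine. To verify its effectiveness, experiments were carried out on data sets including Zhihu, Yeast, Image, Scene, Emotions, and Cal500. The results show that the kernel extreme learning machine multi-label classification algorithm based on label relationships outperforms the other compared methods on all four metrics: accuracy, precision, recall, and F1 score.

Keywords: online public opinion, topic extraction, semantic analysis, multi-label classification, kernel extreme learning machine

ABSTRACT

With the development of the Internet and the popularization of smart devices, the influence of online public opinion is growing, and enterprises and government agencies pay more and more attention to its application and management. In the application and management of online public opinion, the first task is to extract key information from public opinion data, which is also called topic extraction.

Current topic extraction methods are mainly based on probabilistic topic models, which use the probability distributions between topics and terms and between terms and texts to extract text topics. However, probabilistic topic models do not fully consider the semantic relevance between terms and topics in the text. This paper uses machine learning to extract topics in online public opinion, and defines the topic extraction problem as a multi-label classification problem over text topics (text categories).

In terms of similarity measurement of text data, this paper proposes a method of text semantic similarity calculation based on Baidu Baike annotation information. First, the text is preprocessed by word segmentation and other steps.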
Then, the improved TF-IDF method is applied to calculate the weights of the words in the Baidu Baike entries corresponding to the terms. The entries are transformed into weight vectors of words, and the similarity between terms is calculated as the cosine similarity of their entry vectors. Finally, text similarity is calculated from a similarity matrix based on the similarity values between terms. The experimental results on the Words-240 data set show that the text semantic similarity based on Baidu Baike annotation information is highly correlated with the results of manual tagging.

In the multi-label classification of text data, this paper designs a multi-label classification method for the Kernel Extreme Learning Machine based on label relationships. This method learns the positive and negative relationships among labels according to their co-occurrence and non-co-occurrence distributions. The learned label relationships are then used to optimize the classification predictions of the Kernel Extreme Learning Machine. To verify the validity of this method, experiments are carried out on several real-world data sets, namely Zhihu, Yeast, Image, Scene, Emotions, and Cal500. The experimental results show that the multi-label classification algorithm of the Kernel Extreme Learning Machine based on label relationships is superior to the other compared methods in accuracy, precision, recall, and F1 score.

KEY WORDS: Online public opinion, Topic extraction, Semantic analysis, Multi-label classification, Kernel Extreme Learning Machine

Contents

摘要 (Abstract)
ABSTRACT
Chapter 1 Introduction
  1.1 Research Background and Significance
  1.2 Research Status at Home and Abroad
  1.3 Main Research Problems and Approach
  1.4 Organization of This Paper
Chapter 2 Overview of Related Work
  2.1 Topic Extraction
    2.1.1 Probabilistic Topic Models
    2.1.2 Machine Learning-Based Topic Extraction
  2.2 Text Similarity Calculation
    2.2.1 Statistics-Based Methods
    2.2.2 Semantics-Based Methods
    2.2.3 Hybrid Methods
  2.3 Multi-Label Classification Methods
    2.3.1 Categories of Multi-Label Algorithms
    2.3.2 Extreme Learning Machine
  2.4 Chapter Summary
Chapter 3 Text Semantic Similarity Measurement Based on Baidu Baike Annotation Information
  3.1 Data Preprocessing
  3.2 Word Similarity Calculation
    3.2.1 Baidu Baike
    3.2.2 Word Weight Calculation with Improved TF-IDF
    3.2.3 Word Semantic Similarity Calculation
  3.3 Text Similarity Calculation
  3.4 Experiments and Result Analysis
    3.4.1 Experimental Environment
    3.4.2 Data Acquisition
    3.4.3 Semantic Similarity Results
  3.5 Chapter Summary
Chapter 4 Multi-Label Topic Classification Method Based on Extreme Learning Machine
  4.1 Kernel Extreme Learning Machine
  4.2 Multi-Label Relationship Learning