Applications of Statistical Language Models in Complex Network Community Detection and Definition Modeling

Modeling human language is at the frontier of machine learning and artificial intelligence. Statistical language models are probabilistic models that assign probabilities to sequences of words. Topic models, for example, are widely used text-mining tools that organize large collections of unstructured documents by uncovering their thematic structure. More recent advances in neural language models have made complex tasks such as machine translation, dialogue generation, and abstractive summarization possible. Statistical language models can also be applied to other types of data; n-gram models, for instance, are used in DNA sequence analysis.

Complex network community detection is an important area of research, with both deterministic and probabilistic algorithms. Many community detection algorithms require prior knowledge of the number of communities in the network or need multiple trials to find the best partition. Moreover, existing probabilistic community detection algorithms are usually relatively simple in structure; while robust, such algorithms can misrepresent the network's statistical properties and therefore fail to produce optimal community partitions.

Definition modeling automates natural language definition generation by training on dictionary data to extract the word senses encapsulated in word embeddings. For polysemous words, existing models rely on auxiliary information, such as contextual usage of the target word, to generate multi-sense definitions. Another common issue with current definition models is the lack of correctness and fluency in their outputs.

This dissertation is an empirical study of statistical language models applied to the complex network community detection problem and the definition modeling task. We address the aforementioned shortcomings of existing community detection algorithms by generating random-walk pseudo documents and applying a Bayesian nonparametric topic model to reveal the topic, or community, structures. We devote two chapters to the definition modeling task. In the work on multi-sense definition modeling, we enable the model to output sense-specific definitions without using auxiliary information, and we introduce novel techniques for matching multi-sense embeddings with ground-truth definitions during training. In the other chapter, we combine the Transformer model with reinforcement learning techniques to improve the quality of definition model outputs, achieving significant improvements in output correctness and fluency.
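
To make the community detection approach more concrete, the following is a minimal sketch of the general idea: generate random-walk pseudo documents from a graph and fit a Bayesian nonparametric (Hierarchical Dirichlet Process) topic model, whose inferred topics act as community assignments without fixing the number of communities in advance. The tooling (networkx for the toy graph, gensim's HdpModel for the topic model), the toy graph itself, and the walk length and walk count are illustrative assumptions, not the dissertation's actual implementation.

    # Sketch: random-walk pseudo documents + HDP topic model for community detection.
    # Assumed tooling: networkx and gensim; parameters are illustrative only.
    import random
    import networkx as nx
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    def random_walk(graph, start, length):
        """Return the list of node labels visited by a simple random walk."""
        walk = [start]
        for _ in range(length - 1):
            neighbors = list(graph.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return walk

    # Toy graph: two cliques joined into a ring, i.e. two planted communities.
    graph = nx.connected_caveman_graph(2, 6)

    # Each pseudo document is one random walk, with node labels as "words".
    pseudo_docs = [
        [str(node) for node in random_walk(graph, start, length=20)]
        for start in graph.nodes()
        for _ in range(10)  # several walks per starting node
    ]

    # Fit an HDP topic model; the number of topics (communities) is inferred.
    dictionary = Dictionary(pseudo_docs)
    corpus = [dictionary.doc2bow(doc) for doc in pseudo_docs]
    hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=0)

    # Assign each node to the topic (community) with the highest weight.
    for node in graph.nodes():
        topics = hdp[dictionary.doc2bow([str(node)])]
        if topics:
            best_topic = max(topics, key=lambda t: t[1])[0]
            print(f"node {node} -> community {best_topic}")

On this toy graph, nodes within the same clique should tend to receive the same dominant topic, which is the sense in which inferred topics over random-walk pseudo documents can serve as community labels.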
