Mining Social Media for Healthcare Intelligence

Social media platforms such as Twitter have emerged as a powerful communication medium for disseminating news, personal interests, experiences, and opinions. On social media, people discuss their lifestyles, health conditions, and symptoms, search for information on treatment options, and connect with others who have been through similar medical experiences to get emotional support. Such health information, generated by patients or their family members, is not available in medical documents created by health care providers and became publicly available only recently with the widespread use of microblogging sites, which makes social media an invaluable source of health data to mine. However, social media posts are often short, unstructured, and written in colloquial language, and these characteristics pose many interesting research questions. In this thesis, we focused on mining public Twitter data for healthcare intelligence. We designed models based on bag-of-words and social network structure features that classify trending topics into general categories such as sports, technology, and health. These models could help identify trending topics and posts in the health domain and benefit information retrieval tasks by reducing the search space to a domain of interest. We also proposed a real-time digital disease surveillance system that uses spatial, temporal, and text mining techniques to track disease activity. This work was motivated by the fact that, while traditional disease surveillance systems require 1-2 weeks to collect and process data before it becomes publicly available, Twitter data is available in near real time, and aggregated social media data can indicate the overall health state of the general population earlier than traditional systems can. We further built a neural network model that combines Twitter data with observed data from the Centers for Disease Control and Prevention (CDC) to predict current and future influenza activity. Our system can serve as a proxy for early detection of pandemics, and the resulting insights are expected to facilitate faster response to and preparation for epidemics. We also investigated the use of clinical knowledge sources to train deep learning models for medical concept normalization, in which health conditions described in natural (colloquial) language are mapped to standard clinical terms. The proposed model can help an automated system interpret health concepts written in lay language. The studies presented in this thesis provide insights into the application of machine learning and text mining to social media data in the healthcare domain. We hope our work motivates further study of online user-generated data to gain meaningful healthcare insights.

Last modified
  • 02/15/2019