Identifying topics in user-generated content such as hotel reviews turns out to be difficult with the standard approach, LDA (Latent Dirichlet Allocation; Blei et al., 2003). Hotel reviews differ far less from one another in the topics they cover than texts from other genres such as Wikipedia or newsgroup articles, where each document typically contains only a very small set of topics.
We therefore developed our own approach to topic modeling that is specifically tailored to non-edited texts like hotel reviews. The approach can be divided into three major steps. First, using the concept of second-order cooccurrences, we define a contextual similarity score that enables us to identify words that are similar with respect to certain topics. Second, this score allows us to build a topic network in which the nodes are words and the edges are weighted by the contextual similarity between those words. With the help of algorithms from graph theory, such as the Infomap algorithm (Rosvall and Bergstrom, 2008), we are able to detect clusters of highly connected words that can be identified as topics in our review texts. Third, we use these clusters and their words to compute a topic similarity score for each word in the network. In other words, we transform the hard assignment of words to topics into a probability score that expresses how likely it is that a given word belongs to a given topic/cluster.
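The three steps can be illustrated with a minimal sketch in Python. It is not the implementation presented in the talk, and it rests on several assumptions that the abstract does not specify: the contextual similarity is taken to be the cosine between document-level cooccurrence vectors, connected components of a thresholded graph stand in for Infomap, and the topic score is the normalised mean similarity of a word to each cluster's members. All function names and the threshold value are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(docs):
    """Map each word to a Counter of the words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = sorted(set(doc.lower().split()))
        for a, b in combinations(tokens, 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (assumed similarity score)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def build_network(vectors, threshold):
    """Step 1 and 2: keep an edge for every word pair whose similarity reaches the threshold."""
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


def connected_components(words, edges):
    """Hard clustering. A simple stand-in for Infomap, not Infomap itself."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in words:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(neighbours[word] - component)
        seen |= component
        clusters.append(component)
    return clusters


def topic_scores(word, clusters, vectors):
    """Step 3: mean similarity to each cluster's other members, normalised to sum to 1."""
    raw = []
    for cluster in clusters:
        sims = [cosine(vectors[word], vectors[w]) for w in cluster if w != word]
        raw.append(sum(sims) / len(sims) if sims else 0.0)
    total = sum(raw)
    return [r / total if total else 0.0 for r in raw]


# Toy reviews (invented for illustration only).
docs = [
    "room bed clean", "room bed pillow", "bed pillow clean",
    "breakfast coffee buffet", "breakfast coffee eggs", "coffee buffet eggs",
]
vectors = cooccurrence_vectors(docs)
edges = build_network(vectors, threshold=0.3)
clusters = connected_components(sorted(vectors), edges)
scores = topic_scores("room", clusters, vectors)
```

Two words that never appear in the same review can still receive a high score here, because the comparison is made on their cooccurrence vectors rather than on direct cooccurrence; this is the sense in which the sketch uses second-order cooccurrences. On real review data, a community detection algorithm such as Infomap would replace the connected-components step, since a thresholded graph of real vocabulary rarely splits into clean components.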
The presentation is structured as follows:
References:
David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3 (2003), pp. 993–1022, ISSN 1532-4435.
M. Rosvall and C. T. Bergstrom: Maps of information flow reveal community structure in complex networks. PNAS 105, 1118 (2008), http://dx.doi.org/10.1073/pnas.0706851105, http://arxiv.org/abs/0707.0609