Journal of Internet Computing and Services
ISSN 2287-1136 (Online) / ISSN 1598-0170 (Print)
    https://jics.or.kr/

Evaluation of Topic Models with regard to N-gram Changes in Topic Representations: Focusing on Coherence and Diversity


Hyun-Jung Park, Tae-Min Lee, Heui-Seok Lim, Journal of Internet Computing and Services, Vol. 26, No. 1, pp. 19-33, Feb. 2025
DOI: 10.7472/jksii.2025.26.1.19
Keywords: Topic Model, Text Mining, Coherence, Diversity, 1~N-gram Topic Representation, LDA, BERTopic

Abstract

Topic models are text mining techniques actively applied in various domains to explore the underlying themes in large-scale text data. Despite their popularity, few studies compare the performance of these models. To provide a comprehensive performance comparison, this study considers 1~N-gram topic representations in addition to unigrams. Most existing studies use unigram topic representations, but 1~N-gram topic representations have the advantage of enhancing topic interpretability. However, topic coherence for 1~N-gram topic representations is difficult to calculate with existing general-purpose Python libraries. We therefore first identify the causes of this difficulty and then propose and implement solutions. Next, we model unigram and 1~3-gram topic representations with major traditional BOW-based and recent deep learning-based topic models on Korean online articles and KCI paper data, and we compare and analyze their coherence and diversity. The results provide multifaceted insights for effective topic modeling, including the following. BERTopic and semi-supervised BERTopic generally outperform NMF and LDA in terms of coherence and diversity, both for unigrams and for 1~3-grams. Compared with LDA or NMF, the coherence of BERTopic and semi-supervised BERTopic for 1~3-gram topic representations tends to increase with a larger number of topics, larger window sizes, and longer input texts.
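
As a rough illustration of the two measures named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes whitespace-tokenized documents, a 1~N-gram term written as space-separated words, diversity defined as the fraction of unique terms among all topics' top terms, and coherence defined as average NPMI over boolean sliding windows. All function names (`contains_term`, `sliding_windows`, `topic_diversity`, `npmi_coherence`) are hypothetical.

```python
import math
from itertools import combinations


def contains_term(window, term):
    """True if the (possibly multi-word) term occurs contiguously in the window."""
    words = term.split()
    n = len(words)
    return any(window[i:i + n] == words for i in range(len(window) - n + 1))


def sliding_windows(tokens, size):
    """Yield token windows; a document shorter than `size` is one window."""
    if len(tokens) <= size:
        yield tokens
        return
    for i in range(len(tokens) - size + 1):
        yield tokens[i:i + size]


def topic_diversity(topics, top_k=10):
    """Fraction of unique terms among the top_k terms of all topics (assumed definition)."""
    top_terms = [term for topic in topics for term in topic[:top_k]]
    if not top_terms:
        return 0.0
    return len(set(top_terms)) / len(top_terms)


def npmi_coherence(topic, docs, window_size=10):
    """Average NPMI over all term pairs of one topic (assumed definition)."""
    windows = [w for doc in docs for w in sliding_windows(doc, window_size)]
    total = len(windows)
    if total == 0 or len(topic) < 2:
        return 0.0

    def prob(*terms):
        # Share of windows in which every given term occurs.
        hits = sum(all(contains_term(w, t) for t in terms) for w in windows)
        return hits / total

    scores = []
    for a, b in combinations(topic, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p_ab == 1:
            scores.append(1.0)   # the pair co-occurs in every window
        else:
            pmi = math.log(p_ab / (p_a * p_b))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    docs = [["topic", "model", "coherence"], ["deep", "learning", "model"]]
    topics = [["topic model", "coherence", "text mining"],
              ["deep learning", "coherence", "language model"]]
    print(topic_diversity(topics))
    print(npmi_coherence(["topic model", "coherence"], docs))
```

In this sketch a multi-word term counts as present only when its words appear contiguously and in order inside a window, so the score depends on the reference corpus being tokenized in the same way as the topic terms.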



Cite this article
[APA Style]
Park, H., Lee, T., & Lim, H. (2025). Evaluation of Topic Models with regard to N-gram Changes in Topic Representations: Focusing on Coherence and Diversity. Journal of Internet Computing and Services, 26(1), 19-33. DOI: 10.7472/jksii.2025.26.1.19.

[IEEE Style]
H. Park, T. Lee, and H. Lim, "Evaluation of Topic Models with regard to N-gram Changes in Topic Representations: Focusing on Coherence and Diversity," Journal of Internet Computing and Services, vol. 26, no. 1, pp. 19-33, 2025. DOI: 10.7472/jksii.2025.26.1.19.

[ACM Style]
Hyun-Jung Park, Tae-Min Lee, and Heui-Seok Lim. 2025. Evaluation of Topic Models with regard to N-gram Changes in Topic Representations: Focusing on Coherence and Diversity. Journal of Internet Computing and Services, 26, 1, (2025), 19-33. DOI: 10.7472/jksii.2025.26.1.19.