Twitter Sentiment Analysis: A Comparative Evaluation of Linear and Tree-Based Methods

Authors

  • Ali Ahmed, Superior University, Department of Computer Science and Information Technology, Lahore, Pakistan
  • Dr. Jawad Ahmed, Faculty of Computer Science and Information Technology, Superior University, Lahore, Pakistan
  • Dr. Saleem Mustafa, Faculty of Computer Science and Information Technology, Superior University, Lahore, Pakistan

Keywords:

Twitter sentiment analysis, Sentiment140 dataset, Logistic Regression, LightGBM, Random Forest, TF-IDF, computational scalability

Abstract

Twitter sentiment analysis faces challenges from noisy text, high-dimensional feature spaces, and heavy computational requirements. This study evaluates three machine learning models – Logistic Regression (LR), LightGBM (LGBM), and Random Forest (RF) – on the Sentiment140 dataset (1.6 million tweets) to identify efficient approaches for large-scale sentiment classification. A thorough preprocessing pipeline, including text cleaning, stemming, and TF-IDF vectorization, was applied to address the linguistic noise inherent in social media content. Stratified sampling ensured balanced training and testing splits.
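A minimal sketch of such a preprocessing pipeline is shown below, assuming scikit-learn for vectorization and splitting; the specific cleaning rules, vocabulary cap, and toy tweets are illustrative assumptions, not the paper's exact configuration (the paper additionally applies stemming, omitted here):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def clean_tweet(text):
    """Basic social-media text cleaning (assumed rules)."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+|#", "", text)  # strip URLs, mentions, hashtag marks
    text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
    return re.sub(r"\s+", " ", text).strip()

# Tiny illustrative corpus (1 = positive, 0 = negative)
tweets = ["I LOVE this!!! http://t.co/x", "@user worst day ever #fail",
          "pretty good overall", "not happy with the service"]
labels = [1, 0, 1, 0]

cleaned = [clean_tweet(t) for t in tweets]

# Stratified split keeps the class balance identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.5, stratify=labels, random_state=42)

vectorizer = TfidfVectorizer(max_features=50_000)  # vocabulary cap is an assumed setting
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary
```

Fitting the vectorizer on the training split alone avoids leaking test-set vocabulary statistics into training.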

Results showed that the LR model achieved the highest test accuracy (77.67%), outperforming LGBM (76.98%), despite LGBM's superior probabilistic calibration (log loss: 0.4837). RF failed to complete training within 8 hours due to computational inefficiency with high-dimensional TF-IDF features, highlighting its impracticality for large, high-dimensional text datasets. These findings underscore that linear models such as LR excel in sparse, high-dimensional feature spaces, while gradient-boosted trees (LGBM) require careful hyper-parameter tuning to balance speed and accuracy.
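The evaluation described above can be sketched with scikit-learn as follows; the toy corpus and default hyper-parameters are assumptions for illustration only, and LGBM would be fitted analogously via the `lightgbm` package (omitted here to keep the sketch self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Tiny stand-in corpus for TF-IDF-vectorized tweets (illustrative only)
texts = ["love it", "great movie", "happy day", "awesome stuff",
         "hate it", "terrible movie", "sad day", "awful stuff"]
y = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)  # sparse, high-dimensional input suits LR well
clf = LogisticRegression(max_iter=1000).fit(X, y)

acc = accuracy_score(y, clf.predict(X))     # hard-label accuracy
ll = log_loss(y, clf.predict_proba(X))      # lower is better; measures calibration
print(f"accuracy={acc:.2f}, log_loss={ll:.4f}")
```

Accuracy scores hard predictions, while log loss penalizes overconfident wrong probabilities, which is why the two metrics can rank LR and LGBM differently.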

This study emphasizes the importance of model selection based on task priorities: LR for interpretability and LGBM for probabilistic reliability. RF's failure illustrates the critical role of scalability in real-world NLP applications. The practical implication is that simpler models can rival complex ensembles in text classification while reducing computational cost. Future work should explore hybrid approaches, hyper-parameter optimization, and transformer-based embeddings (e.g., BERT) to enhance performance. The methodology provides a reproducible framework for efficient sentiment analysis, guiding researchers and practitioners in balancing accuracy, speed, and resource constraints.

Published

2025-09-30