Dataset for Analysis of Russian-Language Reviews on MOOCs Extracted from Stepik

Keywords: MOOC, dataset, frequency analysis of unigrams and bigrams, sentiment analysis, dostoevsky Python library, nltk, pymorphy2


The article reviews datasets and research directions in educational data analysis based on natural language processing methods. The review reveals a lack of datasets for analyzing Russian-language reviews of MOOCs. To address this gap, reviews were scraped from the Stepik platform to form a dataset of 5721 Russian-language reviews of MOOCs in mathematics, programming, biology, chemistry and physics. The reviews were studied using descriptive statistics, frequency analysis of unigrams and bigrams, and sentiment analysis with the dostoevsky Python library; sentiment classification reached a weighted F1-score of 74%. Unigram analysis revealed how courses are characterized in reviews of different sentiment, while bigram analysis revealed the aspects of learning content that reviewers discuss and the difficulties students encounter when taking MOOCs. The sentiment analysis shows that positive and neutral reviews predominate in the dataset. The dataset is publicly available on Mendeley Data and will be useful to specialists in text data analysis and to developers of learning analytics tools.
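The two quantitative steps named in the abstract, n-gram frequency counting and the weighted F1-score, can be illustrated with a minimal standard-library sketch. It does not reproduce the article's pipeline: it performs no lemmatization or stop-word removal and does not call nltk, pymorphy2 or dostoevsky. The function names, the toy reviews and the toy labels are assumptions made for illustration and are not taken from the dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of Cyrillic or Latin letters."""
    return re.findall(r"[a-zа-яё]+", text.lower())


def ngram_counts(reviews, n):
    """Count n-grams per review, so no n-gram spans two reviews."""
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        # zip over n shifted copies of the token list yields n-tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of y_true."""
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn) / len(y_true)  # tp + fn is the class support
    return total


# Hypothetical toy data, not drawn from the Stepik dataset.
toy_reviews = [
    "Отличный курс, понятные объяснения",
    "Отличный курс, но сложные задачи",
    "Сложные задачи без объяснений",
]
toy_true = ["positive", "positive", "neutral", "negative"]
toy_pred = ["positive", "neutral", "neutral", "negative"]

if __name__ == "__main__":
    print(ngram_counts(toy_reviews, 1).most_common(3))
    print(ngram_counts(toy_reviews, 2).most_common(3))
    print(weighted_f1(toy_true, toy_pred))
```

Weighting by class support matters when sentiment classes are imbalanced, as they are when positive and neutral reviews predominate: the score then reflects performance on the frequent classes more than on the rare ones.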





How to Cite
Dyulicheva, Yulia. 2022. “Dataset for Analysis of Russian-Language Reviews on MOOCs Extracted from Stepik”. Voprosy Obrazovaniya / Educational Studies Moscow, no. 4 (December), 298–321.
Datasets in Education