Dataset for Analysis of Russian-Language Reviews on MOOCs Extracted from Stepik

Keywords: MOOC, dataset, frequency analysis of unigrams and bigrams, sentiment analysis, dostoevsky Python library, nltk, pymorphy2

Abstract

The article reviews datasets and research directions in educational data analysis based on natural language processing methods. The review reveals a lack of datasets for analyzing Russian-language reviews of MOOCs. To fill this gap, reviews were scraped from the Stepik platform to form a dataset of 5721 Russian-language reviews of MOOCs in mathematics, programming, biology, chemistry and physics. The reviews were studied using descriptive statistics, frequency analysis of unigrams and bigrams, and sentiment analysis with the dostoevsky Python library, whose sentiment classification reached a weighted F1-score of 74%. Unigram analysis revealed how courses are characterized in reviews of different sentiment, while bigram analysis revealed descriptions of various aspects of the learning content and the difficulties students encountered while taking MOOCs. The sentiment analysis shows that positive and neutral reviews predominate in the dataset. The dataset is publicly available on Mendeley Data and will be useful to specialists in text data analysis and in the development of learning analytics tools.
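
The frequency analysis of unigrams and bigrams mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the sample reviews, the stop-word list, and the helper names `tokenize` and `ngram_counts` are hypothetical. It also skips the lemmatization step that a library such as pymorphy2 would normally provide, and uses only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical sample reviews; the real Stepik dataset is not reproduced here.
reviews = [
    "Отличный курс, спасибо автору",
    "Отличный курс, но сложные задачи",
    "Сложные задачи, спасибо за курс",
]

# Tiny illustrative stop-word list (an assumption, not the article's list).
STOP_WORDS = {"но", "за", "и", "в", "на"}


def tokenize(text):
    """Lowercase the text and keep Cyrillic/Latin word tokens that are not stop words."""
    tokens = re.findall(r"[а-яёa-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) within each text separately."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams;
        # note that n-grams are formed after stop words have been removed.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


unigrams = ngram_counts(reviews, 1)
bigrams = ngram_counts(reviews, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Counting n-grams per review, rather than over one concatenated text, avoids spurious bigrams that span the boundary between two reviews.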

References

Alsaad F., Alawini A. (2020) Unsupervised Approach for Modeling Content Structures of MOOCs. Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020) (online, 2020, 10–13 July), pp. 18–28.

An Y.-H., Pan L., Kan M.-Y., Dong Q., Fu Y. (2019) Resource Mention Extraction for MOOC Discussion Forums. IEEE Access, vol. 7, pp. 87887–87900. https://doi.org/10.1109/access.2019.2924250

Andres J.M.L., Baker R.S., Gašević D., Siemens G., Crossley S.A., Joksimović S. (2018) Studying MOOC Completion at Scale Using the MOOC Replication Framework. Proceedings of the 8th International Conference on Learning Analytics and Knowledge (LAK '18) (Sydney, Australia, 2018, 07–09 March), pp. 71–78. https://doi.org/10.1145/3170358.3170369

Atapattu T., Falkner K. (2016) A Framework for Topic Generation and Labeling from MOOC Discussions. Proceedings of the Third (2016) ACM Conference on Learning @ Scale (Edinburgh, United Kingdom, 2016, 25–29 April), pp. 201–204. https://doi.org/10.1145/2876034.2893414

Chen Q., Chen Y., Liu D., Shi C., Wu Y., Qu H. (2016) PeakVizor: Visual Analytics of Peaks in Video Clickstreams from Massive Open Online Courses. IEEE Transactions on Visualization and Computer Graphics, vol. 22, no 10, pp. 2315–2330. https://doi.org/10.1109/tvcg.2015.2505305

Chen Q., Yue X., Plantaz X., Chen Y., Shi C., Pong T., Qu H. (2020) ViSeq: Visual Analytics of Learning Sequence in Massive Open Online Courses. IEEE Transactions on Visualization and Computer Graphics, vol. 26, no 3, pp. 1622–1636. https://doi.org/10.1109/tvcg.2018.2872961

Crossley S., Paquette L., Dascalu M., McNamara D.S., Baker R.S. (2016) Combining Click-Stream Data with NLP Tools to Better Understand MOOC Completion. Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (Edinburgh, United Kingdom, 2016, 25–29 April), pp. 6–14. https://doi.org/10.1145/2883851.2883931

Dhekne C., Bansal S.K. (2018) MOOClink: An Aggregator for MOOC Offerings from Various Providers. Journal of Engineering Education Transformations, vol. 31, January, Special issue. https://doi.org/10.16920/jeet/2018/v0i0/120907

Dina N.Z., Yunardi R.T., Firdaus A.A. (2021) Utilizing Text Mining and Feature-Sentiment-Pairs to Support Data-Driven Design Automation Massive Open Online Course. International Journal of Emerging Technologies in Learning (iJET), vol. 16, no 1, pp. 134–151. https://doi.org/10.3991/ijet.v16i01.17095

Dyulicheva Yu. (2022) Dataset of MOOCs' Reviews from Stepik on Russian Language, Mendeley Data, V1, https://doi.org/10.17632/8rwpvrw4hw.1 Available at: https://data.mendeley.com/datasets/8rwpvrw4hw/1 (accessed 20 November 2022).

Dyulicheva Y.Y. (2021) Uchebnaya analitika MOOK kak instrument analiza matematicheskoy trevozhnosti [Learning Analytics in MOOCs as an Instrument for Measuring Math Anxiety]. Voprosy obrazovaniya / Educational Studies Moscow, no 4, pp. 243–265. https://doi.org/10.17323/1814-9545-2021-4-243-265

Ezen-Can A., Boyer K.E., Kellogg S., Booth S. (2015) Unsupervised Modeling for Understanding MOOC Discussion Forums. Proceedings of the Fifth International Conference on Learning Analytics and Knowledge (Poughkeepsie, NY, 2015, 16–20 March), pp. 146–150. https://doi.org/10.1145/2723576.2723589

Iniesto F., Rodrigo C. (2019) YourMOOC4all: A Recommender System for MOOCs Based on Collaborative Filtering Implementing UDL. Transforming Learning with Meaningful Technologies. EC-TEL 2019. Lecture Notes in Computer Science (eds M. Scheffel, J. Broisin, V. Pammer-Schindler, A. Ioannou, J. Schneider), Cham: Springer, vol. 11722, pp. 746–750. https://doi.org/10.1007/978-3-030-29736-7_80

Jiang Z., Feng S., Cong G., Miao C., Li X. (2017) A Novel Cascade Model for Learning Latent Similarity from Heterogeneous Sequential Data of MOOC. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (Copenhagen, Denmark, 2017, 07–11 September), pp. 2768–2773. https://doi.org/10.18653/v1/d17-1293

Kastrati Z., Imran A.S., Kurti A. (2020) Weakly Supervised Framework for Aspect-Based Sentiment Analysis on Students’ Reviews of MOOCs. IEEE Access, vol. 8, pp. 106799–106810. https://doi.org/10.1109/access.2020.3000739

Khalil M., Belokrys G. (2020) OXALIC: An Open edX Advanced Learning Analytics Tool. Proceedings of the 2020 IEEE Learning with MOOCS (LWMOOCS) (Antigua Guatemala, Guatemala, 2020, 29 September — 02 October), pp. 185–190. https://doi.org/10.1109/lwmoocs50143.2020.9234322

Koffi D.D.A.S., Ouattara N., Mambe D.M., Oumtanaga S., Assohoun A.D.J.E. (2021) Courses Recommendation Algorithm Based on Performance Prediction in E-learning. IJCSNS International Journal of Computer Science and Network Security, vol. 21, no 2, pp. 148–158. https://doi.org/10.22937/IJCSNS.2021.21.2.17

Li X., Men C., Zhang F., Du Z. (2017) A Smart Visual Analysis Solution for MOOC Data. Proceedings of the 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 15th International Conference on Pervasive Intelligence and Computing, 3rd International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) (Orlando, FL, 2017, 06–10 November), pp. 101–106. https://doi.org/10.1109/dasc-picom-datacom-cyberscitec.2017.31

Lim S.L., Goh O.S. (2016) Intelligent Conversational Bot for Massive Online Open Courses (MOOCs). arxiv.org/abs/1601.07065. https://doi.org/10.48550/arXiv.1601.07065

Liu S., Ni C., Liu Z., Peng X., Cheng H.N. (2017) Mining Individual Learning Topics in Course Reviews Based on Author Topic Model. International Journal of Distance Education Technologies, vol. 15, no 3, pp. 1–14. https://doi.org/10.4018/ijdet.2017070101

Lopez G., Seaton D.T., Ang A., Tingley D., Chuang I. (2017) Google BigQuery for Education: Framework for Parsing and Analyzing edX MOOC Data. Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (Cambridge, MA, 2017, 20–21 April), pp. 181–184. https://doi.org/10.1145/3051457.3053980

Moreno-Marcos P.M., Alario-Hoyos C., Muñoz Merino P.J., Estevez-Ayres I., Kloos C.D. (2019) A Learning Analytics Methodology for Understanding Social Interactions in MOOCs. IEEE Transactions on Learning Technologies, vol. 12, no 4, pp. 442–455. https://doi.org/10.1109/tlt.2018.2883419

Mu X., Xu K., Chen Q., Du F., Wang Y., Qu H. (2019) MOOCad: Visual Analysis of Anomalous Learning Activities in Massive Open Online Courses. Proceedings of the 21st Eurographics Conference on Visualization, EuroVis 2019 - Short Papers (Porto, Portugal, 2019, 03–07 June) (eds J. Johansson, F. Sadlo, G.E. Marai), Porto: The Eurographics Association. https://doi.org/10.2312/evs.20191176

Mubarak A.A., Ahmed S.A., Cao H. (2021) MOOC-ASV: Analytical Statistical Visual Model of Learners’ Interaction in Videos of MOOC Courses. Interactive Learning Environments. https://doi.org/10.1080/10494820.2021.1916768

Nugumanova A.B., Akhmed-Zaki D.Zh., Bayburin E.M., Apaev K.S. (2021) Sentiment-analiz otzyvov pol'zovatelej v Fejsbuke: sravnenie bibliotek Textblob i Dostoevsky [Sentiment Analysis of Users Reviews in Facebook: Comparison of Textblob and Dostoevsky Libraries]. Bulletin of the National Engineering Academy of the Republic of Kazakhstan, no 4 (82), pp. 97–104. https://doi.org/10.47533/2020.1606-146X.120

Onah D., Pang E. (2021) MOOC Design Principles: Topic Modelling-Pyldavis Visualization & Summarization of Learners’ Engagement. Proceedings of the 13th Annual International Conference on Education and New Learning Technologies (online, 2021, 05–06 July), pp. 1082–1088. https://doi.org/10.21125/edulearn.2021.0282

Onan A. (2020) Sentiment Analysis on Massive Open Online Course Evaluations: A Text Mining and Deep Learning Approach. Computer Applications in Engineering Education, vol. 29, no 3, pp. 572–589. https://doi.org/10.1002/cae.22253

Ramesh A., Goldwasser D., Huang B., Daume H., Getoor L. (2014) Understanding MOOC Discussion Forums Using Seeded LDA. Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications (Baltimore, MD, 2014, 26 June), pp. 28–33. https://doi.org/10.3115/v1/w14-1804

Reich J., Tingley D.H., Leder-Luis J., Roberts M.E., Stewart B. (2014) Computer-Assisted Reading and Discovery for Student Generated Text in Massive Open Online Courses. SSRN Electronic Journal, vol. 2, no 1, pp. 156–184. https://doi.org/10.2139/ssrn.2499725

Sarıyalçınkaya A.D., Karal H., Altinay F., Altinay Z. (2021) Reflections on Adaptive Learning Analytics: Adaptive Learning Analytics. Advancing the Power of Learning Analytics and Big Data in Education (eds A. Azevedo, J. Azevedo, J. Onohuome Uhomoibhi, E. Ossiannilsson), Hershey, PA: IGI Global, pp. 61–84. https://doi.org/10.4018/978-1-7998-7103-3.ch003

Shah D. (2019) Year of MOOC-Based Degrees: A Review of MOOC Stats and Trends in 2018. Available at: https://www.classcentral.com/report/moocs-stats-and-trends-2018/ (accessed 8 November 2022).

Shrestha S., Pokharel M. (2021) Educational Data Mining in Moodle Data. International Journal of Informatics and Communication Technology (IJ-ICT), vol. 10, no 1, pp. 9–18. https://doi.org/10.11591/ijict.v10i1.pp9-18

Shridharan M., Willingham A., Spencer J., Yang T., Brinton C. (2018) Predictive Learning Analytics for Video-Watching Behavior in MOOCs. Proceedings of the 52nd Annual Conference on Information Sciences and Systems (CISS) (Princeton, NJ, 2018, 21–23 March), pp. 1–6. https://doi.org/10.1109/ciss.2018.8362323

Siddique S.A. (2020) Improvement of Online Course Content Using MapReduce Big Data Analytics. International Research Journal of Engineering and Technology (IRJET), vol. 7, no 8, pp. 50–56.

Singh A.K., Kumar S., Bhushan S., Kumar P., Vashishtha A. (2021) A Proportional Sentiment Analysis of MOOCs Course Reviews Using Supervised Learning Algorithms. Ingénierie des systèmes d'information, vol. 26, no 5, pp. 501–506. https://doi.org/10.18280/isi.260510

Sun D., Li T., You F., Hu M., Li Z. (2021) Prediction of Learning Behavior Characters of MOOC’s Data Based on Time Series Analysis. Journal of Physics: Conference Series, vol. 1994, no 1, Article no 012009. https://doi.org/10.1088/1742-6596/1994/1/012009

Thoms B., Eryilmaz E., Mercado G., Ramirez B., Rodriguez J. (2017) Towards a Sentiment Analyzing Discussion-Board. Proceedings of the 50th Hawaii International Conference on System Sciences (2017) (Hilton Waikoloa Village, Hawaii, 2017, 04–07 January), pp. 184–193. https://doi.org/10.24251/hicss.2017.021

Yao J., Wang L., Liu Y., Kui Y. (2021) Research on the Data Analysis System of Student Stress in English MOOC Based on Fuzzy C-Means Algorithm. Journal of Intelligent & Fuzzy Systems, May, pp. 1–11. https://doi.org/10.3233/jifs-219048

Yu J., Alrajhi L., Harit A., Sun Z., Cristea A.I., Shi L. (2021) Exploring Bayesian Deep Learning for Urgent Instructor Intervention Need in MOOC Forums. Proceedings of the Intelligent Tutoring Systems: 17th International Conference, ITS 2021 (online, 2021, 07–11 June), pp. 78–90. https://doi.org/10.1007/978-3-030-80421-3_10

Yu J., Luo G., Xiao T., Zhong Q. et al. (2020) MOOCCube: A Large-Scale Data Repository for NLP Applications in MOOCs. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (online, 2020, 05–10 July), pp. 3135–3142. https://doi.org/10.18653/v1/2020.acl-main.285

Wang H., Xie Y., Wen M., Yang Z. (2021) GazeMOOC: A Gaze Data Driven Visual Analytics System for MOOC with XR Content. Proceedings of the 27th ACM Symposium on Virtual Reality Software and Technology (Osaka, Japan, 2021, 08–10 December), Article no 74. https://doi.org/10.1145/3489849.3489923

Wong J., Pursel B., Divinsky A., Jansen B.J. (2015) Analyzing MOOC Discussion Forum Messages to Identify Cognitive Learning Information Exchanges. Proceedings of the Association for Information Science and Technology, vol. 52, no 1, pp. 1–10. https://doi.org/10.1002/pra2.2015.145052010023

Wong J., Zhang X. (2018) MessageLens: A Visual Analytics System to Support Multifaceted Exploration of MOOC Forum Discussions. Visual Informatics, vol. 2, no 1, pp. 37–49. https://doi.org/10.1016/j.visinf.2018.04.005

Zarra T., Chiheb R., Faizi R., El Afia A. (2018) MOOCs’ Recommendation Based on Forum Latent Dirichlet Allocation. Proceedings of the 2nd International Conference on Smart Digital Environment (Rabat, Morocco, 2018, 18–20 October), pp. 88–93. https://doi.org/10.1145/3289100.3289115

Published
2022-12-23
How to Cite
Dyulicheva, Yulia. 2022. “Dataset for Analysis of Russian-Language Reviews on MOOCs Extracted from Stepik”. Voprosy Obrazovaniya / Educational Studies Moscow, no. 4 (December), 298–321. https://doi.org/10.17323/1814-9545-2022-4-298-321.
Section
Datasets in Education