ALJOHANI, TAHANI,MUSLIH,M (2022) Learner Profiling: Demographics Identification Based on
NLP, Machine Learning, and MOOCs Metadata. Doctoral thesis, Durham University.
Massive Open Online Courses (MOOCs) have become universal learning resources, and the COVID-19 pandemic is rendering these platforms even more necessary. Much research is ongoing to improve the learning resources provided to learners via MOOCs. These platforms also bring an incredible diversity of learners in terms of their demographics; thus, much MOOC research relies on learners' demographic data. Traditionally, these data are extracted from pre-course questionnaires that are filled in by the learners themselves. However, besides introducing potential cognitive overhead (asking learners to complete tasks outside of the main purpose of learning), this leads to a clear bias in any research based on these questionnaires. The bias arises because only about 10% of MOOC learners provide (a given type of) demographic data (with the intersection of all types of demographic data being significantly below 10%), while the others do not provide any demographic data at all. Thus, the population data obtained via questionnaires is not representative of the actual population in the MOOCs.

To resolve this issue, a research area called Learner Profiling (LP) is investigated in this thesis. This area naturally extends from a research area called Author Profiling (AP), which aims at identifying traits of authors in different domains. LP, instead, aims to identify learners' demographics in the online educational domain. This research specifically focused on identifying the employment status, gender, and academic level of learners in MOOCs. Classifying the employment status of learners was based on the semantic representation of their comments and on a comparison of sequential and parallel ensemble deep learning architectures (Convolutional Neural Networks and Recurrent Neural Networks). The best proposed method, which used an NLP-based approach for balancing the training samples, obtained a high average accuracy of 96.3%.
Additionally, the task of classifying the gender of learners was tackled based on the syntactic knowledge from the learners' comments. Different tree-structured Long Short-Term Memory models were compared and, as a result, the researcher proposed a novel version of a bi-directional composition function for existing architectures. In addition, 18 different combinations of word-level encoding and sentence-level encoding functions for this task were compared and evaluated. The results show that the novel bi-directional model outperforms all other models; among the proposed models, the highest accuracy (82.60% classification accuracy) was obtained by the combination of a Feedforward Neural Network and the Stack-augmented Parser-Interpreter Neural Network. Next, the learner's academic level was identified by training on small but rich data, i.e. not only textual content but also learner activity data. The researcher argues here that, to classify a learner trait from sparse textual content, researchers additionally need to use other features stemming from the MOOC platform, such as those derived from learners' actions on that platform. Accordingly, time stamps, quizzes, and discussions were examined as sources of learners' behavioural data for the classification problem. This novel approach to the task achieves a high accuracy (89% on average), even with a simple classifier, irrespective of data size. To conclude, the classification models used in this thesis show that highly accurate results can be achieved and that pre-course questionnaires, which extract demographic information at a high cognitive overhead, could become obsolete.
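As a rough illustration of the questionnaire-coverage argument in the abstract, the sketch below works through the arithmetic of why the share of learners who answer every demographic question can fall well below the roughly 10% who answer any single one. The three field names mirror the traits studied in the thesis; the independence assumption and the helper names are hypothetical and are not taken from the thesis.

```python
import math

# Approximate per-field response rates, following the abstract's
# "about 10%" figure. The dictionary itself is an illustrative assumption.
FIELD_RATES = {
    "employment_status": 0.10,
    "gender": 0.10,
    "academic_level": 0.10,
}


def share_answering_all_independent(rates):
    """Share of learners answering every field, ASSUMING each field is
    answered independently of the others (an assumption made here for
    illustration only, not a claim from the thesis)."""
    return math.prod(rates)


def share_answering_all_upper_bound(rates):
    """Best case: the same learners answer every field, so the
    intersection can never exceed the smallest single-field rate."""
    return min(rates)


rates = list(FIELD_RATES.values())
independent = share_answering_all_independent(rates)
upper_bound = share_answering_all_upper_bound(rates)

print(f"Upper bound on complete profiles: {upper_bound:.1%}")
print(f"Complete profiles under independence: {independent:.2%}")
```

The true figure lies somewhere between these two values, depending on how strongly learners who answer one question also tend to answer the others.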
Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Faculty and Department: Faculty of Science > Computer Science, Department of
Copyright: Copyright of this thesis is held by the author
Deposited On: 04 Jul 2022 14:33