Corpus-Assisted Research

After watching the recorded presentations, join these authors for a live panel discussion on December 4, 2020, from 8:30 am to 9:00 am (CST). Moderator: Kimberly Becker


Hong Ma and Jinglei Wang

Assistant Professor
Zhejiang University, China

Link to Hong Ma's second presentation

Assigning Students' Writing Samples to CEFR Levels Automatically: A Machine-Learning Approach

This project proposes a method for automatically assigning students' writing samples to CEFR (Common European Framework of Reference for Languages) levels. We believe that the proposed method, which relies on big data and a machine-learning algorithm, will facilitate future work on alignment and writing evaluation.
The data comprise 1,500 writing samples selected from the EF-Cambridge Open Language Database (EFCAMDAT), a publicly available corpus containing over 83 million words from 1 million assignments written by 174,000 learners worldwide across a wide range of levels (CEFR stages A1-C2). The 1,500 writing samples are equally distributed across all six CEFR levels. The quality indexes of the students' writing samples are obtained with the automatic writing analysis tool Coh-Metrix.
This project uses a machine-learning technique to model the predictive relationship between the quality indexes (the independent variables) and the CEFR levels (the dependent variable) of students' writing samples, since machine-learning methods, which have recently emerged in linguistic research, have generally demonstrated higher accuracy in classification tasks than traditional regression models (McNamara, Crossley, Roscoe, Allen & Dai, 2015). In similar endeavors, the accuracy of different machine-learning classifiers has been reported in N-gram recognition tasks (Jarvis, 2011), and discriminant function analysis (one of the machine-learning classifiers) was used to predict scores of students' argumentative essays (McNamara et al., 2015). In the current research, we adopted a more advanced machine-learning classifier, multiple support vector machine recursive feature elimination (MSVM-RFE), which has demonstrated considerably high accuracy in more complicated classification tasks (e.g., classification and selection of better gene subsets in cancer studies) (Duan, Rajapakse, Wang, & Azuaje, 2005). Adopting this classifier will not only result in an algorithm that assigns students' writing samples to CEFR levels automatically, but also rank the features that discriminate between different levels of writing quality. These top features yield pedagogical implications important to writing instruction.
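The feature-ranking idea behind recursive feature elimination can be illustrated with a minimal sketch. It is not the authors' implementation: it handles only two classes rather than six CEFR levels, it omits the resampling step that distinguishes the multiple-SVM variant, and the three features and eight samples are invented stand-ins for Coh-Metrix indexes. A linear SVM is trained by plain subgradient descent, the feature with the smallest squared weight is dropped, and the process repeats on the remaining features.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM without bias term by full-batch subgradient descent
    on the L2-regularized hinge loss. Returns the weight vector."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regularizer
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y):
    """Rank feature indices from most to least important by repeatedly
    removing the feature with the smallest squared SVM weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        worst_pos = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst_pos))
    eliminated.append(remaining[0])
    return eliminated[::-1]  # last survivor first


# Invented toy data: feature 0 separates the two classes strongly,
# features 1 and 2 carry little or no signal.
X = [
    [2.0, 0.5, 0.1], [1.5, -0.2, -0.1], [2.5, 0.3, 0.1], [1.8, 0.1, -0.1],
    [-2.0, -0.4, 0.1], [-1.6, 0.2, -0.1], [-2.4, -0.3, 0.1], [-1.9, -0.1, -0.1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = higher level, -1 = lower level

ranking = svm_rfe(X, y)
print("Features ranked from most to least discriminating:", ranking)
```

On this toy data, feature 0 survives every elimination round and is therefore ranked first. In practice, a tested library implementation with cross-validation would be preferable to a hand-written optimizer.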
Video Recording

Yongkook Won

Visiting Researcher
Center for Educational Research, Seoul National University

Link to Yongkook Won's second presentation

Topic Modeling Analysis of Research Trends of English Language Teaching in Korea

The goal of this study is to understand the research trends of English language teaching (ELT) in Korea over the 20 years from 2000 to 2019. To this end, 11 major ELT-related academic journals in Korea were selected, and the abstracts of 7,035 articles published in these journals were collected and analyzed. The number of articles published in the journals increased steadily from the first half of the 2000s to the first half of the 2010s, but decreased somewhat in the late 2010s. The text of the abstracts was preprocessed using the NLTK tokenizer (Bird, Loper, & Klein, 2009) and the spaCy POS tagger (Honnibal & Montani, 2017), and only the nouns were used for further analysis. Based on previous studies of ELT research trends (Kim & Kim, 2015), 25 topics were extracted from the abstracts by applying latent Dirichlet allocation (LDA) topic modeling with the R package topicmodels (Grün & Hornik, 2011). Frequently studied topics in the field of ELT included teachers, tertiary education, listening, language testing, and curriculum. Time series regression analysis shows that rising topics include task-based learning, tertiary education, vocabulary, affective factors, and peer feedback, while falling topics include speaking, culture, and computer-assisted language learning (CALL) (at α = .001).
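The rising/falling classification can be illustrated with a small sketch. The abstract does not specify the regression model, so this assumes a simple linear trend of each topic's share against publication year and labels a topic by the sign of the fitted slope. The topic names and yearly shares below are invented for illustration, and the significance test reported in the abstract (α = .001) is omitted.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_trends(years, topic_shares, tol=1e-9):
    """Label each topic 'rising', 'falling', or 'flat' from the sign of
    its fitted slope; tol absorbs floating-point noise around zero."""
    trends = {}
    for topic, shares in topic_shares.items():
        slope = ols_slope(years, shares)
        if slope > tol:
            trends[topic] = "rising"
        elif slope < -tol:
            trends[topic] = "falling"
        else:
            trends[topic] = "flat"
    return trends


# Invented example: mean share of each topic per period (not study data).
years = [2000, 2005, 2010, 2015]
topic_shares = {
    "topic_a": [0.02, 0.03, 0.04, 0.05],
    "topic_b": [0.06, 0.05, 0.04, 0.03],
    "topic_c": [0.04, 0.04, 0.04, 0.04],
}

trends = classify_trends(years, topic_shares)
for topic, label in trends.items():
    print(topic, label)
```

A full analysis would also test whether each slope differs significantly from zero before calling a topic rising or falling, which is what the reported α level refers to.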
Video Recording