Evaluation of NLP-based System for L2 Learning and Assessment

Pre-conference Workshops

Pedagogical Applications of Corpus Tools

Alyson McClair & Ahmet Dursun

Corpus linguistics is a powerful tool for creating authentic teaching and learning materials. In this workshop, participants will learn about three tools for the language learning classroom: COCA, AntConc, and AntWordProfiler. Participants will gain practice in the use of each tool and ideas for how to use them in the classroom.

Online Pedagogical Tools for Interaction and Feedback

Hyejin Yang, Erin Todey & Amy Walton

In this workshop, participants will learn about EnglishCentral and virtual tutors. What is EnglishCentral? EnglishCentral is a website that enables English learners to improve their listening skills and pronunciation by watching authentic videos on various topics, such as culture, business, and education. Virtual tutors create an environment for students to interact with a program using written language. In this session, participants will gain practice in the use of each tool and ideas for how to use them in the classroom.

Plenary Presentation


Automated Essay Evaluation: Advances and Potential for L2 Writing

Jill Burstein

There are a number of outstanding questions related to language learner (L2) writing and the use of automated essay evaluation systems in assessment and instructional settings. To some extent, these questions are generalizable to native English speaker (L1) writing. In this talk, three important questions (Xi, 2010) will be highlighted that introduce concerns about use of the technology for language learner writing: (1) Does the use of assessment tasks constrained by automated scoring technologies lead to construct under- or misrepresentation?; (2) Do the automated scoring features under- or misrepresent the construct of interest?; and (3) Does the use of automated scoring have a positive impact on teaching and learning practices? How these questions are addressed by researchers and developers of automated essay evaluation technologies could have implications for practice in assessment and instruction for L1s and L2s. In this talk, an overview will be provided of automated essay evaluation and of current computational methods used to evaluate linguistic properties in essay data. To address the three questions, NLP methods that advance the state of the art in automated essay evaluation, and potentially expand writing construct coverage in essay evaluation technology, will be discussed.

References

Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(3), 291-300.



Proceed with Caution and Curiosity: Integrating Automated Writing Evaluation Software into the Classroom

Paige Ware

Over the last decade, software developers have actively improved the writing assistance features of automated scoring systems (Shermis & Burstein, 2003). With this shift in focus toward formative feedback, researchers interested in writing instruction have begun to investigate how the writing assistance tools of automated writing evaluation (AWE) software can supplement classroom instruction (Chen & Cheng, 2008; Ericsson & Haswell, 2006; Shermis & Burstein, 2003; Ware, 2011; Warschauer & Ware, 2006). This talk explores two recurring issues in discussions about the classroom integration of AWE: (1) the interplay among institutional pressures, classroom practices, and logistical constraints; and (2) pedagogical recommendations that emerge from a small but growing empirical base. Findings from a recent study of feedback provision on middle school writing serve to highlight various points within these framing issues. The mixed methods study examined the integration of AWE as one of three forms of feedback (AWE, human feedback delivered electronically, and face-to-face peer review) for middle school writers in an urban school. Quantitative results from an analysis of pre/post essays and qualitative findings from surveys and interviews help illustrate some of the key themes about the benefits and tradeoffs of integrating AWE into writing instruction.


Presentation Abstracts, Day 1


Lexical bundles: Enhancing automated analysis of Methodology sections

Viviana Cortes & Elena Cotos

The analysis of lexical bundles (LBs) has been the focus of many studies in applied linguistics, equipping academic writing practice with knowledge about these units’ register-specific use (Hyland, 2008). This knowledge has successfully been translated to instructional materials; however, LBs have not been used to inform the development of intelligent learning technologies. Existing automated scoring engines analyze discourse structure, syntactic structure, and vocabulary usage by using a combination of statistical and NLP techniques to extract linguistic features from training corpora (Valenti et al., 2003), but their feature sets do not include LBs.

With insights from LB studies on research articles (Cortes, 2004, 2009), this paper proposes that LBs are particularly informative for automated analysis of Methodology discourse. We report on a group of LBs identified in a one-million-word corpus of Methodology sections in 30 disciplines. The LB analysis started with 4-word combinations and extended to the largest possible combinations, all of which were classified grammatically and functionally. These analyses agreed with previous studies in the most frequent types of grammatical LB correlates and the most frequent functions performed but showed several differences in the bundles found across disciplines. Further disciplinary variation was found when the LB sets were classified into Methods’ communicative moves and steps (Cotos et al., 2012). The findings reveal important descriptors of Methods writing across and within disciplines. In our discussion of implications and immediate application, we explain how LBs will help enhance the Methods section analyzer of an NLP-based tool currently under development.
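As a rough illustration of the first step described above (starting from 4-word combinations), the following minimal sketch counts 4-word sequences across a set of texts and keeps those that recur. It is not the authors' procedure: the function names, the tokenization rule, and the frequency and dispersion thresholds (`min_freq`, `min_texts`) are all assumptions chosen for the example, and the sample sentences are invented.

```python
import re
from collections import Counter


def word_ngrams(text, n=4):
    """Return all n-word sequences in a text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Keep n-grams that occur at least min_freq times overall
    and appear in at least min_texts different texts (hypothetical cut-offs)."""
    freq = Counter()    # total occurrences of each n-gram
    spread = Counter()  # number of texts containing each n-gram
    for text in texts:
        grams = word_ngrams(text, n)
        freq.update(grams)
        spread.update(set(grams))
    return {g: c for g, c in freq.items()
            if c >= min_freq and spread[g] >= min_texts}


# Invented sample sentences, for illustration only.
sample_texts = [
    "The samples were collected at the end of the trial.",
    "Data were recorded at the end of the session.",
    "Participants were interviewed twice.",
]
bundles = candidate_bundles(sample_texts)
print(bundles)
```

Requiring a minimum number of texts as well as a minimum frequency is one common way to avoid treating a single author's repeated phrasing as a bundle; the extension to longer combinations and the grammatical and functional classification reported in the abstract are not covered by this sketch.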



Linguatorium in Teaching English Writing for Specific Purposes

Evgeny Chukharev-Hudilainen, Tatiana Klepikova

When teaching English writing for specific purposes to speakers of other languages, an important issue is building a working knowledge of terminology. In a non-immersion environment, vocabulary acquisition is time- and effort-consuming. Linguatorium is an on-line e-tutoring system specifically designed to address this challenge through iterative repetition of lexical items controlled by an adaptive algorithm. Based on the automatic analysis of the students’ performance (“latent assessment”), the algorithm estimates how well students know the words, builds and constantly updates student models, and generates tailor-made exercises. The students are asked to spend at least 10 minutes a day working with Linguatorium, and it takes an average of 2 minutes overall for a word to be stored in the student’s long-term memory. In this randomized double-blind study, we analyze the effectiveness of Linguatorium and the accuracy of its “latent assessment.” For a semester, students of the Russian State Maritime Academy in St. Petersburg used Linguatorium to practice marine engineering terminology and general vocabulary presented and exercised in their weekly EFL/ESP class. An independent examiner was recruited to administer a vocabulary production (writing) test on 112 words to a subset of 22 students. Our findings indicate that Linguatorium is a highly effective tool for lexical acquisition, as it proved to increase the long-term vocabulary recall from 18% at baseline to 56% (p<.01) in the writing test. However, the “latent assessment” algorithm was found not to account for the long-term forgetting process and therefore calls for improvement.



Using a Systemic Functional Model as the Basis for NLP-Based Oral Assessment

Jesse Gleason

There is currently a lack of research pertaining to the adequacy of a theoretical model upon which to base NLP-based speaking assessments (Chun, 2006; Downey et al., 2008; Van Moere, 2012; Xi, 2008). One approach applies psycholinguistic theory in order to “offset the noise in measurement”, a consequence of performance-based or communicative testing, thus providing reliable and standardized evidence of a contrived psycholinguistic construct (Van Moere, 2012, p. 8). Limited by the capacities of automatic speech recognition technology, current automated scoring procedures exclusively consider examinees’ language form while ignoring how they use language to make meaning. Moreover, grammar, from a traditional perspective, treats form separately from meaning. Unfortunately, NLP-based systems currently only incorporate erroneous grammatical forms using a native-speaker model as their sole basis for score interpretation and use. Such an approach can be problematic since grammatical form is only one criterion for successful oral communication.

In order to offer stakeholders more trustworthy evidence of how L2 examinees might be expected to use language to make meaning in real world target language domains, the current paper puts forth an alternative theoretical model anchored in systemic functional linguistics (SFL). A specific example is given of how an SFL transitivity analysis might be used to evaluate an NLP-based test task, the Versant™ for Spanish story-retell, as an example of a factual recount (Derewianka, 1990). Using transitivity analysis (Ravelli, 2000), 19 examinee responses were assessed for both meaning and form in order to explore their ability to make meaning with language. In so doing, an alternative method to current psycholinguistic approaches to scoring, which rely exclusively on linguistic form and structure to offer evidence of L2 speaking ability, is presented. This approach holds enormous promise for contributing to a re-conceptualization of automated scoring procedures on NLP-based L2 oral test tasks to address both the content and form.



How to build NLP-based language tutorials for Web 2.0 applications

L. Kirk Hagen

In this paper I explain a web-based language tutorial that runs on an NLP engine called “Hanoi.” I illustrate with Spanish, though the parser has made forays into other languages as well (Moosally & Hagen, 2001). The tutorial elicits structured but open-ended input from users. The parser then assigns a grammatical structure to the input or, when it is ill-formed, flags errors and links them to tutorials. A typical activity has 500 words available to users. The combinatorial possibilities of a vocabulary that large exceed the processing capabilities of any PC; a mere 15-word lexicon allows for more than a trillion unique responses. Thus a string-matching approach to NLP is out of the question, and the Hanoi parser therefore uses a purely rule-driven approach. My paper includes discussion of user interface (i.e., how to build language activities around NLP software) as well as of the syntactic framework that undergirds the parser. I argue that computer-based second language tutoring is actually among the best practical applications for NLP technology. Progress in NLP has been perennially hindered by the astonishing complexity of grammatical rules and the sheer size of the lexicon of any language (see Jackendoff, 2002, p. 88; Hagen, 2008, pp. 17-22). Among SL learners, on the other hand, one expects a small corpus of both rules and vocabulary. In this more restrictive context, NLP offers much richer kinds of feedback and interactions for language students.
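The abstract does not say how the "more than a trillion" figure for a 15-word lexicon is derived. One reading that reaches that order of magnitude, offered here purely as an assumption, is to count responses as orderings of distinct words drawn from the lexicon. The function names below are hypothetical and are not part of the Hanoi parser.

```python
from math import factorial


def ordered_selections(lexicon_size, length):
    """Number of ways to arrange `length` distinct words drawn from the lexicon."""
    return factorial(lexicon_size) // factorial(lexicon_size - length)


def total_responses(lexicon_size):
    """Total orderings over every response length from 1 word to the full lexicon."""
    return sum(ordered_selections(lexicon_size, k)
               for k in range(1, lexicon_size + 1))


# Full-length orderings of a 15-word lexicon: 15! = 1,307,674,368,000.
print(ordered_selections(15, 15))
# Allowing shorter responses only adds to that total.
print(total_responses(15) > 10**12)
```

Under this assumption the count already exceeds one trillion before grammaticality is considered, which illustrates why enumerating acceptable strings for matching does not scale and a rule-driven parser is the more practical design.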



The role of CALL in hybrid and online language courses

NLP-based analysis and feedback on rhetorical functions

Elena Cotos, Nick Pendar, Deepan Prabhu Babu

Applying principles from linguistics and computer science, NLP technology constitutes the basis for automated scoring of written constructed responses. It has also been integrated in the development of intelligent instructional tools. However, while targeting complex writing constructs (Burstein, 2003; Elliot, 2003), the pedagogical extrapolation of this technology is limited, as it is still confined to a single genre – the essay. To date, the only application that employs NLP for the analysis of a different genre, the research article, is IADE. Studies on its effectiveness reinforce the potential and value of NLP for the design of new context- and needs-based intelligent instructional tools (Cotos, 2011; Pendar & Cotos, 2008).

This paper presents the automated analysis engine of the Research Writing Tutor (RWT), a developing application similar to IADE in that it aims at providing genre- and discipline-specific feedback on the functional units of research article discourse. RWT, however, tackles a more complex NLP problem. Unlike traditional text categorization applications that categorize complete documents (Sebastiani, 2005), it categorizes every sentence in the text as both a move and a step using a 17-step schema (Swales, 1981). We report on constructing a cascade of two support vector machine classifiers (Vapnik, 1995) trained on a multi-disciplinary corpus of annotated texts. Specifically, we focus on the development of our Introduction section classifiers, which achieved 77% accuracy on moves and 72% on steps. We also report output error analysis results and discuss the implications of our findings for approaches to feedback generation.


Evaluating the accuracy of machine-based feedback on research article Introductions

Elena Cotos, Stephanie Link, Aysel Saricaoglu, Ruslan Suvorov

Over the past decade, the use of technology has gained prominence in the field of L2 learning and assessment, particularly with the advancements of NLP-based automated writing evaluation (AWE) (Attali, Bridgeman, & Trapani, 2010). However, while there is extensive system-centric research supporting the reliability of scoring engines at the core of AWE programs (e.g., e-rater® and IntelliMetric), there is little evidence of the accuracy and helpfulness of the feedback they generate. Since AWE is being increasingly adopted in L2 writing, it is arguable that obtaining such evidence needs to become a focal aspect of effectiveness-related empirical inquiries (Warschauer & Ware, 2006). Our study attempts to address this gap by exploring the discourse-level feedback provided by the Research Writing Tutor (RWT), a developing corpus-based tool whose analysis engine uses machine learning techniques to evaluate rhetorical functions in different sections of research articles. Specifically, we describe how RWT feedback is operationalized for Introduction sections, report on how accurate it is compared to human raters, and explain how accuracy issues are addressed by the feedback generation model. The data for this study is largely qualitative, consisting of student drafts, detailed logs with RWT-generated sentence-level feedback on each draft, and the same drafts manually classified into rhetorical functions by two coders. The comparison of machine and human analyses yielded encouraging results, which were also supported by the screen captures of students’ interaction with the tool. Our findings have important implications not only for RWT implementation, but also for the design of NLP-based feedback systems.


Using rhetorical, contextual, and linguistic indices to predict writing quality

Scott Crossley, Danielle McNamara, Liang Guo

This study explores the degree to which rhetorical, contextual, and linguistic features predict second language (L2) writing proficiency for the independent writing tasks in the TOEFL. Over 500 indices reported by the computational tool Coh-Metrix were examined for this analysis. These indices included traditional measures of linguistic complexity used to assess text difficulty along with a variety of new indices developed specifically to assess writing quality for the Intelligent Tutoring System Writing-Pal. These computational indices were regressed onto the human scores for a corpus of 480 independent TOEFL essays. The results of the analysis demonstrate that 15 indices related to lemma types, n-gram frequency, lexical sophistication, syntactic complexity, narrativity, grammatical complexity, cohesion, contextual relevance, and rhetorical features predicted 69% of the variance in the human scores. When classifying the essays based on the holistic score assigned by the human raters, the reported regression model provided exact matches for 58% of the essays and adjacent matches (i.e., within 1 point of the human score) for 99% of the essays. The results of this analysis provide strong evidence that rhetorical, contextual, and linguistic features can be used to accurately predict human ratings of writing quality for independent L2 writing tasks. Implications for these findings in reference to classroom instruction, essay scoring, L2 writing quality, and intelligent tutoring systems will be discussed.
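The exact and adjacent match rates reported above can be computed as in the following minimal sketch. It assumes that the regression predictions have already been rounded to the same integer scale as the human scores and that "adjacent" means within one point, so that exact matches are also counted as adjacent; the function name and the sample scores are invented for illustration and are not data from the study.

```python
def agreement_rates(human_scores, predicted_scores):
    """Return (exact, adjacent) agreement as proportions of all essays.

    exact:    predicted score equals the human score
    adjacent: predicted score is within 1 point of the human score
              (this includes exact matches)
    """
    if not human_scores or len(human_scores) != len(predicted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(human_scores, predicted_scores))
    exact = sum(1 for h, p in pairs if h == p)
    adjacent = sum(1 for h, p in pairs if abs(h - p) <= 1)
    return exact / len(pairs), adjacent / len(pairs)


# Invented scores on a 1-5 holistic scale.
human = [3, 4, 2, 5, 3]
predicted = [3, 3, 2, 3, 4]
exact_rate, adjacent_rate = agreement_rates(human, predicted)
print(exact_rate, adjacent_rate)
```

Reporting both figures is useful because the adjacent rate alone can look high on a short scale even when the model rarely reproduces the human score exactly.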


Predicting ESL Examinee Writing Scores Using the Academic Formula List

Sarah J. Goodwin, Scott Crossley, Liang Guo

Second language (L2) learners must not only know a great deal of vocabulary; they must also be aware of multi-word constructions (Ellis, Simpson-Vlach, & Maynard, 2008), which vary across spoken and written registers and carry specific discourse functions (Biber, Conrad, & Cortes, 2004). To meet expected discourse and rhetorical functions, L2 writers need to be able to use specific formulaic units, indicating that more proficient L2 writers will produce a greater number of expected academic formulas. This study tests this hypothesis by investigating L2 writers’ use of multi-word constructions found in the Academic Formula List (AFL; Simpson-Vlach & Ellis, 2010) and their strength in predicting human ratings of writing quality as reported in the TOEFL iBT public use dataset. Independent and integrated essay samples from the TOEFL were run through natural language processing tools in order to identify the number and type of AFL constructions each L2 essay contained. The resulting AFL values were entered into linear regression models to predict the human judgments of essay quality. The findings indicate that AFL constructions are not strong predictors of essay quality, with two AFL variables (Core AFL and Written AFL constructions) explaining only 6% of the variance for independent essay scores and three AFL variables (specification of attributes, topic introduction and focus, and written AFL constructions) explaining 11% of the variance for integrated essay scores. These findings provide important pedagogical implications about the value of AFL constructions in standardized testing and their effects on human judgments of essay quality.
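The step of identifying how many formulas each essay contains can be pictured with the short sketch below. It is an assumption-laden illustration rather than the tools used in the study: the three phrases are placeholders standing in for entries of a formula list, the tokenization is deliberately simple, and the per-100-words normalization is one possible choice that the abstract does not specify.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def count_formulas(text, formulas):
    """Count how often each multi-word formula occurs in the text."""
    tokens = tokenize(text)
    counts = {}
    for formula in formulas:
        target = formula.lower().split()
        n = len(target)
        counts[formula] = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )
    return counts


def formulas_per_100_words(text, formulas):
    """Total formula occurrences normalized by essay length (hypothetical measure)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 100 * sum(count_formulas(text, formulas).values()) / len(tokens)


# Placeholder phrases and an invented essay fragment, for illustration only.
placeholder_formulas = ["on the other hand", "in terms of", "as a result of"]
essay = ("On the other hand, the results differ in terms of scale. "
         "In terms of cost, on the other hand, they agree.")
counts = count_formulas(essay, placeholder_formulas)
print(counts)
```

Values of this kind, computed per essay, are what would then be entered as predictors in a regression against the human scores.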


Automated Writing Evaluation: Enough about reliability! What really matters for students and teachers?

Jooyoung Lee, Zhi Li, Stephanie Link, Hyejin Yang, Volker Hegelheimer

The development of natural language processing tools for writing has emerged as an area of high interest in the field of applied linguistics. To date, a number of computer-based programs offer second/foreign language learners and instructors automated writing evaluation (AWE) tools such as Criterion and MyAccess for the purpose of classroom use, and research on these tools continues to increase (e.g., Grimes & Warschauer, 2010). While studies and technical reports have claimed that AWE holds strong potential for L2 learning (e.g., Chodorow, Gamon & Tetreault, 2010; Cotos, 2011; Grimes & Warschauer, 2006), most findings are from the developer and researcher perspective. Few studies, however, have explored what diverse stakeholders such as ESL teachers and students actually expect from the use of AWE. This study demonstrates how knowledge of stakeholder expectations and perceptions about AWE can be beneficial with respect to future development of one AWE tool, Criterion, and different ways in which design and integration of classroom technology can be enhanced. To explore and understand the various perspectives toward AWE, we conducted a needs analysis targeting responses from each group of stakeholders, including ESL students, teachers, program coordinators, and software developers by administering questionnaires (6-point Likert scale items and open-ended questions) and conducting semi-structured interviews. Responses include information on perceptions about the current version of Criterion as well as expectations and suggestions for further improvement of the interface, scoring system, and feedback. Results of the needs analysis provide prospective users of AWE with insightful information to better understand the possible discrepancies among different stakeholders in terms of the perception versus actual use of Criterion in the writing classroom.


The Impact of Criterion® on Error Correction: A Longitudinal Study

Hye-won Lee, Jinrong Li, Volker Hegelheimer

Previous studies on automated writing evaluation (AWE) were mostly based on short-term error reduction (Chodorow, Gamon, & Tetreault, 2010), and they did not fully reveal the extent to which AWEs can facilitate students’ error correction and progress in grammatical accuracy on a long-term basis. Therefore, this study aims to examine the potential effects of immediate feedback from a piece of AWE software, Criterion®, on students’ success and difficulties in error correction and changes in their correction behavior across one academic semester. As part of a larger research project on the instructional use of Criterion® in ESL academic writing, writing samples were collected at the beginning and end of a semester to investigate corrections for article errors. Articles are known as one of the most difficult elements of English for non-native speakers to master (Dalgish, 1985; Diab, 1997; Izumi et al., 2003; Bitchener et al., 2005) and are thus the focus of the study. The article errors identified by Criterion® and the associated error feedback were analyzed from two aspects. First, types of Criterion® feedback were classified based on how accurately the feedback identifies an article error and how specifically it advises the students to make corrections. Second, types of students’ error correction on these highlighted errors were explored per feedback type, and different patterns of success, failure, or neglect in correction were identified in each feedback type. It is expected that the findings will reveal more insights into the potential of using AWEs in the L2 writing classroom.


Exploring usability and perceived effectiveness of a developing AWE tool

Sarah Huffman

The swiftly escalating popularity of automated writing evaluation (AWE) software in recent years has compelled much study into its potential for effective pedagogical use (Chen & Cheng, 2008; Cotos, 2011; Warschauer & Ware, 2006). Research on the effectiveness of AWE tools has concentrated primarily on determining learners’ achieved output (Warschauer & Ware, 2006) and emphasized the attainment of linguistic goals (Escudier et al., 2011); however, in-process investigations of users’ experience with and perceptions of AWE tools remain sparse (Shute, 2008; Ware, 2011). This study employs a mixed-methods approach to investigate how users interact with and perceive the effectiveness of the Research Writing Tutor (RWT), an emerging AWE tool which provides discourse-oriented, discipline-specific feedback on users’ section drafts of empirical research papers. Nine students (seven NNSs and two NSs) enrolled in a graduate-level course on academic research writing submitted drafts of Introduction sections to the RWT for analysis and feedback. Screen recordings of students’ interactions with the RWT, stimulated recall transcripts, quantitative and qualitative survey responses, and usability data were analyzed to capture a multidimensional depiction of students’ experience with the RWT. Findings reveal that, despite the developing tool’s inaccuracies, students were optimistic about the potential usefulness of the RWT and willing to contribute valuable suggestions for how to improve the tool. Results also show a tendency for students to compare AWE feedback to human feedback, perhaps rooted in skepticism about automated systems. Such process-based research has implications for improving pedagogical uses for, user experience with, and design of AWE software.


Design and User Experience Study of the Automated Research Writing Tutor

Nandhini Ramaswamy and Stephen Gilbert

Much Automated Writing Evaluation (AWE) research has evaluated the reliability of analysis of AWE technologies as well as the outcomes of students who use them. However, the outcomes and the overall learning experience may be influenced not only by the quality of automated analysis, but also by the design features of the system. To our knowledge, no work rooted in the principles of user interface design has been conducted to help better understand whether and how the design may impact users’ interaction with AWE. We report on a user experience study of a new AWE web-based program, the Research Writing Tutor, primarily developed and tested by graduate students at a Midwestern university. The user interface of this tool is based on human-computer interaction principles in order to provide better user experience and to enhance intuitive use of program features. It is also conceptualized to make the features that are important for users’ learning goals salient. In this paper, we describe the program’s design and present the results of several user evaluation sessions. The users were three groups of students with different levels of knowledge about the research writing conventions built into the program’s features and feedback: instructed, somewhat instructed, and not instructed in research writing. Evidence from video recordings made with Morae usability testing software, mouse movement tracking data, and survey responses indicates that the interface design may play an important role in learners’ development of user strategies and in their perception of the utility and effectiveness of the tool.