Dodona: learn to code with a virtual co-teacher that supports active learning

Dodona (dodona.ugent.be) is an intelligent tutoring system for computer programming. It bridges the gap between assessment and learning by providing real-time data and feedback to help students learn better, teachers teach better and educational technology become more effective. We demonstrate how Dodona can be used as a virtual co-teacher to stimulate active learning and support challenge-based education in open and collaborative learning environments. We also highlight some of the opportunities (automated feedback, learning analytics, educational data mining) and challenges (scalable feedback, open internet exams, plagiarism) we faced in practice. Dodona is free to use and has more than 36 thousand registered users across many educational and research institutes, 15 thousand of whom registered in the past year. Lowering the barriers for such broad adoption was achieved by following best practices and extensible approaches for software development, authentication, content management, assessment, security and interoperability, and by adopting a holistic view on computer-assisted learning and teaching that spans all aspects of managing courses that involve programming assignments. The source code of Dodona is available on GitHub under the permissive MIT open-source license.


Introduction
The only way to learn how to solve problems with computer programs is by solving lots of problems, and programming assignments are the main way in which such practice is generated (Gibbs and Simpson, 2005). Because of its potential to establish feedback loops that are scalable and responsive enough for an active learning environment, automated source code assessment has become a driving force in computer science, statistics and data science courses (Ala-Mutka, 2005; Douce et al., 2005; Ihantola et al., 2010; Paiva et al., 2022). Automated assessment was introduced in programming education in the early 1960s (Hollingsworth, 1960) and enables students to receive immediate and customized feedback upon each submitted solution without the need for any human intervention, especially when provided through interactive web applications (Wasik et al., 2018). Due to the iterative nature of software development, and problem solving in general, establishing such a collaborative dialogue while students work towards acceptable solutions for their programming assignments could hardly be achieved by human assessment alone (Bell, 2011; Ihantola et al., 2010). In fact, Cheang et al. (2003) identified the labor-intensiveness of assessing programming assignments as the main reason why few such assignments are given to students, when ideally they should be given many more. Freeing teachers from the assessment burden creates possibilities for stimulating students to practice more often, which is recognized as a good practice for improving programming competences (Woit and Mason, 2003).
Given the clear advantages in speed, availability, consistency and objectivity (Ala-Mutka, 2005), automated assessment is not only capable of supporting assessment of learning, but also assessment for learning. Assessment for learning focuses on formative assessment that provides feedback to students to enhance their learning, rather than only summatively assessing their learning at the end. However, when implementing automated assessment, there is a need for careful pedagogical design of programming assignments and assessment strategies (Forisek, 2006). Failing to do so may reduce motivation and stimulate cheating (Wootton, 2002). Given the limitations of assessment tools and depending on a course's learning goals, teaching context and available human resources, teachers therefore need to decide what feedback they want and can provide to their students and when, how and by whom it is provided (Gibbs and Simpson, 2005). Generic feedback on programming assignments or customized feedback on student submissions may come as tips & tricks, code quality metrics, software testing reports, source code annotations, model solutions and grades (Ala-Mutka, 2005). Feedback might be provided before students start working on an assignment ("feed up"), while they are working on their solution ("feed back" on how they performed and "feed forward" on what to do next) or after the submission deadline has passed ("feed back" on their overall performance) (Hattie and Timperley, 2007). Feedback that takes some time to generate can only be supplied asynchronously, while synchronous delivery also becomes an option when feedback is immediately available (Chickering and Gamson, 1987).
As a result, a lot of pedagogical decisions need to be made when designing and running courses that involve automatically assessed programming assignments. What programming language is used (Crick, 2017)? Should introductory programming merely focus on learning the syntax and semantics of a programming language? Or should it simultaneously also stress the importance of programming skills for problem solving with computers, readability through good programming style (Rogers et al., 2014), performance through data structures and algorithms, or maintainability through writing software tests (Edwards, 2004; Marrero and Settle, 2005)? Should assessment primarily report on shortcomings of student submissions or also hint at how these defects might be remedied (Rivers and Koedinger, 2017)? Should grades and more elaborate feedback be generated purely based on automated assessment, or does it also require human intervention (Douce et al., 2005; Jackson, 2000)? Should assignments be restricted to what can reasonably be assessed automatically or should they remain as authentic as possible? Should students learn to use standard software development tools for writing, building, running, testing and debugging software as supplied by modern integrated development environments, or should they use tools specifically designed for use in an educational context? Should students learn to understand the diagnostic messages generated by compilers and interpreters, or do we provide help deciphering them (Becker et al., 2019)? Should students be somehow restricted in submitting solutions for programming assignments or should feedback from automated assessment be reduced (Ihantola et al., 2010)? Without any exception, all intelligent tutoring systems for automated source code assessment that were developed over the years have hardcoded some of these choices, either by explicit design or silently by not supporting alternatives. Ihantola et al. (2010) identified lack of flexibility due to such hardcoded restrictions as one of the main reasons why few systems have seen adoption beyond the course or institute where they were initially developed, and found their open source policy, lifespan, interoperability and portability quite disappointing.
This paper introduces Dodona (dodona.ugent.be) as an online learning environment that embraces the importance of active learning and just-in-time feedback in courses involving programming assignments. After presenting some of its key features for computer-assisted learning and teaching (Section 2), we discuss how we use the platform in managing an introductory programming course with a strong focus on active and online learning (Section 3). This case study can be read as an inspiration for running programming courses in an open and collaborative learning environment, but also explains the context that guided some of the design choices we made in developing Dodona. We further discuss how Dodona succeeds in breaking down the walls that prevent EdTech tools and open educational resources from being effectively used beyond the context in which they were initially created (Section 4).

Key features of Dodona
Dodona is an intelligent tutoring system for computer programming that is built around a generic infrastructure for automatic assessment and a distributed model for developing and publishing learning material. This allows it to cope with the multifaceted nature of assessing source code submitted for programming assignments by supporting different programming languages, runtime environments, evaluation criteria, software testing techniques and target audiences. But the platform also endorses the blended learning idea that feedback can be provided as a multi-step process, mixing the complementary strengths of learning from self, peers, instructors, teachers and software agents to deal with their limitations in producing the required volume and thoroughness (Cooper, 2000). In that vision, automated assessment is only a first step in providing remedial feedback in small chunks while students work on their programming assignments. The frequency and responsiveness of automated assessment, which can be provided reasonably economically, compensates for the basic level and lack of individualization of its feedback (Gibbs and Simpson, 2005). To further deal with the economy of scale, the feedback from automated assessment is used as a stepping stone to streamline human interventions whenever students ask for more customized feedback or when reviewing and grading source code submitted during high-stakes tests and exams. Dodona also lowers the barriers for broader adoption of the tool by following best practices and extensible models for authentication, content management, assessment, security and interoperability, and by adopting a holistic view on computer-assisted learning and teaching that spans all aspects of managing courses, from internationalization and localization to learning analytics and educational data mining. In what follows, we cover each of these features in more detail.

Classroom management
In Dodona, a course is where teachers and instructors effectively manage a learning environment by instructing, monitoring and evaluating their students and interacting with them, either individually or as a group. A Dodona user who creates a course becomes its first administrator and can promote other registered users to course administrators. In what follows, we will also use the generic term teacher as a synonym for course administrator if this Dodona-specific interpretation is clear from the context, but keep in mind that courses may have multiple administrators.
The course itself is laid out as a learning path that consists of course units called series, each containing a sequence of learning activities (Figure 1). Among the learning activities we differentiate between reading activities that can be marked as read and programming assignments with support for automated assessment of submitted solutions. Learning paths are composed as a recommended sequence of learning activities that builds knowledge progressively, allowing students to monitor their own progress at any point in time. Courses can either be created from scratch or by copying an existing course and making additions, deletions and rearrangements to its learning path.
To avoid unnecessary user management, students can self-register for courses. A course can either be announced in the public overview of Dodona for everyone to see, or be limited in visibility to students from a certain educational institution. Alternatively, students can be invited to a hidden course by sharing a secret link. Independent of course visibility, registration for a course can be open to everyone, restricted to users from the institution the course is associated with, or disabled altogether. Registrations are either approved automatically or require explicit approval by a teacher. Registered users can be tagged with one or more labels to create subgroups that may play a role in learning analytics and reporting.
Figure 1: Main course page (administrator view) showing some series with deadlines, reading activities and programming assignments in its learning path. At any point in time, students can see their own progress through the learning path of the course. Teachers have some additional icons in the navigation bar (top) that lead to an overview of all students and their progress, an overview of all submissions for programming assignments, general learning analytics about the course, course management, and a dashboard with questions from students in various stages of being answered (Figure 6). The red dot on the latter icon notifies teachers that some student questions are still pending.
Students and teachers see more or less the same course page, except for some management features and learning analytics that are reserved for teachers. Teachers can make content in the learning path temporarily inaccessible and/or invisible to students. Content is typically made inaccessible when it is still in preparation or if it will be used for evaluating students during a specific period. A token link can be used to grant access to invisible content, e.g. when taking a test or exam with a subgroup of students.
Students can mark reading activities as read only once, but there is no restriction on the number of solutions they can submit for programming assignments. Submitted solutions are automatically assessed and students receive immediate feedback as soon as the assessment has completed, usually within a few seconds. Dodona stores all submissions, along with submission metadata and generated feedback, such that the submission and feedback history can be retrieved at any time. On top of automated assessment, student submissions may be further assessed and graded manually by a teacher.
Series can have a deadline. Passed deadlines do not prevent students from marking reading activities as read or submitting solutions for programming assignments in those series. However, learning analytics, reports and exports usually only take into account submissions made before the deadline. Because of the importance of deadlines, and to avoid discussions with students about missed deadlines, series deadlines are not only announced on the course page. The student's home page highlights upcoming deadlines for individual courses and across all courses. While working on a programming assignment, students also see a clear warning from ten minutes before the deadline onwards. Courses also provide an iCalendar link (Stenerson and Dawson, 1998) that students can use to publish course deadlines in their personal calendar application.
Because Dodona logs all student submissions and their metadata, including feedback and grades from automated and manual assessment, we use these data to integrate reports and learning analytics in the course page (Ferguson, 2012). We also provide export wizards that enable the extraction of raw and aggregated data in CSV format for downstream processing and educational data mining (Baker and Yacef, 2009; Romero and Ventura, 2010). This allows teachers to better understand student behavior, progress and knowledge, and might give deeper insight into the underlying factors that contribute to student actions (Ihantola et al., 2010). Such understanding, knowledge and insights can be used to make informed decisions about courses and their pedagogy, to increase student engagement, and to identify at-risk students (Van Petegem et al., 2022).
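As a minimal sketch of such downstream processing, the snippet below computes per-student aggregates from an exported CSV file. The column names used here ("username", "exercise", "status") are assumptions for illustration and may differ from the actual export format.

```python
import pandas as pd

# Load an export of all submissions in a course (column names assumed).
submissions = pd.read_csv("submissions.csv")

# Aggregate per student: number of submissions and number of correct ones.
per_student = submissions.groupby("username").agg(
    attempts=("exercise", "count"),
    solved=("status", lambda s: (s == "correct").sum()),
)
per_student["success_rate"] = per_student["solved"] / per_student["attempts"]

# Students with the lowest success rates may warrant a closer look.
print(per_student.sort_values("success_rate").head(10))
```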

User management
Instead of providing its own authentication and authorization, Dodona delegates authentication to external identity providers (e.g. educational and research institutions) through SAML (Farrell et al., 2002), OAuth (Leiba, 2012; Hardt, 2012) and OpenID Connect (Sakimura et al., 2014). This support for decentralized authentication allows users to benefit from single sign-on when using their institutional account across multiple platforms, and teachers to trust their students' identities when taking high-stakes tests and exams in Dodona.
Dodona automatically creates user accounts upon successful authentication and uses the association with external identity providers to assign an institution to users. By default, newly created users are assigned a student role. Teachers and instructors who wish to create content (courses, learning activities and judges) must first request teacher rights using a streamlined form.

Automated assessment
The range of approaches, techniques and tools for software testing that may underpin assessing the quality of software under test is incredibly diverse. Static testing directly analyses the syntax, structure and data flow of source code, whereas dynamic testing involves running the code with a given set of test cases (Graham et al., 2021; Oberkampf and Roy, 2010). Black-box testing uses test cases that examine functionality exposed to end-users without looking at the actual source code, whereas white-box testing hooks test cases onto the internal structure of the code to test specific paths within a single unit, between units during integration, or between subsystems (Nidhra and Dondeti, 2012). So, broadly speaking, there are three levels of white-box testing: unit testing, integration testing and system testing (Dooley, 2011; Wiegers, 1996). Source code submitted by students can therefore be verified and validated against a multitude of criteria: functional completeness and correctness, architectural design, usability, performance and scalability in terms of speed, concurrency and memory footprint, security, readability (programming style), maintainability (test quality) and reliability (Staubitz et al., 2015). This is also reflected by the fact that a diverse range of metrics for measuring software quality have come forward, such as cohesion/coupling (Yourdon and Constantine, 1979; Stevens et al., 1999), cyclomatic complexity (McCabe, 1976) or test coverage (Miller and Maloney, 1963).

Figure 2: Outline of the procedure to automatically assess a student submission for a programming assignment. Dodona instantiates a Docker container (1) from the image linked to the assignment (or from the default image linked to the judge of the assignment) and loads the submission and its metadata (2), the judge linked to the assignment (3) and the assessment resources of the assignment (4) into the container. Dodona then launches the actual assessment, collects and bundles the generated feedback (5), and stores it into a database along with the submission and its metadata.
To cope with such a diversity of software testing alternatives, Dodona is centered around a generic infrastructure for programming assignments that support automated assessment. Assessment of a student submission for an assignment comprises three loosely coupled components: a container, a judge and an assignment-specific assessment configuration.
For proper virtualization we use Docker containers (Peveler et al., 2019), which use OS-level containerization technologies and define runtime environments in which all data and executable software (e.g., scripts, compilers, interpreters, linters, database systems) are provided and executed. These resources are typically pre-installed in the image of the container. Prior to launching the actual assessment, the container is extended with the submission, the judge and the resources included in the assessment configuration (Figure 2). Additional resources can be downloaded and/or installed during the assessment itself, provided that Internet access is granted to the container.
The actual assessment of the student submission is done by a software component called a judge (Wasik et al., 2018). The judge must be robust enough to provide feedback on all possible submissions for the assignment, especially submissions that are incorrect or that deliberately attempt to tamper with the automatic assessment procedure (Forisek, 2006). Following the principles of software reuse, the judge is ideally also a generic framework that can be used to assess submissions for multiple assignments. This is enabled by the submission metadata that is passed when calling the judge, which includes the path to the source code of the submission, the path to the assessment resources of the assignment and other metadata such as programming language, natural language, time limit and memory limit.
Rather than providing a fixed set of judges, Dodona adopts a minimalistic interface that allows third parties to create new judges: automatic assessment is bootstrapped by launching the judge's run executable, which can fetch the JSON-formatted submission metadata from standard input and must generate JSON-formatted feedback on standard output. The feedback has a standardized hierarchical structure that is specified in a JSON schema. At the lowest level, tests are a form of structured feedback expressed as a pair of generated and expected results. They typically test some behavior of the submitted code against expected behavior. Tests can have a brief description and snippets of unstructured feedback called messages. Descriptions and messages can be formatted as plain text, HTML (including images), Markdown, or source code. Tests can be grouped into test cases, which in turn can be grouped into contexts and eventually into tabs. All these hierarchical levels can have descriptions and messages of their own and serve no other purpose than visually grouping tests in the user interface. At the top level, a submission has a fine-grained status that reflects the overall assessment of the submission: compilation error (the submitted code did not compile), runtime error (executing the submitted code failed during assessment), memory limit exceeded (the memory limit was exceeded during assessment), time limit exceeded (assessment did not complete within the given time), output limit exceeded (too much output was generated during assessment), wrong (assessment completed but not all strict requirements were fulfilled), or correct (assessment completed and all strict requirements were fulfilled).
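To make the interface concrete, here is a minimal sketch of what a judge's run executable could look like in Python. The metadata keys ("source", "time_limit") and feedback fields are simplified assumptions for illustration; the authoritative feedback structure is the one specified in Dodona's JSON schema.

```python
#!/usr/bin/env python3
import json
import subprocess
import sys

# Read the JSON-formatted submission metadata from standard input.
metadata = json.load(sys.stdin)
source = metadata["source"]  # path to the submitted source code (assumed key)

# Toy black-box test: run the submission and compare its output with
# the expected output; a robust judge must survive any submission.
expected = "42\n"
try:
    completed = subprocess.run(
        ["python3", source], capture_output=True, text=True,
        timeout=metadata.get("time_limit", 10),
    )
    generated = completed.stdout
    status = "correct" if generated == expected else "wrong"
except subprocess.TimeoutExpired:
    generated, status = "", "time limit exceeded"

# Emit JSON-formatted feedback on standard output: one tab containing
# one context with one test case and one test (fields simplified).
feedback = {
    "accepted": status == "correct",
    "status": status,
    "groups": [{"groups": [{"groups": [{
        "tests": [{"expected": expected, "generated": generated}],
    }]}]}],
}
json.dump(feedback, sys.stdout)
```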
Where automatic assessment and feedback generation are outsourced to the judge linked to an assignment, Dodona itself takes up the responsibility for rendering the feedback. This frees judge developers from putting effort into feedback rendering and gives a coherent look-and-feel, even for students that solve programming assignments assessed by different judges. Because the way feedback is presented is very important (Mani et al., 2014), we took great care in designing how feedback is displayed to make its interpretation as easy as possible (Figure 3). Differences between generated and expected output are automatically highlighted for each failed test (Myers, 1986), and users can swap between displaying the output lines side-by-side or interleaved to make differences easier to compare. We even provide specific support for highlighting differences between tabular data such as CSV files, database tables and dataframes. Users have the option to dynamically hide contexts whose test cases all succeeded, allowing them to immediately pinpoint reported mistakes in feedback that contains many passed test cases. To ease debugging the source code of submissions for Python assignments, the Python Tutor (Guo, 2013) can be launched directly from any context with a combination of the submitted source code and the test code from the context. Students typically report this as one of the most useful features of Dodona.
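The idea behind the highlighting can be illustrated with a line-based diff. The toy sketch below uses Python's standard difflib module (whose matching algorithm differs from Myers' diff, but serves the same purpose) to mark the lines on which generated and expected output disagree; Dodona's actual rendering is richer.

```python
import difflib

expected = "1\n2\n3\n"
generated = "1\n2\n4\n"

# Print a unified diff of expected versus generated output; lines
# starting with "-" or "+" are the ones a renderer would highlight.
for line in difflib.unified_diff(
    expected.splitlines(), generated.splitlines(),
    fromfile="expected", tofile="generated", lineterm="",
):
    print(line)
```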

Content management
Where courses are created and managed in Dodona itself, other content is managed in external git repositories (Figure 4). In this distributed content management model, a repository either contains a single judge or a collection of learning activities: reading activities and/or programming assignments. Setting up a webhook for the repository guarantees that any changes pushed to its default branch are automatically and immediately synchronized with Dodona. This even works without the need to make repositories public, as they may contain information that should not be disclosed, such as programming assignments that are under construction, contain model solutions, or will be used during tests or exams. Instead, a Dodona service account must be granted push/pull access to the repository. Some settings of a learning activity can be modified through the web interface of Dodona, but any changes are always pushed back to the repository in which the learning activity is configured, so that the repository always remains the master copy.

Figure 3: Dodona rendering of feedback generated by the judge that assessed a submission for the Python programming assignment "Curling". The judge split its feedback across three tabs, one for each function that needs to be implemented for this assignment: isinside, isvalid and score. All tests under the isinside and isvalid tabs passed, but 48 tests under the score tab failed, as can be seen immediately from the badge in the tab header. Dodona also added a fourth tab "Code" that displays the source code of the submission with annotations added during automatic and/or manual assessment (Figure 7). Green/red vertical lines on the left reflect the grouping of test cases into execution contexts (here each execution context contains a single test case). Dodona automatically highlighted the differences between the generated and expected return values of the first and third (failed) test cases, and the judge used unstructured HTML snippets to add a graphical representation (SVG) of the curling stone positions that are passed as arguments to the score function for these failed test cases. In addition to highlighting differences between the generated and expected return values of the first (failed) test case, the judge also added an unstructured text snippet indicating that a tuple was expected (not a list).
We experienced that working with git has a learning curve for some content creators, who otherwise definitely benefit from its version control capabilities. Due to the distributed nature of content management, creators also keep ownership over their content and control who may co-create. After all, access to a repository is completely independent from access to its learning activities that are published in Dodona. The latter is part of the configuration of learning activities, with the option to either share learning activities so that all teachers can include them in their courses, or to restrict inclusion of learning activities to courses that are explicitly granted access. Dodona automatically stores metadata about all learning activities, such as content type, natural language, programming language and repository, to increase their findability in our large collection. Learning activities may also be tagged with additional labels as part of their configuration.
Any repository containing learning activities must have a predefined directory structure. Directories that contain a learning activity also have their own internal directory structure that includes a description in Markdown or HTML. Descriptions may reference data files and multimedia content included in the repository, and such content can be shared across all learning activities in the repository. Embedded images are automatically encapsulated in a responsive lightbox to improve readability. Mathematical formulas in descriptions are supported through MathJax (Cervone, 2012).
While reading activities only consist of descriptions, programming assignments need an additional assessment configuration that sets a programming language and a judge. The configuration may also set a Docker image, a time limit, a memory limit, and whether Internet access is granted to the container that is instantiated from the image, but these settings have proper default values. Judges, for example, have a default image that is used if the configuration of a programming assignment does not specify one explicitly. Dodona builds the available images from Dockerfiles specified in a separate git repository. The configuration might also provide additional assessment resources: files made accessible to the judge during assessment. The specification of how these resources must be structured and how they are used during assessment is completely up to the judge developers. Finally, the configuration might also contain boilerplate code: a skeleton students can use to start their implementation, which is provided in the code editor along with the description.
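For illustration, the sketch below shows what such an assessment configuration could contain, written as a Python dict for consistency with the other examples. The field names are hypothetical and do not necessarily match the actual configuration format.

```python
# Hypothetical assessment configuration for a programming assignment.
config = {
    "programming_language": "python",
    "judge": "python-judge",            # which judge assesses submissions
    "docker_image": None,               # None: fall back to the judge's default image
    "time_limit": 10,                   # seconds
    "memory_limit": 256 * 1024 * 1024,  # bytes
    "internet": False,                  # no network access inside the container
    "resources": ["tests.yaml"],        # extra files made available to the judge
    "boilerplate": "def score(stones):\n    ...\n",
}
```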
Taken together, a Docker image, a judge and a programming assignment configuration (including both a description and an assessment configuration) constitute a task package as defined by Verhoeff (2008): a unit Dodona uses to render the description of the assignment and to automatically assess its submissions. However, Dodona's layered design embodies the separation of concerns (Laplante, 2007) needed to develop, update and maintain the three modules in isolation and to maximize their reuse: multiple judges can use the same Docker image and multiple programming assignments can use the same judge. Related to this, an explicit design goal for judges is to make the assessment configuration for individual assignments as lightweight as possible. After all, minimal configurations reduce the time and effort teachers and instructors need to create programming assignments that support automated assessment. Sharing data files and multimedia content among the programming assignments in a repository also implements the inheritance mechanism for bundle packages as hinted at by Verhoeff (2008). Another form of inheritance is specifying default assessment configurations at the directory level, which takes advantage of the hierarchical grouping of learning activities in a repository to share common settings.
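Directory-level defaults boil down to a simple merge: an assignment's own configuration overrides the defaults it inherits from enclosing directories. A minimal sketch, with the same hypothetical field names as above:

```python
def effective_config(directory_defaults, assignment_config):
    """Merge directory-level defaults with an assignment's own
    configuration; assignment settings take precedence."""
    return {**directory_defaults, **assignment_config}

defaults = {"programming_language": "python", "judge": "python-judge", "time_limit": 10}
assignment = {"time_limit": 30, "internet": True}

print(effective_config(defaults, assignment))
# {'programming_language': 'python', 'judge': 'python-judge',
#  'time_limit': 30, 'internet': True}
```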

Internationalization and localization
Internationalization (i18n) is a shared responsibility between Dodona, learning activities and judges. All boilerplate text in the user interface that comes from Dodona itself is supported in English and Dutch, and users can select their preferred language. Content creators can specify descriptions of learning activities in both languages, and Dodona will render a learning activity in the user's preferred language if available. When users submit solutions for a programming assignment, their preferred language is passed as submission metadata to the judge. It is then up to the judge to take this information into account while generating feedback.
Dodona always displays localized deadlines based on a time zone setting in the user profile, and users are warned when the current time zone detected by their browser differs from the one in their profile.
Figure 5: A student (Matilda) previously asked a question that has already been answered by her teacher (Miss Honey). Based on this response, the student is now asking a follow-up question, which can be formatted using Markdown.

Questions, answers and code reviews
A downside of using discussion forums in programming courses is that students can ask questions about programming assignments that are either disconnected from their current implementation or contain code snippets that may give away (part of) the solution to other students (Nandi et al., 2012). Dodona therefore allows students to address teachers with questions they attach directly to their submitted source code. We support both general questions and questions linked to specific lines of their submission (Figure 5). Questions are written in Markdown (e.g., to include markup, tables, syntax-highlighted code snippets or multimedia), with support for MathJax (e.g., to include mathematical formulas).
Teachers are notified whenever there are pending questions (Figure 1). They can process these questions from a dedicated dashboard with live updates (Figure 6). The dashboard immediately guides them from an incoming question to the location in the source code of the submission it relates to, where they can answer the question in a similar way as students ask questions. To avoid questions being inadvertently handled simultaneously by multiple teachers, questions have a three-state lifecycle: pending, in progress and answered. In addition to teachers changing question states while answering them, students can also mark their own questions as answered. The latter might reflect the rubber duck debugging effect (Hunt, 1999) that is triggered when students are forced to explain a problem to someone else while asking questions in Dodona. Teachers can (temporarily) disable the option for students to ask questions in a course, e.g. when a course is over, or during hands-on sessions or exams when students are expected to ask questions face-to-face rather than online.
Manual source code annotations from students (questions) and teachers (answers) are rendered in the same way as source code annotations resulting from automated assessment. They are mixed into the source code displayed in the "Code" tab, showing their complementary nature. It is not required that students take the initiative for the conversation: teachers can also start adding source code annotations while reviewing a submission. Such code reviews are used as a building block for manual assessment.
Figure 6: Live-updated dashboard showing all incoming questions in a course while asking questions is enabled. Questions are grouped into three categories: unanswered, in progress and answered.

Manual assessment
Teachers can create an evaluation for a series to manually assess student submissions for its programming assignments after a specific period, typically following the deadline of some homework, an intermediate test or a final exam. The evaluation embodies all programming assignments in the series and a group of students that submitted solutions for these assignments. Because a student may have submitted multiple solutions for the same assignment, the last submission before a given deadline is automatically selected for each student and each assignment in the evaluation. This automatic selection can be manually overruled afterwards. The evaluation deadline defaults to the deadline set for the associated series, if any, but an alternative deadline can be selected as well.
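The automatic selection rule is straightforward: per (student, assignment) pair, keep the latest submission at or before the deadline. A minimal sketch, with assumed data structures:

```python
def select_for_evaluation(submissions, deadline):
    """submissions: iterable of dicts with 'student', 'assignment' and
    'submitted_at' (a datetime) keys. Returns, per (student, assignment)
    pair, the last submission at or before the deadline."""
    selected = {}
    for submission in submissions:
        if submission["submitted_at"] > deadline:
            continue  # submitted after the deadline: never selected
        key = (submission["student"], submission["assignment"])
        best = selected.get(key)
        if best is None or submission["submitted_at"] > best["submitted_at"]:
            selected[key] = submission
    return selected
```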
Evaluations support two-way navigation through all selected submissions: per assignment and per student. For evaluations with multiple assignments, it is generally recommended to assess per assignment and not per student, as students can build a reputation throughout an assessment (Malouff and Thorsteinsson, 2016). As a result, they might be rated more favorably for a moderate solution if they had excellent solutions for assignments that were assessed previously, and vice versa (Malouff et al., 2013). Assessment per assignment breaks this reputation effect, as it interferes less with the quality of previously assessed assignments from the same student. Possible bias from the same sequence effect is further reduced during assessment per assignment, as students are visited in random order for each assignment in the evaluation. In addition, anonymous mode can be activated as a measure to eliminate the actual or perceived halo effect conveyed through seeing a student's name during assessment (Lebuda and Karwowski, 2013). While anonymous mode is active, all students are automatically pseudonymized. Anonymous mode is not restricted to the context of assessment and can be used across Dodona, for example while giving in-class demos.
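A small sketch of the bias-reducing navigation order described above: each assignment gets its own fresh shuffle of the students, so no assessor sees the same student sequence twice. (The pseudonymization of anonymous mode is orthogonal and omitted here.)

```python
import random

def review_order(students, assignments, seed=None):
    """Return, per assignment, a fresh random order in which to visit
    student submissions during an evaluation."""
    rng = random.Random(seed)
    order = {}
    for assignment in assignments:
        shuffled = list(students)
        rng.shuffle(shuffled)  # independent shuffle per assignment
        order[assignment] = shuffled
    return order

print(review_order(["ada", "bob", "cleo"], ["ex1", "ex2"], seed=42))
```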
When reviewing a selected submission from a student, assessors have direct access to the feedback that was previously generated during automated assessment: source code annotations in the "Code" tab and other structured and unstructured feedback in the remaining tabs. Moreover, next to the feedback that was made available to the student, the specification of the assignment may also add feedback generated by the judge that is only visible to the assessor. Assessors might then complement the assessment made by the judge by adding source code annotations as formative feedback and by grading the evaluative criteria in a scoring rubric as summative feedback (Figure 7). Previous annotations can be reused to speed up the code review process, because remarks or suggestions tend to recur frequently when reviewing submissions for the same assignment. Grading requires setting up a specific scoring rubric for each assignment in the evaluation, as guidance for evaluating the quality of submissions (Dawson, 2017; Popham, 1997). The evaluation tracks which submissions have been manually assessed, so that analytics about the assessment progress can be displayed and multiple assessors can work simultaneously on the same evaluation, for example each taking one (part of a) programming assignment.

Figure 7: Manual assessment of a submission: a teacher (Miss Honey) is giving feedback on the source code by adding inline annotations and is grading the submission by filling in the scoring rubric that was set up for the programming assignment "The Feynman ciphers".
An important difference with automated assessment is that the feedback from manual assessment is delivered asynchronously. Students can only see manual source code annotations and grades after a teacher has explicitly published the feedback from an evaluation. This allows publishing the feedback from manual assessment to all students at once. In response, students can start asking questions about the feedback and grades they received, provided that the option to ask questions is enabled in the course. An evaluation also provides teachers with summary statistics and overview reports that can be exported to external grade books.

Implementation
Dodona has a multi-tier service architecture, which delegates separate parts of the application to dedicated servers or virtual machines. This increases robustness to failure and improves reliability and scalability when serving hundreds of concurrent users. More specifically, the web server, database (MySQL), caching system (Memcached) and Python Tutor each run on their own machine. In addition, a scalable pool of interchangeable worker servers is available to automatically assess incoming student submissions.
The web server is the only public-facing part of Dodona, running a Ruby on Rails web application that is available on GitHub under the permissive MIT open-source license. The user interface is built using Bootstrap and follows the Google Material Design specifications. This ensures a coherent and accessible design that works in all modern web browsers. Dark mode is supported for programmers who fear that light attracts bugs. Next to the graphical web interface, Dodona also provides an application programming interface (API) using JSON, allowing external applications and scripts to interoperate with Dodona.
Software developers outside the core Dodona development team have used the API to implement IDE plug-ins for Visual Studio Code, DrRacket and JetBrains IDEs like IntelliJ IDEA, PyCharm and WebStorm. These IDE extensions directly embed support for submitting solutions and feedback notifications into the environment where students work on their assignments. The plug-ins hook onto the Dodona API with secure authentication through API tokens. In addition, Dodona provides LTI 1.3 support for seamless integration of courses and learning activities in external learning management systems through single sign-on and deep linking.
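A hedged sketch of how a script might talk to the JSON API with such a token; the endpoint path and response fields below are assumptions for illustration, not the documented API.

```python
import requests

API_TOKEN = "..."  # personal API token, generated in the user's profile
headers = {
    "Authorization": API_TOKEN,
    "Accept": "application/json",
}

# Fetch the courses visible to the authenticated user
# (hypothetical endpoint and response fields).
response = requests.get("https://dodona.ugent.be/courses.json", headers=headers)
response.raise_for_status()
for course in response.json():
    print(course.get("name"))
```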

Reliability and security
Dodona needs to operate in a challenging environment where students simultaneously submit untrusted code to be executed on its servers ("remote code execution by design") and expect automatically generated feedback, ideally within a few seconds. Many design decisions are therefore aimed at maintaining and improving the reliability and security of its systems.
With respect to reliability, Dodona minimizes the risk that its systems get overwhelmed by requests and submissions. This is especially important during tests and exams, when large groups of students simultaneously submit solutions during the final minutes before a deadline. System overload is mitigated through a multi-tier architecture that puts student submissions in a job queue upon arrival. Worker servers then assess submissions on a first come, first served basis. A sudden influx of submissions may temporarily lengthen the job queue, but apart from slightly longer waiting times for feedback delivery, this has no other adverse side effects. To prevent resource hogging by individual submissions, all source code is assessed server-side in separate Docker containers with strict limits on disk, memory and CPU usage. These limits restrict the impact of processing individual submissions on other jobs running on the same server. If, despite everything, things do go wrong and a worker server succumbs under the load, the other worker servers from the pool pick up the slack at the cost of only slightly longer wait times. Moreover, in such cases, strict operations monitoring notifies server administrators at once.
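A sketch of what running one queued assessment job inside a resource-limited container could look like. The Docker CLI flags below are standard, but the entry point and limits are hypothetical; Dodona's actual invocation may differ.

```python
import subprocess

def assess_in_container(image, workdir, time_limit=60):
    """Run one assessment job in an isolated container with strict
    resource limits, returning the completed process."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",              # no internet unless granted
            "--memory", "512m",               # memory cap
            "--cpus", "1",                    # CPU cap
            "-v", f"{workdir}:/home/runner",  # submission, judge and resources
            image, "/main.sh",                # hypothetical entry point
        ],
        capture_output=True, text=True, timeout=time_limit,
    )
```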
Security is a much broader topic and significantly harder to fully safeguard. Looking at the OWASP Top Ten web application security risks, almost all of them apply to Dodona and mandate a mitigation strategy. We prevent broken access control by employing the widely used pundit library to make account permissions clearly readable in policy files. Access control is checked upon each request. Standard implementations of security algorithms and standard configurations of applications are used to preclude cryptographic failures. ActiveRecord and standard CSRF protection from Rails avert injection attacks. The additional risk of injections through user-provided content is mitigated by sandboxing content in iframes and using a strict content security policy (CSP). Content that cannot be sandboxed properly is sanitized using standard Rails support. We follow standard Rails development patterns to ensure a secure design for Dodona. All software updates are reviewed by at least two other developers familiar with the Dodona codebase and tested automatically through continuous integration pipelines. The codebase is screened by static source code analysis to detect known vulnerabilities. Dependabot automatically monitors external dependencies to assure that no outdated libraries or libraries with known vulnerabilities are used (Alfadel et al., 2021). This also applies to the Docker images in which student submissions are assessed. Identification and authentication failures are bypassed through decentralized authentication and authorization. Finally, we use Grafana to actively monitor the application and the servers through internal dashboards that provide up-to-date status reports and to automatically trigger alerts if system indicators show abnormal trends or events. Additional alerts are triggered each time an error occurs in the Dodona web application, informing us about bugs or other problems.
Managing active learning in introductory programming: a case study

Since the academic year 2011-2012 we have organized an introductory Python course at Ghent University (Belgium) with a strong focus on active and online learning. Initially the course was offered twice a year in the first and second term, but from academic year 2014-2015 onwards it was only offered in the first term. The course is taken by a mix of undergraduate, graduate and postgraduate students enrolled in various study programmes (mainly formal and natural sciences, but not computer science), with 442 students enrolled for the 2021-2022 edition.

Course structure
Each course edition has a fixed structure, with 13 weeks of educational activities subdivided into two successive instructional units that each cover five topics of the Python programming language (one topic per week), followed by a graded test about all topics covered in the unit (Figure 8). The final exam at the end of the term evaluates all topics covered in the entire course. Students who fail the course during the first exam in January can take a resit exam in August/September that gives them a second chance to pass. Each week in which a new programming topic is covered, students must try to solve six programming assignments on that topic before a deadline one week later. That results in 60 mandatory assignments across the semester. Following the flipped classroom strategy (Akçayır and Akçayır, 2018; Bishop and Verleger, 2013), students prepare themselves to achieve this goal by reading the textbook chapters covering the topic. Lectures are interactive programming sessions that aim at bridging the initial gap between theory and practice, advancing concepts, and engaging in collaborative learning (Tucker, 2012). Along the same lines, the first assignment for each topic is an ISBN-themed programming challenge whose model solution is shared with the students, together with an instructional video that works step-by-step towards the model solution. As soon as students feel they have enough understanding of the topic, they can start working on the five remaining mandatory assignments. Students can work on their programming assignments during weekly computer labs, where they can collaborate in small groups and ask help from teaching assistants. They can also work on their assignments and submit solutions outside lab sessions. In addition to the mandatory assignments, students can further elaborate their programming skills by tackling additional programming exercises that they select from a pool of over 850 exercises linked to the ten programming topics. Submissions for these additional exercises are not taken into account in the final grade.

Figure 8: Top: Structure of the Python course that runs each academic year across a 13-week term (September-December). Programming assignments from the same Dodona series are stacked vertically. Students submit solutions for ten series with six mandatory assignments, two tests with two assignments and an exam with three assignments. They can also take a resit exam with three assignments in August/September if they failed the first exam in January. Each series of mandatory assignments has a dedicated topic: (1) variables, expressions and statements, (2) conditional statements, (3) loops, (4) strings, (5) functions, (6) lists and tuples, (7) more about functions and modules, (8) sets and dictionaries, (9) text files, (10) object-oriented programming. Bottom: Heatmap from the Dodona learning analytics page showing the distribution per day of all 331 734 solutions submitted during the 2021-2022 edition of the course (442 students). The darker the color, the more solutions were submitted that day; a lighter shade of fuchsia means few solutions were submitted, and a light gray square means none. Darker squares mark the weekly lab sessions for different groups on Monday afternoon, Friday morning and Friday afternoon. Weekly deadlines for mandatory assignments fall on Tuesdays at 22:00. Three exam sessions for different groups took place in January. Activity is low during exam periods, except on days when an exam was taken. The course is not taught in the second term, so this low-activity period was collapsed. Two more exam sessions for different groups in August/September granted an extra chance to students who failed the exam in January.

Assessment, feedback and grading
We use the online learning environment Dodona to promote active learning through problem solving (Prince, 2004). Each course edition has its own dedicated course in Dodona, with a learning path containing all mandatory, test and exam assignments, grouped into series with corresponding deadlines. Mandatory assignments for the first unit are published at the start of the semester, and those for the second unit after the test of the first unit. For each test and exam we organize multiple sessions for different groups of students. Assignments for test and exam sessions are provided in a hidden series that is only accessible to students participating in the session through a shared secret link. The test and exam assignments are published afterwards for all students, when grades are announced. Students can see class progress while working on their mandatory assignments, as a nudge to avoid procrastination. Only teachers can see class progress for test and exam series, so as not to accidentally stress out students. For the same reason, we intentionally organize tests and exams following exactly the same procedure, so that students can take high-stakes exams in a familiar context and adjust their approach based on previous experiences. The only difference is that test assignments are not as hard as exam assignments, as students are still in the midst of learning programming skills when tests are taken.
Students are encouraged to use an integrated development environment (IDE) to work on their programming assignments. IDEs bundle a battery of programming tools to support today's generation of software developers in writing, building, running, testing and debugging software. Working with such tools can be a true blessing for both seasoned and novice programmers, but there is no silver bullet (Brooks and Kugler, 1987). Learning to code remains inherently hard (Kelleher et al., 2002) and poses challenges that differ from those of reading and learning natural languages (Fincher, 1999). As an additional aid, students can continuously submit (intermediate) solutions for their programming assignments and immediately receive automatically generated feedback upon each submission, even during tests and exams. Guided by that feedback, they can track down potential errors in their code, remedy them and submit updated solutions. There is no restriction on the number of solutions that can be submitted per assignment. All submitted solutions are stored, but for each assignment only the last submission before the deadline is taken into account to grade students. This allows students to update their solutions after the deadline (i.e. after model solutions are published) without impacting their grades, as a way to further practice their programming skills. One effect of active learning, triggered by mandatory assignments with weekly deadlines and intermediate tests, is that most learning happens during the term (Figure 8). In contrast to other courses, students do not spend a lot of time practicing their coding skills for this course in the days before an exam. We explicitly want to encourage this behavior, because we strongly believe that one cannot learn to code in a few days' time (Norvig, 2001).
We originally developed a custom Python judge for SPOJ (Sphere Online Judge) to automatically assess student submissions for our own collection of programming assignments (Kosowski et al., 2008). However, after five years of running the course, we felt that the shortcomings of SPOJ and other existing online learning environments prevented us from expanding our vision on active learning and providing rich feedback just-in-time and in a scalable way. In response, we started developing Dodona in the spring of 2016, ported our Python judge and collection of programming assignments over the summer, and ran our first Python course with Dodona during the first term of academic year 2016-2017. To this day, structurally designing and modeling the Python course and its pedagogy remains a driving force for further extending Dodona and for validating novel or improved features in educational practice. Along the way, other computer science, data science and statistics courses adopted Dodona, with a variety of learning contexts, programming languages and assessment requirements. But more on this in the next section.
For the assessment of tests and exams, we follow the line of thought that human expert feedback through source code annotations is a valuable complement to feedback coming from automated assessment, and that human interpretation is an absolute necessity when it comes to grading (Ala-Mutka, 2005; Jackson and Usher, 1997; Staubitz et al., 2015). We shifted from paper-based to digital code reviews and grading when support for manual assessment was released in version 3.7 of Dodona (summer 2020). Although online reviewing positively impacted our productivity, the biggest gain did not come from an immediate speed-up in the process of generating feedback and grades compared to the paper-based approach. While time-on-task remained about the same, our online source code reviews were much more elaborate than what we produced before on printed copies of student submissions. This was triggered by the improved reusability of digital annotations and the foresight of streamlined feedback delivery. Where delivering custom feedback only requires a single click after the assessment of an evaluation has been completed in Dodona, it took us much more effort to distribute our paper-based feedback before. Students were direct beneficiaries of more and richer feedback, as observed from the fact that 75% of our students looked at their personalized feedback within 24 hours after it had been released, before we even published grades in Dodona. What did not change is that we complement personalized feedback with collective feedback sessions in which we discuss model solutions for test and exam assignments, and that we receive few questions from students about their personalized feedback. As a future development, we hope to reduce the time spent on manual assessment through improved computer-assisted reuse of digital source code annotations in Dodona.
We accept that we primarily rely on automated assessment as a first step in providing formative feedback while students work on their mandatory assignments. After all, a back-of-the-envelope calculation tells us it would take 72 full-time equivalents (FTE) to generate the same amount of manual feedback for mandatory assignments as we do for tests and exams. In addition to volume, automated assessment also yields the responsiveness needed to establish an interactive feedback loop throughout the iterative software development process, while it still matters to students and in time for them to pay attention to further learning or receive further assistance (Gibbs and Simpson, 2005). Automated assessment thus allows us to motivate students to work through enough programming assignments and to stimulate their self-monitoring and self-regulated learning (Pintrich, 1995; Schunk and Zimmerman, 1994). It triggers additional questions from students that we manage to respond to with one-to-one personalized human tutoring, either synchronously during hands-on sessions or asynchronously through Dodona's Q&A module. We observe that individual students seem to have a strong bias towards either asking for face-to-face help during hands-on sessions or asking questions online. This could be influenced by the time when they mainly work on their assignments, by their way of collaborating on assignments, or by reservations because of perceived threats to self-esteem or social embarrassment (Karabenick and Knapp, 1991; Newman and Schwager, 1993).
In computing a final score for the course, we try to find an appropriate balance between stimulating students to find solutions for programming assignments themselves and encouraging them to collaborate with and learn from peers, instructors and teachers while working on assignments. The final score is computed as the sum of a score obtained for the exam (80%) and a score for each unit that combines the student's performance on the mandatory and test assignments (10% per unit). We use Dodona's grading module to determine scores for tests and exams based on correctness, programming style, choices made between different programming techniques and the overall quality of the implementation.
The score for a unit is calculated as the score t for the two test assignments multiplied by the fraction f of mandatory assignments the student has solved correctly. A solution for a mandatory assignment is considered correct if it passes all unit tests. Evaluating mandatory assignments therefore does not require any human intervention, except for writing unit tests when designing the assignments, and is performed entirely by our Python judge. In our experience, most students traditionally perform much better on mandatory assignments than on test and exam assignments (Glass and Kang, 2022), given the possibilities for collaboration on mandatory assignments.
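Putting the pieces together, the grading scheme amounts to the small computation sketched below (with the symbols t and f as introduced above, and all scores normalized to [0, 1] for illustration).

```python
def unit_score(t, f):
    """t: score for the unit's two test assignments, in [0, 1];
    f: fraction of the unit's mandatory assignments solved correctly."""
    return t * f

def final_score(exam, units):
    """exam: exam score in [0, 1]; units: one (t, f) pair per unit.
    The exam counts for 80%, each of the two units for 10%."""
    return 0.8 * exam + sum(0.1 * unit_score(t, f) for t, f in units)

# A student with a strong exam who solved all mandatory assignments of
# unit 1 but only half of those of unit 2:
print(final_score(0.85, [(0.9, 1.0), (0.7, 0.5)]))  # 0.805
```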

Open and collaborative learning environment
We strongly believe that effective collaboration among small groups of students is beneficial for learning (Prince, 2004), and encourage students to collaborate and ask questions to tutors and other students during and outside lab sessions. We also demonstrate how they can embrace the collaborative coding and pair programming services provided by modern integrated development environments (Hanks et al., 2011; Williams et al., 2002). But we recommend that they collaborate in groups of no more than three students, and that they exchange and discuss ideas and strategies for solving assignments rather than sharing literal code with each other. After all, our main reason for working with mandatory assignments is to give students sufficient opportunity to learn topic-oriented programming skills by applying them in practice, and shared solutions spoil the learning experience. The factor f in the score for a unit encourages students to keep fine-tuning their solutions for programming assignments until all test cases succeed before the deadline passes. But maximizing that factor without properly learning the programming skills will likely yield a low test score t, and thus an overall low score for the unit, even if many mandatory exercises were solved correctly.
Fostering an open collaboration environment to work on mandatory assignments with strict deadlines, and taking them into account when computing the final score, is a potential promoter of plagiarism. Using them as a weight factor for the test score rather than as an independent score item should however promote learning by avoiding that plagiarism is rewarded. It takes some effort to properly explain this to students. We initially used Moss (Schleimer et al., 2003) and now use Dolos (Maertens et al., 2022) to monitor submitted solutions for mandatory assignments, both before and at the deadline. The solution space for the first few mandatory assignments is too small to link high similarity to plagiarism: submitted solutions only contain a few lines of code and the diversity of implementation strategies is small. But at some point, as the solution space broadens, we start to see highly similar solutions that are reliable signals of code exchange among larger groups of students. Strikingly, this usually happens among students enrolled in the same study programme (Figure 9). As soon as this happens, typically in week 3 or 4 of the course, plagiarism is discussed during the next lecture. Usually this is a lecture about working with the string data type, so we can introduce plagiarism detection as a possible application of string processing.
In an intermezzo entitled "copy-paste ≠ learn to code" we show students some pseudonymized Dolos plagiarism graphs that act as mirrors to make them reflect upon which node in the graph they could be (Figure 9). We stress that the learning effect dramatically drops in groups of four or more students. Typically we notice that in such a group only one or a few students make the effort to learn to code, while the other students usually piggyback by copy-pasting solutions. We make students aware that understanding someone else's code for programming assignments is a lot easier than trying to find solutions themselves. Over the years, we have experienced that a lot of students are caught in the trap of genuinely believing that being able to understand code is the same as being able to write code that solves a problem, until they take a test at the end of a unit. That's where the test score t comes into play. After all, the goal of summative tests is to evaluate whether individual students have acquired the skills to solve programming challenges on their own.

Figure 9: Dolos plagiarism graphs for the Python programming assignment "Pyramidal constants" that was created and used for a test of the 2020-2021 edition of the course (left) and reused as a mandatory assignment in the 2021-2022 edition (right). Graphs were constructed from the last submission before the deadline of 142 and 382 students respectively. Nodes represent student submissions and their colors represent study programmes, as taken from user labels in Dodona. Edges connect highly similar pairs of submissions, with the similarity threshold set to 0.8 in both graphs. Edge directions are based on submission timestamps in Dodona. Clusters of connected nodes are highlighted with a distinct background color and have one node with a solid border that indicates the first correct submission among all submissions in that cluster. All students submitted unique solutions during the test, except for two students who confessed they had exchanged a solution during the test. Submissions for the mandatory assignment show that most students work either individually or in groups of two or three students, but we also observe some clusters of four or more students who exchanged solutions and submitted them with only minor variations in the type and amount of modifications. This case was used to warn students about the negative learning effect of copying solutions from each other.
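The clustering step behind these graphs is easy to reproduce: treat each submission as a node, connect pairs whose similarity exceeds the threshold, and read off connected components. The sketch below does this in Python with networkx; the CSV of pairwise similarity scores (with columns left, right and similarity) is a hypothetical export format for illustration, not Dolos's actual output.

```python
import csv
import networkx as nx

THRESHOLD = 0.8  # the similarity cut-off used in Figure 9

def plagiarism_clusters(pairs_csv: str, min_size: int = 4) -> list[set[str]]:
    """Build an undirected similarity graph from pairwise scores and
    return clusters (connected components) of at least min_size
    submissions, the group size we worry about most."""
    graph = nx.Graph()
    with open(pairs_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            if float(row["similarity"]) >= THRESHOLD:
                graph.add_edge(row["left"], row["right"])
    return [c for c in nx.connected_components(graph) if len(c) >= min_size]
```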
When talking to students about plagiarism, we also point out that the plagiarism graphs are directed graphs, indicating which student is the potential source of exchanging a solution among a cluster of students. We specifically address these students by pointing out that they are probably good at programming and might want to exchange their solutions with other students as a way to help their peers. But instead of really helping them out, they actually take away learning opportunities from their fellow students by giving away the solution as a spoiler. Stated differently, they help maximize the factor m but effectively also reduce the test score t, where both factors need to be high to yield a high score for the unit. After this lecture, we usually notice a sharp decline in the number of plagiarized solutions.
The goal of plagiarism detection at this stage is prevention rather than penalisation, because we want students to take responsibility for their learning. The combination of realizing that teachers and instructors can easily detect plagiarism and an upcoming test that evaluates whether students can solve programming challenges on their own usually has an immediate and persistent effect, reducing cluster sizes in the plagiarism graphs to at most three students. At the same time, the signal is given that plagiarism detection is one of the tools we have to detect fraud during tests and exams. The entire group of students is only addressed once about plagiarism, without going into detail about how plagiarism detection itself works, because we believe that overemphasizing this topic is not very effective, and explaining how it works might drive students towards spending time thinking about how they could bypass the detection process: time better spent on learning to code. Every three or four years we see a persistent cluster of students exchanging code for mandatory assignments over multiple weeks. If this is the case, we address these students individually to point out their responsibilities again, differentiating between students who share their solution and students who receive solutions from others.
Tests and exams, on the other hand, are taken on campus under human surveillance and allow no communication with fellow students or other persons. Students can work on their personal computers and get exactly two hours to solve two programming assignments during a test, and three hours and thirty minutes to solve three programming assignments during an exam. Tests and exams are "open book/open Internet", so any hard-copy and digital resources can be consulted while solving test or exam assignments. Students are instructed that they can only be passive users of the Internet: all information available on the Internet at the start of a test or exam can be consulted, but no new information can be added. When taking over code fragments from the Internet, students have to add a proper citation as a comment in their submitted source code. After each test and exam, we again use Moss/Dolos to detect and inspect highly similar code snippets among submitted solutions and to find convincing evidence that they result from exchange of code or other forms of interpersonal communication (Figure 9). If we catalog cases as plagiarism beyond reasonable doubt, the examination board is informed to take further action (Maertens et al., 2022).

Workload for running a course edition
To organize "open book/open Internet" tests and exams that are valid and reliable, we always create new assignments and avoid assignments whose solutions or parts thereof are readily available online. At the start of a test or exam, we share a secret link that gives students access to the assignments in a hidden series on Dodona.
For each edition of the course, mandatory assignments were initially a combination of selected test and exam exercises reused from the previous edition of the course and newly designed exercises: the former to give students an idea about the level of exercises they can expect during tests and exams, the latter to avoid solution slippage. As feedback for the students, we publish sample solutions for all mandatory exercises after the weekly deadline has passed. This also makes clear that students must strictly adhere to deadlines, because sample solutions are available afterwards. As deadlines are very clear and adjusted to timezone settings in Dodona, we never experience discussions with students about deadlines.
After nine editions of the course, we felt we had a large enough portfolio of exercises to start reusing mandatory exercises from four or more years ago instead of designing new exercises for each edition. However, we still continue to design new exercises for each test and exam. After each test and exam, the exercises are published and students receive manual reviews on the code they submitted, on top of the automated feedback they already got during the test or exam. But in contrast to mandatory exercises, we do not publish sample solutions for test and exam exercises, so that these exercises can be reused during the next edition of the course. When students ask for sample solutions of test or exam exercises, we explain that we want to give the next generation of students the same learning opportunities they had.
So far, we have created more than 850 programming assignments for this introductory Python course alone. All these assignments are publicly shared on Dodona as open educational resources (Caswell et al., 2008; Downes, 2007; Hylén, 2021; Tuomi, 2013; Wiley et al., 2014). They are used in many other courses on Dodona (on average 10.8 courses per assignment) and by many students (on average 503.7 students and 4801.5 submitted solutions per assignment). We estimate that it takes about 10 person-hours on average to create a new assignment for a test or an exam: 2 hours for ideation, 30 minutes for implementing and tweaking a sample solution that meets the educational goals of the assignment and can be used to generate a test suite for automated assessment, 4 hours for describing the assignment (including background research), 30 minutes for translating the description from Dutch into English, one hour to configure support for automated assessment, and another 2 hours for a review of the result by an extra pair of eyes.
Generating a test suite usually takes 30 to 60 minutes for assignments that can rely on the basic test and feedback generation features built into the judge. The configuration for automated assessment might take 2 to 3 hours for assignments that require more elaborate test generation or that need to extend the judge with custom components for dedicated forms of assessment (e.g. assessing nondeterministic behavior) or feedback generation (e.g. generating visual feedback). Keuning et al. (2018) found that publications rarely describe how difficult and time-consuming it is to add assignments to automated assessment platforms, or even whether this is possible at all. The ease of extending Dodona with new programming assignments is reflected by the more than 10 thousand assignments that have been added to the platform so far. Our experience is that configuring support for automated assessment only takes a fraction of the total time for designing and implementing assignments for our programming course, and in absolute numbers stays far below the one person-week reported for adding assignments to Bridge (Bonar and Cunningham, 1988). Because the automated assessment infrastructure of Dodona provides common resources and functionality through a Docker container and a judge, the assignment-specific configuration usually remains lightweight. Only around 5% of the assignments need extensions on top of the built-in test and feedback generation features of the judge.
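Much of that lightweight configuration boils down to deriving expected outputs from the sample solution. The sketch below illustrates the general idea under our own assumptions: the module name solution, the function name gcd and the JSON layout are illustrative and do not reflect the actual test-suite format of Dodona's Python judge.

```python
import json
from importlib import import_module

def generate_test_suite(module: str, function: str,
                        inputs: list[tuple]) -> list[dict]:
    """Run the sample solution on a list of argument tuples and record
    the return values as (input, expected) test cases."""
    solve = getattr(import_module(module), function)
    return [{"input": list(args), "expected": solve(*args)} for args in inputs]

if __name__ == "__main__":
    # Hypothetical assignment: a gcd exercise whose reference solution
    # lives in solution.py and defines gcd(a, b).
    suite = generate_test_suite("solution", "gcd", [(12, 18), (7, 13), (0, 5)])
    print(json.dumps(suite, indent=2))
```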
So how much effort does it cost us to run one edition of our programming course? For the most recent 2021-2022 edition, we estimate about 34 person-weeks in total (Table 1), the bulk of which is spent on on-campus tutoring of students during hands-on sessions (30%), manual assessment and grading (22%), and creating new assignments (21%). About half of the workload (53%) is devoted to summative feedback through tests and exams: creating assignments, supervision, manual assessment and grading. Most of the other work (42%) goes into providing formative feedback through on-campus and online assistance while students work on their mandatory assignments. Out of 2215 questions that students
asked through Dodona's online Q&A module, 1983 (90%) were answered by teaching assistants and 232 (10%) were marked as answered by the student who originally asked the question. Because automated assessment provides first-line support, the need for human tutoring is already heavily reduced. We have drastically cut the time we initially spent on mandatory assignments by reusing existing assignments and because the Python judge is stable enough to require hardly any maintenance or further development.

Learning analytics and educational data mining
A longitudinal analysis of student submissions across the term shows that most learning happens during the 13 weeks of educational activities and that students don't have to catch up practicing their programming skills during the exam period (Figure 8). Active learning thus effectively avoids procrastination. We observe that students submit solutions every day of the week, with increased activity around hands-on sessions and in the run-up to the weekly deadlines (Figure 10). Weekends are also used to work on programming assignments, but students seem to value a good night's sleep. Throughout a course edition, we use Dodona's series analytics to monitor how students perform on our selection of programming assignments (Figure 11). This allows us to make informed decisions and appropriate interventions, for example when students experience issues with the automated assessment configuration of a particular assignment or when the original order of assignments in a series does not align with our design goal of presenting them in increasing order of difficulty. The first students who start working on assignments are usually good performers. Seeing these early birds having trouble with one of the assignments may give an early warning that action is needed, such as improving the problem specification, adding extra tips & tricks, or better explaining certain programming concepts to all students during lectures or hands-on sessions. Conversely, observing that many students postpone working on their assignments until just before the deadline might indicate that some assignments are simply too hard at that point in the students' learning pathway, or that completing the collection of programming assignments interferes with the workload from other courses. Such "deadline hugging" patterns are also a good breeding ground for students to resort to exchanging solutions among each other.
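Punchcard views like Figure 10 reduce to a simple aggregation over submission timestamps. A minimal pandas sketch, assuming only that submission timestamps can be exported from the platform:

```python
import pandas as pd

def punchcard(timestamps: pd.Series) -> pd.DataFrame:
    """Count submissions per (weekday, hour) cell, as displayed on
    Dodona's learning analytics punchcard."""
    t = pd.to_datetime(timestamps)
    grid = t.groupby([t.dt.day_name(), t.dt.hour]).size().unstack(fill_value=0)
    # Order the rows Monday through Sunday for readability.
    days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
    return grid.reindex(days)
```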
Using educational data mining techniques on historical data exported from several editions of the course, we further investigated which aspects of practicing programming skills promote or inhibit learning, and which have no or only a minor effect on the learning process (Van Petegem et al., 2022). It won't come as a surprise that mid-term test scores are good predictors of a student's final grade, because tests and exams are both summative assessments that are organized and graded in the same way. However, we found that organizing a final exam at the end of the term is still a catalyst of learning, even for courses with a strong focus on active learning during the weeks of educational activities.
In evaluating whether students gain deeper understanding by learning from their mistakes while working progressively through their programming assignments, we found that the old adage that practice makes perfect depends on what kind of mistakes students make. Learning to code requires mastering two major competences: i) getting familiar with the syntax and semantics of a programming language to express the steps for solving a problem in a formal way, so that the algorithm can be executed by a computer, and ii) problem solving itself. It turns out that staying stuck longer on compilation errors (mistakes against the syntax of the programming language) inhibits learning, whereas taking progressively more time to get rid of logical errors (reflective of solving a problem with a wrong algorithm) as assignments get more complex actually promotes learning. After all, time spent discovering solution strategies while thinking about logical errors can be reclaimed multifold when confronted with similar issues in later assignments (Glass and Kang, 2022). These findings neatly align with the claim of Edwards et al. (2018) that problem solving is a higher-order learning task in Bloom's taxonomy (analysis and synthesis) than language syntax (knowledge, comprehension, and application).
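One way to operationalize "staying stuck" from submission metadata is to measure, per student and per assignment, how long runs of consecutive submissions keep failing with the same class of error. The sketch below is our own illustration of that idea, not the exact analysis of Van Petegem et al. (2022); the status labels are assumed to come from the judge's feedback.

```python
from itertools import groupby

def longest_stuck_run(statuses: list[str], error: str) -> int:
    """Length of the longest run of consecutive submissions that all
    failed with the given error class (e.g. 'compilation error', or
    'wrong answer' as a proxy for logical errors)."""
    runs = [len(list(group)) for status, group in groupby(statuses)
            if status == error]
    return max(runs, default=0)

history = ["compilation error", "compilation error", "wrong answer",
           "wrong answer", "wrong answer", "correct"]
print(longest_stuck_run(history, "compilation error"))  # 2
print(longest_stuck_run(history, "wrong answer"))       # 3
```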
Using historical data from previous course editions, we can also make highly accurate predictions about which students will pass or fail the current course edition (Van Petegem et al., 2022). This can already be done a few weeks into the course, so remedial actions for at-risk students can be started well in advance. The approach is privacy-friendly, as we only need to process metadata on student submissions for programming assignments and results from automated and manual assessment extracted from Dodona. Given that cohort sizes are large enough, historical data from a single course edition already suffice to make accurate predictions.
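A minimal sketch of such a snapshot-based classifier is given below, with hypothetical per-week features (submission counts and numbers of correctly solved assignments up to the snapshot week) and stand-in data; the actual study evaluated several learners and feature sets, so this only illustrates the shape of the pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Stand-in feature matrix: 400 students x 8 features (submissions and
# correct solutions per week, weeks 1-4); replace with a real export.
X = rng.poisson(5.0, size=(400, 8)).astype(float)
# Stand-in pass/fail labels, loosely tied to activity for the demo.
y = (X[:, 1::2].sum(axis=1) + rng.normal(0, 2, 400) > 20).astype(int)

model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean())
```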

A broader perspective
At the release of Dodona version 6.0 (September 2022), eleven years after we introduced automated assessment in our Python programming course and six years after we started developing Dodona, both the list of supported features (Figure 12) and the adoption beyond our own course (Figure 13) have come a long way. The online learning platform is now used in more than 1 000 schools, colleges and universities, mainly across Flanders (Belgium) and the Netherlands, where 36 thousand students altogether have submitted more than 11 million solutions for programming assignments. Renewed interest in embedding computational thinking into formal education has definitely been an important stimulator for such a broad adoption (Wing, 2006). The careful design choices and state-of-the-art software development that gave Dodona the versatility, flexibility and accessibility needed to lower the barrier for educational technology to find its way beyond the context for which it was initially created (Rößling et al., 2008) also helped broaden adoption. This includes single sign-on and trusted identities through decentralized authentication, support for a wide variety of programming languages and assessment strategies through a generic infrastructure for automated assessment, transparent synchronization of content created in external git repositories, computer-assisted code reviews and grading, plagiarism detection, and interoperability with learning management systems and IDEs. To stimulate community building, we organize annual TeachMeets where teachers gather to share ideas and good practices for teaching with Dodona. These events spark new ideas that we can integrate into Dodona and that teachers can adopt in designing their own courses.
Dodona now houses a collection of 10 thousand reading activities and programming assignments with support for automated assessment, which allows teachers to compose learning paths that best fit their needs by mixing their own learning activities with activities designed by others. Needless to say, reuse of programming assignments is hampered by the fact that assignments target different audiences, learning contexts and learning objectives. We have however tried to make content creation, publishing and sharing as straightforward as possible, and we constantly seek new opportunities to make assignments more FAIR: findable, accessible, interoperable and reusable (Wilkinson et al., 2016). Effective information-seeking is currently supported through faceted search based on a flexible labeling scheme for learning activities, featured courses, and rich metadata about learning activities that teachers can consult on an information page containing author attribution, general assessment configuration details, model solutions, usage statistics and other background information provided by the authors.
The variety of programming languages used in education (Crick, 2017) is reflected by the many judges that have been developed for Dodona, ranging from general-purpose programming and scripting languages (Bash, C, C++, C#, Haskell, Java, JavaScript, Prolog, Python, Scheme) to languages dedicated to data science and statistics (R) (Nüst et al., 2020), databases (SQL) and web design (HTML & CSS). We can make three observations from using educational software testing to provide automated feedback on solutions that students submit for programming assignments. First, implementing a judge takes time and careful design. Setting up an initial prototype is easily done, which explains why there are so many (industrial) software testing frameworks around. But it is more challenging to make the judge generate rich feedback that supports student learning in the best possible way, while keeping assignment-specific assessment configurations as lightweight as possible by supporting many out-of-the-box test strategies ("fat judge / slim test suite"). To avoid frustrating students and teachers, judges used in educational practice must also deliver top quality in terms of speed, robustness and security (Peveler et al., 2019). Second, different educational judges share a lot of supported software testing strategies that are merely re-implemented to target different programming languages (Ala-Mutka, 2005; Caiza and Álamo Ramiro, 2013; Douce et al., 2005; Ihantola et al., 2010; Paiva et al., 2022; Wasik et al., 2018; Wilcox, 2016). Third, teachers often reuse programming assignments across different programming languages. This may require modifying the problem description and/or creating a separate assessment configuration for each target language.
To make the design of assignments with support for automated assessment less dependent on a specific programming language, Dodona also provides a judge based on TESTed: an educational test framework that unifies the specification of software tests across programming languages (Strijbol et al., 2022). This directly benefits educators: assignments can be used across programming languages, designing assignments for different programming languages can be done using the same test framework, and the cost of providing automated assessment for new programming languages is dramatically reduced, as TESTed implements the core components of automated assessment in a generic, language-agnostic way. It only needs a thin layer of language-specific configuration for each individual programming language and currently supports C, Haskell, Java, JavaScript, Kotlin and Python. Dodona itself also takes up some responsibilities for automated assessment through its generic infrastructure: provisioning a secure runtime environment and presenting feedback that is expressed in a standard way. At the same time, this infrastructure is not very restrictive as to how automated assessment can be performed.
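The core idea behind such a framework can be illustrated in a few lines: specify a test once in an abstract form and render it per target language through a thin language-specific layer. The sketch below is a toy illustration of that architecture, not TESTed's actual specification format or renderers.

```python
# Abstract, language-agnostic test case: call a function with given
# arguments and compare the result to an expected value.
TEST = {"function": "gcd", "arguments": [12, 18], "expected": 6}

def render_python(test: dict) -> str:
    args = ", ".join(repr(a) for a in test["arguments"])
    return f"assert {test['function']}({args}) == {test['expected']!r}"

def render_java(test: dict) -> str:
    args = ", ".join(str(a) for a in test["arguments"])
    return f"assertEquals({test['expected']}, Submission.{test['function']}({args}));"

# Thin per-language layer: everything else (running the code, comparing
# results, formatting feedback) can stay generic.
for language, render in [("python", render_python), ("java", render_java)]:
    print(f"{language}: {render(TEST)}")
```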

Conclusions
The formative assessment model of Nicol and Macfarlane-Dick and their seven principles of good feedback practice (Nicol and Macfarlane-Dick, 2006) align very well with how Dodona facilitates students in self-regulated learning while practicing their programming skills, and with how it provides information to teachers that helps shape their teaching. For students, active learning promises to reinforce learning by creating shorter feedback loops that help them make adjustments early on in the learning process. But scaling up feedback provisioning throughout the entire learning process might become a real bottleneck for teachers and instructors. Dodona therefore aims at saving valuable teacher time for maintaining a collaborative and responsive dialogue with students based on high-quality and timely feedback. However, while it may be the ultimate ideal for some, current educational technology does not yet allow completely automating the entire feedback loop. Supporting the human aspect of learning and teaching is therefore an important focus in designing the Dodona user experience. Students can track and remedy potential errors in their code with a built-in graphical debugger, ask online questions directly on their submitted solutions with the integrated Q&A module, and monitor their own progress on learning analytics dashboards. Teachers can customize learning paths with their own learning materials and interactive assignments, share materials with their colleagues, monitor student progress (individually or in groups) using learning analytics dashboards, organize high-stakes tests and exams with automated feedback, assess students with rich feedback using a grading module with support for code reviews, and detect and prevent plagiarism with dedicated and interactive tools.
Pushing the boundaries of Dodona as a virtual co-teacher that becomes gradually smarter at supporting or automating pedagogical tasks is an active area of our research. Observing how Dodona inspired so many colleagues to increasingly bring active and blended learning into their educational practice keeps broadening our vision and is in itself an inspiration for novel features. If reading this paper triggered your curiosity to start exploring the platform, feel free to create an account and request teacher rights that allow you to set up learning paths for your own courses and create your own programming assignments with support for automated assessment. If you are more tech-savvy, you can also develop your own judges to support automated assessment according to your own pedagogical vision. Sharing these learning materials with colleagues is one way to contribute to the platform. In addition, the source code of Dodona is made publicly available on GitHub, where we welcome bug reports and feature requests documented as issues, user experiences shared as discussions, and contributions submitted as pull requests. Over the last six years Dodona has traveled wherever its code has taken us, so wherever you go, may the source be with you.

Figure 4: Distributed content management model that allows seamless integration of custom learning activities (reading activities and programming assignments with support for automated assessment) and judges (frameworks for automated assessment) into Dodona. Content creators manage their content in external git repositories, keep ownership over their content, control who can co-create, and set up webhooks to automatically synchronize any changes with the content as published on Dodona.

Figure 10: Punchcard from the Dodona learning analytics page showing the distribution per weekday and per hour of all 331 734 solutions submitted during the 2021-2022 edition of the course (442 students).

Figure 11: Interactive learning analytics on student submission behavior across programming assignments in the series where (unnested) loops are introduced in the course (2021-2022 edition). Top: Distribution of the number of student submissions per programming assignment. The larger the zone, the more students submitted a particular number of solutions. The black dot indicates the average number of submissions per student. Middle: Distribution of top-level submission statuses per programming assignment. Bottom: Progression over time of the percentage of students that correctly solved each assignment.

Figure 12: Historical overview with milestones from running a Python course supported by automated assessment for eleven years and developing Dodona for six years. Vertical positions of labels indicate when milestones were introduced. We used a Python judge in the Sphere Online Judge (SPOJ) to provide automated assessment during the first five years of the course, which was afterwards migrated to Dodona. From left to right: features, judges and authentication support introduced in SPOJ/Dodona. Numbers to the left of the tracks: major Dodona releases (learning environment) and number of institutions using Dodona (decentralized authentication).

Figure 13: Overview of the number of submitted solutions and active users by academic year. Users were considered active when they submitted at least one solution for a programming assignment during the academic year.

Table 1: Estimated workload to run the 2021-2022 edition of the introductory Python programming course for 442 students with 1 lecturer, 7 teaching assistants and 3 undergraduate students serving as teaching assistants.