Design Effective Assessments

Overview

Assessments can be used both to promote learning and to provide useful insight into student progress toward a particular learning outcome. There are two general categories of assessments that highlight these different purposes: formative and summative.

Formative & Summative Assessments

Formative assessments are for learning. Their aim is to give both students and the instructor a sense of students’ current level of understanding, enabling the instructor to adjust accordingly to meet the emerging needs of the class. Formative assessments are typically “low stakes,” meaning that they are often ungraded or worth few points. In using formative assessments:

Instructors might consider*: What evidence do I have that students in my course learned what I think they learned? Do I need to re-explain that concept differently? Do I need to backtrack two steps and catch everyone up to where we are now? Do I need to change my pedagogical approach to engage this group of students?

Students might consider*: Which aspects of the course material should I spend more or less time on, based on my current understanding? What strategies am I using that are working well or not working well to help me learn? When I do an assignment or task like this again, what do I want to remember to do differently?

What are some examples of formative assessments? There are many ways to relatively quickly gauge student thinking, often referred to as classroom assessment techniques (CATs). These can include strategies such as one-minute papers, muddiest-point prompts, and quick in-class polls.

Summative assessments are used to evaluate cumulative student learning, typically at the end of a unit or course. Summative assessments commonly take the form of exams, final papers, or projects. They are used to determine how well students achieved the expected learning outcomes and to identify instructional areas that may need additional attention. Summative assessments are often “high stakes,” meaning that they carry high point values.

Assessment is not synonymous with grading. Grading is a means of evaluation based on a set of criteria, which may not always directly reflect measures of learning (e.g., attendance, participation). Graded and ungraded assessments can be used as evidence of student learning.

*Tanner, Kimberly D. "Promoting student metacognition." CBE—Life Sciences Education 11.2 (2012): 113-120.

Designing Effective Final Exams

Final exams are a common form of summative assessment, but their quality depends on how well they provide meaningful and defensible evidence of student learning. Assessment scholarship identifies three core qualities of effective exams: validity, reliability, and freedom from bias (Banta & Palomba 2014). Designing with these principles in mind strengthens the credibility and interpretability of exam results.

Ensure Validity: Assess What You Intend to Measure

Validity refers to whether an exam measures the knowledge, skills, and forms of thinking it is intended to assess. A primary mechanism for achieving validity is constructive alignment: the deliberate alignment of learning goals, instructional activities, and assessment tasks (Fawns 2026). Final exams are most defensible when they are intentionally linked to stated course learning outcomes and reflect the types of thinking students have practiced throughout the term.

Prioritize What the Exam Should Represent

Final exams can feel unfocused when they attempt to test “everything.” Before drafting an exam, revisit your course’s learning objectives and identify what is most central: key concepts, disciplinary methods, and habits of thinking students should demonstrate. Prioritizing essential content clarifies what counts as evidence of learning and strengthens the interpretation of exam scores.

Match Question Types to Intended Thinking

Validity also depends on cognitive alignment. If your learning objectives emphasize application, analysis, or evaluation, exam questions should require students to demonstrate those forms of thinking.

Begin by identifying the cognitive action you want students to perform (e.g., recall, apply, analyze, evaluate). Then draft questions that explicitly require that action. Question format alone does not determine cognitive level; for example, multiple-choice items can assess higher-order reasoning when they require interpretation or application, and essays can remain low-level if they simply invite memorized responses (Haladyna & Rodriguez 2013).

Draft Prompts That Measure the Intended Construct

Ambiguous or overly broad prompts can unintentionally shift what the exam measures. Vague verbs or unclear task descriptions may cause students to interpret the question differently, meaning that performance reflects interpretation skill rather than disciplinary understanding.

To strengthen validity:

  • Use explicit task language that communicates the cognitive demand.
  • Replace broad verbs (e.g., “discuss”) with more specific actions (e.g., “identify and justify”).
  • Separate background context from the actual question stem.

These refinements help ensure that responses reflect the intended learning outcome.

Ensure Reliability: Score Responses Consistently

Reliability refers to the consistency of scoring. An exam is reliable when students who demonstrate similar levels of understanding receive similar scores, regardless of who grades the exam or when it is graded. Even a well-aligned (valid) question can produce unreliable results if grading standards are unclear or inconsistently applied. For example, if two graders interpret the same response differently, or if expectations are not clearly defined, scores may reflect interpretation rather than learning.

Define Evaluation Criteria in Advance

Develop grading criteria or a rubric alongside the exam rather than after responses are submitted. Research syntheses on rubrics show that well-designed rubrics can improve scoring reliability for complex, constructed responses (Jönsson & Svingby 2007). Ask:

  • What would a strong response include?
  • What distinguishes satisfactory from excellent performance?
  • Would another grader apply these criteria similarly?

Articulating standards in advance helps ensure that scores reflect consistent application of expectations rather than subjective interpretation. To learn more about rubrics, visit our Assessment Rubrics resource guide.

Ensure Feasible Scope and Clear Structure

Reliability is also influenced by exam design. Questions should be appropriately scoped for the allotted time, and instructions should be clear enough that students understand what is required. Piloting the exam yourself, or asking a colleague to review or take it, can reveal ambiguities that might otherwise lead to inconsistent performance.

An exam can be reliable without being valid (consistently measuring the wrong thing), and it cannot be valid if it is not reliable. Effective assessment requires both alignment with learning goals and consistency in scoring.

Ensure Freedom from Bias: Promote Fairness

Effective exams minimize barriers unrelated to the intended learning. Bias can occur when wording, context, or structure disadvantages certain groups of students or introduces unnecessary difficulty.

To reduce bias:

  • Review questions for unnecessary linguistic complexity.
  • Avoid culturally specific references unrelated to course goals.
  • Ensure students have had opportunities to practice the types of tasks required.
  • Confirm that instructions and formatting are clear and accessible.

Improve Through Reflection

After grading, review patterns in student responses. Were there questions that generated unexpected confusion? Did some items fail to elicit the intended type of thinking? Reflecting on these patterns supports ongoing improvements in validity, reliability, and fairness across future iterations of the course.