Assessment with Analytic Rubrics


Grading Using Analytic Rubrics


Hi! I’m back again. Last time, I discussed how a holistic rubric can be used to score high-value assignments and provide reliable assessment of program outcomes. Today I would like to explore analytic rubrics. Unlike holistic rubrics, these tools allow us as instructors to break complex student tasks into discrete, measurable pieces. Students receive a separate score for each subtask, often paired with a comment or other formative feedback, and the total score serves as a measure of their overall progress toward mastering the project. This is the daily double of assessment; I would like us to go for the trifecta. The missing piece in most cases is an analytical summary and report of the overall findings for the project. This is not as hard as it may at first seem.

Like many instructors, I use analytic rubrics to score many different types of assignments in my courses. The most complex of these is a group research poster project performed over the course of an entire semester in lab. This project clearly maps onto the Ferris State University “Scientific Understanding” student learning outcome. Additionally, it maps onto the program and course outcomes regarding scientific communication and use of the scientific method. Therefore, it represents an excellent opportunity to collect assessment data with broad relevance to what we do in our unit.

Large and complicated projects require detailed instructions and clear scoring guidelines. To this end, I have developed an analytic rubric with nine different measurable parameters. Each is individually described and scored, and the components are completed sequentially during the semester. The scoring criteria and specific instructions for each component are given to the students in the form of a lab handout. Here are some examples of these documents.

  • Scientific Method – This first document explains the overall concepts behind the scientific method and sets the stage for the class’ own experimental design experience.
  • Hypothesis – The first step in performing a scientific experiment is to define a testable hypothesis.
  • References – Each student is required to perform some literature review for the project. These references eventually get culled down to a final list of citations for their project.
  • Protocols – The class must create their own experimental procedures. They must also take sample selection and proper controls into account.
  • Figures and Captions – Following data collection, the class must attempt to illustrate their findings with tables, graphs, drawings, and photographs as they see fit.
  • Summary – The main findings of the project must be summarized.
  • Introduction – Once everything is in place, an interesting and factually accurate introduction must be crafted.
  • Layout – The students must work as a group to insert their work into a poster template for printing and posting in the department.
  • Participation – Student participation in the poster project is graded by both the instructor and their peers.
  • Group score – Individuals earn 80 points for their contribution and up to 20 points as a group score.

The interesting part of this exercise comes in scoring it. I have created an Excel scoring sheet for these rubrics. By simply typing a letter into the corresponding score fields, I can quickly score each student’s contribution while an overall summary of the class performance is collected automatically. Summary statistics for each criterion are calculated, and the class performance for each criterion is plotted in a series of column graphs. These data are suitable for saving as a PDF and submitting to TracDat. With very little extra effort, I have achieved the trifecta of assessment! The students receive lots of formative feedback over the span of a semester. I collect summative scores for each student that are entered into the grade book. More importantly, I collect useful data that map onto course-, program-, and university-level student learning outcomes. All that, and I have not even touched our Course Management System (Blackboard). Next time I will show you how to make Blackboard work for you instead of against you in your assessment efforts.
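My scoring sheet lives in Excel, but the same summarization can be sketched in a few lines of Python for anyone who prefers a script. Everything below (the criterion names, the letter-to-point mapping, and the output file name) is a hypothetical placeholder rather than the actual layout of my spreadsheet.

```python
import pandas as pd

# Letter scores entered for each criterion (rows = students). The criterion
# names and the letter-to-point scale are illustrative placeholders only.
scores = pd.DataFrame({
    "Hypothesis": ["A", "B", "A", "C"],
    "References": ["B", "B", "A", "B"],
    "Protocols":  ["A", "C", "B", "B"],
}, index=["Student 1", "Student 2", "Student 3", "Student 4"])

points = {"A": 10, "B": 8, "C": 6, "D": 4, "F": 0}   # assumed point values
numeric = scores.apply(lambda col: col.map(points)).astype(float)

# Per-criterion summary statistics for the whole class.
print(numeric.agg(["mean", "std", "min", "max"]).round(2))

# Per-student totals for the grade book.
print(numeric.sum(axis=1))

# Class performance by criterion as a column chart (requires matplotlib),
# saved as a PDF suitable for uploading to TracDat.
ax = numeric.mean().plot(kind="bar", title="Class mean score by criterion")
ax.figure.savefig("rubric_summary.pdf")
```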


Assessment Using Holistic Rubrics


Holistic Assessment


Hi – I’m back again! The new semester is now well underway and I am beginning to reach cruising speed. Based upon my rather slow writing speed and my teaching load this semester, I think that three posts per week will be my typical output for now. I plan to add to this blog on M/W/F mornings. We’ll see how that goes…

I have been discussing selected response (objective) assessments for the past few posts. I’d like to completely change gears today. What about high-stakes assessments (as in – you don’t graduate if you don’t pass) that are subjectively given a single score by an evaluator? I have to admit it – this prospect made me very uneasy (how’s that for a euphemism?) at first. Once I saw this type of assessment in action, though, I quickly became a big fan of the approach.

Background

Before joining the faculty at Ferris State University, I was an assistant professor at California State University, Long Beach. Each member of the California State University system is required to assess the writing proficiency of all students graduating from that institution. CSULB elected to create a high-stakes writing exam (the WPE) to measure student writing proficiency. In it, students who have completed 65 credits of coursework are required to write an analytic response to a provided prompt within 75 minutes. These responses need to be read and evaluated by the faculty. CSULB has 35,000 students, so thousands of papers must be scored each year. To simplify this process, a holistic scoring rubric was developed, and I have posted a copy of it here. Each student paper was read by three separate faculty reviewers. Each person scored the paper using the holistic rubric, and the submission received the total of those three scores. A passing paper needed a score of 11 or higher (two fours and a three). In addition, the three reviewer scores were not allowed to vary by more than one point overall. In cases where the scores diverged by more than a point, an expert arbitrator made the final evaluation.
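To make the decision rule concrete, here is a small Python sketch of the scoring logic as I have described it. The function and parameter names are my own shorthand, not part of the official WPE process.

```python
def wpe_result(scores, passing_total=11, max_spread=1):
    """Apply the holistic scoring rule described above to three reader scores.

    A paper passes if the three scores total 11 or more; if the scores spread
    by more than one point, the paper goes to an expert arbitrator instead.
    """
    assert len(scores) == 3, "each paper is read by three faculty reviewers"
    if max(scores) - min(scores) > max_spread:
        return "send to arbitrator"
    return "pass" if sum(scores) >= passing_total else "fail"

print(wpe_result([4, 4, 3]))  # pass (total of 11)
print(wpe_result([4, 3, 3]))  # fail (total of 10)
print(wpe_result([5, 3, 3]))  # send to arbitrator (scores differ by 2)
```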

Advantages

The main advantages of this approach are simplicity, speed, and reliability. The scoring rubric is easy to understand and to communicate to students. The WPE website includes study suggestions and an entire workbook of sample essays and practice prompts for students. With thousands of papers to be read in triplicate, speed was essential. I am not a particularly fast reader, yet I was able to accurately score over 200 papers in about six hours at a Saturday reading session. The accuracy of scoring can be evaluated by examining the inter-rater reliability (how close are the scores of separate raters reading the same paper?). The overall reliability of this process was between 95 and 99% – very high. Only a handful of papers were ever sent to the expert arbitrators due to scoring discrepancies.

Disadvantages

The principal disadvantage of this approach is the discipline and training necessary to ensure that the process is fair and robust. Each of the prompts used for the writing assignments had to be tested to ensure that little or no bias (cultural, gender, etc.) was introduced. In addition, in order to achieve reproducibility in scoring, all readers need to be trained. The first hour of any reading session began with range-finders: papers that typified each of the possible scores on the rubric were selected and given to the readers to score. After rating the papers, all of the scores were tallied on a chalkboard and the relative merits or deficiencies of the papers were discussed. Once everyone agreed upon what a 4 or a 5 looked like (for example), they were given actual papers to grade. Another, smaller set of range-finders was given after lunch to reinforce the standards.

This type of approach is probably not practical for most course-level assessments (though perhaps huge, multi-section General Education classes might consider it…). However, this may be something that we could apply at Ferris State in measuring program outcomes. What do you think? Does it look interesting or impossible to you?


Selected Response (Bloom)


Assessing Critical Thinking Using Selected Response Items

Sorry for missing my self-imposed deadline yesterday! I was moving my youngest daughter down to Kendall College of Art and Design, and it took much more time and effort than I expected. It seems like just last year that I was the freshman on the curb, watching my parents’ reflection recede in the car mirror as they dropped me off at school. It seems very strange to be on the other side of the mirror now.

Well, now to the task at hand. One of the most pointed criticisms of multiple-choice (selected-response) assessments is that they cannot be used to measure higher levels of cognitive ability. I will readily concede that most multiple-choice exams do not assess much beyond the understanding level. The vast majority of the questions that ship in the test banks accompanying college textbooks require little more than simple memorization (remembering-level material), with a few categorization/classification (understanding-level) items tossed in. This need NOT be the case. There is a fairly compelling argument that mid-level cognitive abilities (applying and analyzing) can be handled very well in an MCQ format. A relatively recent article addresses this specifically for the field of biology (my personal interest), and I am providing a link here. I believe that it is essential to include a substantial number of mid-level questions in our assessments, regardless of whether the class is 100-, 200-, 300-, or 400-level. You can refer to my earlier post on leveling for a little more information on my assessment philosophy.

Some people think that MCQ assessments can also be used for upper-level cognitive abilities (evaluating). I’m not so convinced. I know that problem-based learning can be evaluated with selected-response items. However, selecting the most appropriate choice isn’t really the same thing as independently evaluating a problem and its possible solutions. I occasionally try an MCQ at the evaluating level, but I’ve never been too happy with those data. Creating simply cannot be assessed with these tools. Selected-response exams are not a panacea for assessment.

I also use my selected response assessments to get at metacognition in my courses (students thinking about their thinking…). I have created a handout for my students that explains the different types of thinking required in my course and how I plan to measure them. Throwing problem-based learning at students without preparing them for it is unfair and leads to poor results (and upset students). I am attaching a copy of my document here for your comments. Next time, I want to discuss how selected response items can be implemented for course assessment. See you then.


Selected Response (construction)


Constructing Useful Selected Response Items


Today, I would like to briefly introduce some practices that I use to improve the item quality on my multiple-choice assessments. Regardless of the format we use, we must remain mindful of reliability and validity whenever we try to measure student learning. The reliability of an assessment is a measure of scoring reproducibility. If we could repeatedly administer a highly reliable assessment to a class, the students would score at about the same level each time. Well-constructed selected response assessments typically have high reliability. I will discuss how this is calculated and interpreted in a future post. The validity of an assessment is the degree to which the assessment actually measures what we set out to measure. This is where things could potentially get sticky for multiple-choice assessments.

One of the main threats to validity for MCQs is “test-wiseness.” You do not have to look very long on the internet to find suggested strategies for taking multiple-choice exams. Here is a typical one. If we are to maintain validity when using MCQs, we must defeat (or at least mitigate) these strategies. This does not mean that we must write trick questions. Instead, we need to construct our assessment items carefully so that we do not provide unintentional clues that allow students to answer correctly without actually knowing the material (in other words, by guessing).

Most of what I’ve learned about multiple-choice items has come from reading Developing and Validating Multiple-Choice Test Items by Thomas Haladyna. Not exactly a summertime page-turner, but a great resource! Here are some suggestions that I’ve gleaned from it and other resources:

  • Each item should address one of your course outcomes.
    if it is not a part of your outcomes, why are you asking it?
  • Stems should be direct questions or clearly phrased statements.
    for example: War and Peace was written by _____. or: Which of the following authors wrote The Great Gatsby?
  • Avoid grammatical clues in the responses.
    for instance, articles (a/an) or singular/plural agreement that points to one choice
  • Be careful about how the choices are arranged.
    I list them from shortest to longest in length, alphabetically when they are similar in length, and from smallest to largest when numerical.
  • Use a randomized key to avoid creating patterns in the responses
    this prevents unintentional biasing toward any one response – like C.
  • Make sure that all of the possible answers are at least somewhat plausible to an uninformed reader. Unbelievable choices serve no function from an assessment point of view.
  • Do not use “all of the above”
  • Use “none of the above” sparingly
    if you do use this, make sure that it is the correct answer sometimes
  • If negatives or conditions are used, they should be highlighted
    for instance: Which of the following is NOT a prokaryote?

Haladyna suggests that the most efficient multiple-choice items have three possible answers. Most textbook test banks use four or five options per item. He claims (and my own experience bears this out) that most questions have only three functioning responses (those that the class chooses with any regularity). Items with just three choices tend to freak people out (I was initially pretty skeptical too). What about guessing?? Let’s do a little math… If I give a 20-item quiz with three options each, what is the chance that a student could score 70% (14 correct) or better by just blindly guessing? This can be calculated as p = 1 – binomialcdf(20, 1/3, 13) = 8.788 x 10^-4, or about 0.088%. Not too likely – and the odds are even longer when there are more items. I’m not worried.
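If you would like to check that arithmetic yourself, a short SciPy snippet (assuming SciPy is available) reproduces the number:

```python
from scipy.stats import binom

# P(X >= 14) for X ~ Binomial(n=20, p=1/3): the probability of guessing 14 or
# more of 20 three-option items correctly. sf(13) returns P(X > 13).
p = binom.sf(13, 20, 1/3)
print(f"{p:.3e}")   # ~8.788e-04, i.e. about 0.088%
```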

One thing that I did fret about was adequately randomizing my answer keys. Like many people, I tend to favor B or C over A and D (edge aversion, I suppose). To overcome this, I have created a spreadsheet that will create random keys of any length up to 1,000 questions. This one generates four-option keys (I still use these). I may share variants that generate 2-, 3-, and 5-option keys later if you ask.
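My version lives in a spreadsheet, but the same idea takes only a few lines of Python. This is a rough stand-in, not the actual formula my spreadsheet uses:

```python
import random

def random_key(n_items, options="ABCD", seed=None):
    """Generate a random answer key of n_items drawn uniformly from options.

    Swap in "ABC" or "ABCDE" for 3- or 5-option exams.
    """
    rng = random.Random(seed)
    return [rng.choice(options) for _ in range(n_items)]

key = random_key(20, seed=42)
print(key)
# Tally how often each option is the key, to check for unintentional bias.
print({opt: key.count(opt) for opt in "ABCD"})
```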

The really big issue with selected response items is whether or not they can be used to test critical thinking. I say yes! (up to a point) and that will be the subject of tomorrow’s post.


Selected Response


Selected Response Assessments


Over the next few weeks, I plan to spend some time covering the basics of the different sorts of assessment tools that are available for our courses. Each of these techniques has its own unique strengths and weaknesses. I do not view any one of them as the “ideal” method for assessment. They are, after all, just tools that we use in trying to measure student learning, instructional effectiveness, and programmatic efficacy. Like any other tool, they can be handled skillfully or they can be poorly wielded. The fact that some amateur carpenters might injure themselves while using a nail gun does not make the nailer a poor tool. A poorly conceived assessment strategy is similarly unproductive (though typically less physically painful). A master carpenter will use a variety of tools (each carefully chosen and carefully applied) to complete a task. Likewise, when we intelligently select tools from our assessment toolbox and apply them in our courses, we can greatly enhance our students’ learning. This week, I would like to begin with a much-maligned assessment tool: the selected response (multiple-choice) exam.

Description

Selected-response assessments are composed of a series of questions or statements (items) that the students must answer. The question or statement for each item is usually called the stem. The students pick from a variety of potential answers – the correct response and one or more incorrect choices (usually called the distractors). The number of distractors may vary. However, most instructors use between one and four.
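In code form, an item boils down to a stem, a correct response, and a handful of distractors. Here is a minimal sketch of that structure (the class and field names are my own illustration, not any standard format):

```python
from dataclasses import dataclass
import random

@dataclass
class Item:
    """A single selected-response item: stem, correct answer, distractors."""
    stem: str
    answer: str
    distractors: list  # typically one to four plausible but incorrect choices

    def shuffled_choices(self, rng=random):
        """Return the answer and distractors in randomized order."""
        choices = [self.answer] + list(self.distractors)
        rng.shuffle(choices)
        return choices

item = Item(
    stem="Which of the following authors wrote The Great Gatsby?",
    answer="F. Scott Fitzgerald",
    distractors=["Ernest Hemingway", "John Steinbeck", "Leo Tolstoy"],
)
print(item.stem)
print(item.shuffled_choices())
```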

Advantages

Selected-response assessments have several important advantages (which is why this type of assessment is still so commonplace). The following is a brief list of some of the chief advantages of this sort of assessment:

  • They are objectively scored (either correct or incorrect)
  • They are easy to automate (Scantron, clickers, and online exams can be used)
  • They are quick to answer (more questions can be asked on a single exam)
  • They are easy to analyze (a variety of statistical tests can be applied to MC items; see the sketch after this list)
  • They are relatively easy to construct and modify over time
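That last point about analysis deserves a quick illustration. Two of the most common item statistics are the difficulty index (the proportion of students answering an item correctly) and a discrimination index (here, the point-biserial correlation between each item and the total score). The sketch below uses a tiny made-up response matrix just to show the calculations:

```python
import numpy as np

# Rows = students, columns = items; 1 = correct, 0 = incorrect (invented data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

totals = responses.sum(axis=1)

# Item difficulty: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: point-biserial correlation between each item score and
# the total score (the simple total, including the item itself).
discrimination = np.array([
    np.corrcoef(responses[:, i], totals)[0, 1]
    for i in range(responses.shape[1])
])

print("difficulty:    ", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
```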

Disadvantages

Selected response assessments have gotten quite a bit of bad press over the past couple of decades. Much of this criticism was well deserved. However, the real weaknesses that have been pointed out reflect poor assessment implementation more than a fundamental flaw in the assessment format. Some weaknesses that are often attributed to selected-response assessments include:

  • Multiple-choice questions do not test higher levels of cognition (applying, analyzing, evaluating, and creating)
  • Multiple-choice questions favor students that are test-wise
  • Multiple-choice questions reward guessing rather than knowing
  • Multiple-choice questions are not authentic assessments (they do not reflect real-world situations)
  • Multiple-choice questions give very limited feedback (formative assessment) to the students

Over the next few days, I plan to share with you some of my thoughts and experiences with this sort of assessment. I think that I have some very practical and useful suggestions to leverage this sort of assignment in many different (but not all) courses. Tomorrow, I will discuss how to construct MCQ for maximum effect. See you then.


Poll 1


How do you assess student learning in your courses?

I’m sort of curious; what types of assignments do y’all use in your courses to assess student learning? I plan to discuss some of the advantages and disadvantages of different approaches over the next few weeks. Why do you use the assignment types that you have chosen? Take a moment to vote on this poll (you can vote for more than one type of assignment if you like) and please feel free to post comments.


Leveling


What level(s) are you assessing at?


Appropriate leveling is an important consideration when developing learning outcomes and assessment strategies for our courses. I think that we probably all agree that a senior-level capstone biology course ought to require a different level of cognitive effort than a freshman introductory biology course. The tricky part is moving from that nice safe generic statement to a more specific and actionable one. What cognitive abilities are appropriate at different course levels? How can our students demonstrate these abilities and how can we best measure them? What do we mean by cognitive levels anyways?

For over five decades, Bloom’s taxonomy of the cognitive learning domain has been used to define different levels of critical thinking. I personally prefer the recent (relatively speaking) modification of this system described in A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives by Anderson and Krathwohl. The revised taxonomy is actually laid out in a two-dimensional matrix that is defined by a knowledge dimension (what is known?) and a cognitive dimension (at what level of abstraction is it known?). They also provide a variety of specific examples to aid in proper categorization of items within our courses. One useful interactive website that illustrates this new taxonomy can be found at the Center for Excellence in Learning and Teaching at Iowa State University.

Well, let’s suppose that we can agree that different courses ought to assess at different levels, and that we can (for the moment at least) agree to use Anderson and Krathwohl’s taxonomy to define educational objectives. How much emphasis should courses at different levels place on the various levels of cognitive ability? I will throw my thoughts out and see if anyone is listening… Let’s keep things a little simpler and reduce the six cognitive levels to three (low, medium, and high, as shown in the graphic for this post). Here is one suggestion for the balance at different course levels (a short sketch after the list shows how an exam blueprint might be checked against these targets):

  • 100-level courses – 60% low, 30% medium, 10% high
  • 200-level courses – 40% low, 40% medium, 20% high
  • 300-level courses – 25% low, 45% medium, 30% high
  • 400-level courses – 10% low, 50% medium, 40% high
  • Graduate course – 0% low, 50% medium, 50% high
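
As a concrete illustration, here is a small sketch that tallies a hypothetical exam blueprint by cognitive level and compares it to the 200-level targets above. The exam data are invented for the example:

```python
from collections import Counter

# Target mix for a 200-level course, taken from the list above.
targets = {"low": 0.40, "medium": 0.40, "high": 0.20}

# Hypothetical 20-item exam, each item tagged with the level it assesses.
exam_items = ["low"] * 11 + ["medium"] * 7 + ["high"] * 2

counts = Counter(exam_items)
n = len(exam_items)

for level, target in targets.items():
    actual = counts[level] / n
    print(f"{level:>6}: target {target:.0%}, actual {actual:.0%}")
# The output shows this exam leans too heavily on low-level items (55% vs. 40%).
```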

Admittedly, this is a somewhat arbitrary system, but here is my rationale. Lower-division courses are primarily concerned with introducing a body of knowledge, and this will be most directly assessed at the lower end of the cognitive spectrum. The mid-level courses (200 and 300) tend to build upon this knowledge base and require more problem solving and critical analysis. These are best assessed in the middle of the cognitive spectrum. The upper-division courses include capstone experiences and ought to focus primarily on the upper end of the cognitive pyramid. What do you think of this scheme?

This is all well and good. However, the devil is in the details. I plan to spend the next several weeks talking about some of the specifics: types of assignments, types of assessments, strengths and weaknesses, etc. After all – this blog is supposed to be about assessment in action. See you then.
