
Feb 24, 2017

The psychometrics of simplicity

I am not a psychometrician, so my friends who actually are will probably laugh at me. It’s OK, bring it on. Lack of expertise has never stopped anyone from expressing an opinion. I just want to make a case for simple instruments over complex ones, in the context of teacher preparation.

Please take a look at an observation form I helped design with my colleagues at the University of Northern Colorado. It was years ago, and I still like it. And take another look at the nine-page, 32-item form the COE at Sac State currently uses. It is very good, clearly worked on for years, but it is still too long. And just for kicks, here is the 77-page document describing the Danielson framework, perhaps the most dominant teacher evaluation platform in the country.

The longer, more detailed rubrics are, in theory, more reliable. They do not just name a domain, but contain specific behaviors or other observable indicators that are associated with skills or competencies. Students either answer questions or they do not. A teacher either has stated learning objectives or has not, etc. The short rubrics I like tend to be holistic, more subjective, and more difficult to justify.

However, the context of use is everything. These are not laboratory instruments. Supervisors and cooperating teachers use them in the field, where they observe someone’s lesson. In those situations you have to keep your eyes open for tiny nuances of interaction, and at the same time work through a long checklist. It’s very basic: the observer runs out of brain resources.

What we have noticed for ages is that data coming from longer rubrics tends to be flat and uninteresting. If you have a four-point scale, everyone will be at about 3 in the beginning and 3.5 at the end of student teaching. This is because human beings are unable to make many evaluative decisions over short periods of time. Supervisors and cooperating teachers tend to make up their minds holistically about the way someone is teaching, and then simply justify their overall impression through the rubric. Most of them are experienced, wise people. What sets an expert apart from a novice is exactly the ability to make non-analytical, synthetic, holistic judgments. One can debate whether their image of good teaching is accurate, but that is how they form opinions. Novices go through checklists, because they are not yet able to synthesize quickly. So we force experts to behave like novices.

The long rubric is often justified on the premise that an evaluation rubric is a pedagogical instrument, meant to remind pre-service teachers of what is important. But I am not sure the argument works. We should encourage our novice teachers to develop the ability to think holistically, to synthesize knowledge. The checklists create the false impression that if you only do all those things, you will teach well. Well, either the checklist has to be a hundred pages long, or it should not exist. There are just too many possible indicators. An isolated action does not have meaning outside of its relational context. Yes, as a rule one should not give long lectures to sixth graders. But man, I have seen such brilliant exceptions. A hostile classroom atmosphere is not always the fault of the teacher, and therefore not a reflection on his or her skills. Etc., etc., etc. For a hundred-page checklist we can provide a thousand-page list of exceptions.

Another consideration is economic. For something like a Danielson-inspired instrument to work, one needs significant resources committed to constant training and retraining of evaluators to ensure inter-rater reliability. In a large teacher preparation program like ours, that is almost impossible to do. Supervisors are many and change often; cooperating teachers are a multitude, and they are busy and change constantly. Whatever precious resources we have are better spent on PD at higher levels, for example on co-teaching models or on cognitive coaching. Training everyone to use the rubrics correctly feels like a waste of time.

With a short holistic rubric, we embrace the strength of holistic assessment and avoid the downside of indicator-rich instruments. One can easily keep in mind the four or five main domains and give an honest expert opinion on how a teacher candidate is doing. The shorter rubrics also leave more time for qualitative feedback, which is always more important. You have the time to write “pay attention to how you move around the classroom; it may be distracting the children” or “some children did not understand the assignment,” or something like this, because you don’t have to run through 45 indicators. With the long instruments, we also do not observe a good number of items at all, because not all of them are evident in every lesson. But we feel compelled to enter some random number, so the cell is not empty.



In my opinion, it is much better to have good subjective data than poor objective data. This is especially true because the indicator-based, objective, detailed rubrics are not really validated by research, contrary to what Danielson and others claim. In other words, we do not really know that if a student teacher has written, for example, the unit learning outcomes as “related to ‘big ideas’ of the discipline,” it will really help kids learn. We may have a professional consensus about it, but we do not know it for a fact. The studies on value-added measures of teaching effectiveness are in their infancy. And even theoretically, we are unable to disaggregate teacher behavior into small indicators and show the relative weight of, say, communication style vs. mastery of material vs. careful planning of instruction. Underneath all the sophistication is the same gut feeling that we acquire with experience. OK, it is a collective gut feeling, but professionals have been known to be wrong collectively. Just to remind the hard-line psychometricians: the semantic hypothesis is still a hypothesis.

Another practical consideration in favor of short holistic rubrics is this: teacher preparation programs do not have time to look at all the data we generate. The more items you have, the more work it takes to process and interpret the data. The fewer the data points, the easier they are to comprehend. Data usually supports or contradicts suspicions we already have; it cannot do much more with the technologies we have today. When AI and neural network technology mature, let’s talk again. For now, we may be better off admitting that we use very limited data collection techniques and that our dreams of a data-informed continuous improvement process may be a bit premature. So we need to bring expectations to where the technology is; otherwise we produce a lot of needless work and unprocessed data.
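Just to make the scale concrete, here is a rough back-of-envelope sketch in Python. The candidate and observation counts are invented for illustration only, not our actual numbers.

# Hypothetical back-of-envelope: rubric cells a program would have to enter and review.
# All figures below are made up for illustration only.
candidates = 300      # hypothetical number of teacher candidates
observations = 4      # hypothetical formal observations per candidate

for label, items in [("32-item rubric", 32), ("5-domain holistic rubric", 5)]:
    cells = candidates * observations * items
    print(f"{label}: {cells:,} cells to fill, clean, and interpret")

With those made-up numbers, the long rubric produces 38,400 cells against 6,000 for the holistic one, before anyone has read a single comment.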
