ConMap: Investigating New Computer-Based Approaches to Assessing Conceptual Knowledge Structure in Physics

by Ian D Beatty
Joint Winter Meeting of the American Association of Physics Teachers and the American Astronomical Society (2000)


ConMap:
Investigating New Computer-Based Approaches
to Assessing Conceptual Knowledge Structure
in Physics
A Dissertation Presented
by
Ian D. Beatty
Submitted to the Graduate School of the
University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of
DOCTOR OF PHILOSOPHY
May 2000
Physics
© Copyright by Ian D. Beatty 2000
All Rights Reserved
CONMAP:
INVESTIGATING NEW COMPUTER-BASED APPROACHES
TO ASSESSING CONCEPTUAL KNOWLEDGE STRUCTURE
IN PHYSICS
A Dissertation Presented
by
IAN D. BEATTY
Approved as to style and content by:
__________________________________________
William J. Gerace, Chair
__________________________________________
Robert A. Guyer, Member
__________________________________________
Robert J. Dufresne, Member
__________________________________________
Allan Feldman, Member
__________________________________________
John F. Donoghue, Department Head
Physics and Astronomy
Dedicated to
Lee David Beatty
1925 — 1998
Loyal, brave, and humble when it mattered most.
Acknowledgements
I am deeply grateful for the assistance of my committee during the researching
and writing of this dissertation. Bob Guyer had an apparently infinite amount of
time available to discuss any aspect of the work, despite his considerable other
obligations. Without Bill Gerace’s conviction that the research in question did
have merit and that I was capable of carrying it out, I would have quit many
times over.
Abstract
CONMAP:
INVESTIGATING NEW COMPUTER-BASED APPROACHES
TO ASSESSING CONCEPTUAL KNOWLEDGE STRUCTURE
IN PHYSICS
MAY 2000
IAN D. BEATTY, B.S., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST
Directed by: Professor William J. Gerace
There is a growing consensus among educational researchers that traditional
problem-based assessments are not effective tools for diagnosing a student’s
knowledge state and for guiding pedagogical intervention, and that new tools
grounded in the results of cognitive science research are needed. The ConMap
(“Conceptual Mapping”) project, described in this dissertation, proposed and
investigated some novel methods for assessing the conceptual knowledge
structure of physics students.
A set of brief computer-administered tasks for eliciting students’ conceptual
associations was designed. The basic approach of the tasks was to elicit
spontaneous term associations from subjects by presenting them with a prompt
term, or problem, or topic area, and having them type a set of response terms.
Each response was recorded along with the time spent thinking of and typing it.
Several studies were conducted in which data was collected on introductory
physics students’ performance on the tasks. A detailed statistical description of
the data was compiled. Phenomenological characterization of the data
(description and statistical summary of observed patterns) provided insight into
the way students respond to the tasks, and discovered some notable features to
guide modeling efforts. Possible correlations were investigated, some among
different aspects of the ConMap data, others between aspects of the data and
students’ in-course exam scores. Several correlations were found which suggest
that the ConMap tasks can successfully reveal information about students’
knowledge structuring and level of expertise. Similarity was observed between
data from one of the tasks and results from a traditional concept map task.
Two rudimentary quantitative models for the temporal aspects of student
performance on one of the tasks were constructed, one based on random
probability distributions and the other on a detailed deterministic representation
of conceptual knowledge structure. Both models were reasonably successful at
approximating the statistical behavior of a typical student’s data.
List of Figures
Figure 1.1: Graphic depiction of a physics expert’s declarative knowledge store, showing conceptual, operational/procedural, and problem-state knowledge (Gerace, Leonard et al. 1997).
Figure 1.2: Example of a (partial) semantic network for concepts related to “friction”.
Figure 1.3: Interrelationships between the design of an experimental probe, the data gathered and the model constructed.
Figure 2.1: Term entry dialog box for the FTE task.
Figure 2.2: Term entry dialog for the TPTE task.
Figure 2.3: Term entry dialog for the PPTE task.
Figure 2.4: Prompt term and timer dialog for HDCM task.
Figure 3.1: Start time differences vs. term index for study p151s99, subject 01 on task J1_FTE.
Figure 3.2: Thinking time vs. term index for study p151s99, subject 01 on task J1_FTE.
Figure 3.3: Typing time vs. term index for study p151s99, subject 01 on task J1_FTE.
Figure 3.4: Histogram of the natural logarithms of the start time differences for study p151s99, subject 01 on task J1_FTE.
Figure 3.5: Histogram of the natural logarithms of the thinking times for study p151s99, subject 01 on task J1_FTE.
Figure 3.6: Histogram of the natural logarithms of the typing times for study p151s99, subject 01 on task J1_FTE.
Figure 3.7: Quantile plot of thinking time logarithms for study p151s99, subject 01 on task J1_FTE.
Figure 3.8: Histogram of logarithms of thinking times for subject p151s99-01 on the J1-FTE task, with best-fit curve for normal (Gaussian) probability density function, normalized for total number of counts.
Figure 3.9: Quantile plot of logarithms of thinking times for subject p152f97-14 on FTE task, with best-fit curve for normal (Gaussian) cumulative distribution function.
Figure 3.10: Histogram of logarithms of thinking times for subject p152f97-14 on FTE task, with best-fit curve for normal (Gaussian) probability density function, normalized for total number of counts.
Figure 3.30: Same as Figure 3.28, for subject p151s99-11.
Figure 3.31: Success rate vs. threshold time for jump/non-jump prediction, for subject p151s99-01 on task J1_FTE (174 terms entered).
Figure 3.32: Same as Figure 3.31, for subject p151s99-11 (67 terms entered).
Figure 3.33: Term thinking time vs. term start time (relative to beginning of task) for subject p151s99-01 on task J1_FTE. Symbols indicate expert’s classification of terms (see text).
Figure 3.34: Subject FTE jump rate vs. Exam 2 score for p152f97 study.
Figure 3.35: HDCM node counts vs. subject’s course exam performance, for task H2_HDCM of p151s99 study. The best-fit line and coefficient of correlation are indicated.
Figure 3.36: HDCM link counts vs. subject’s course exam performance, for task H2_HDCM of p151s99 study. The best-fit line and coefficient of correlation are indicated.
Figure 3.37: Ratio of link count to node count vs. subject’s course exam performance, for task H2_HDCM of p151s99 study. The best-fit line and coefficient of correlation are indicated.
Figure 3.38: Histogram of response counts for all TPTE tasks/prompts in the p151s99 study. Total number of counts was 832.
Figure 3.39: Mean TPTE response list term counts by subject (p151s99 study). Error bars indicate the standard deviation of the set of term counts.
Figure 3.40: Mean number of response terms for all TPTE prompts vs. total of course exam scores, by subject for p151s99 study. Numbers for data points indicate subject. Coefficient of correlation r = 0.313, where 0.13 is the significance threshold for 16 data points.
Figure 3.41: Mean number of response terms by session, averaged across subjects and prompts. Error bars indicate standard deviations. (Session 1 = “Session A”, etc.)
Figure 3.42: TPTE similarities by subject. Dashes represent similarity values for one response list; dots represent subjects’ mean similarity values, with error bars indicating the associated standard error.
Figure 3.43: Subjects’ mean TPTE response similarities vs. overall exam performance. The data markers are numbers indicating which subject is represented. Error bars indicate the associated standard errors. The best-fit line from which the coefficient of correlation (Pearson’s r-value) was calculated is indicated.
Figure 3.44: Subjects’ median TPTE response similarities vs. overall exam performance. The data markers are numbers indicating which subject is represented. The best-fit line used for calculating the coefficient of correlation (Pearson’s r-value) is indicated.
Figure 3.45: Similarity value vs. prompt, for task J2_TPTE of study p151s99. Blue markers indicate values for individual subjects; red points indicate mean across subjects, with error bars indicating standard error of the mean. Table 3.13 gives the mapping between prompt number and prompt term.
Figure 3.46: TPTE response similarities vs. session number. Dashes represent similarities for individual response lists. Dots with error bars represent the mean of all similarities for the session, with standard error. The best-fit line is indicated.
Figure 3.47: P151s99 B1_TPTE scores for “force” vs. course exam performance. The data point markers are numbers indicating the subject represented by the point. The best-fit line is shown.
Figure 3.48: Same as Figure 3.47, but for task C2_TPTE.
Figure 3.49: Same as Figure 3.47, but for task J2_TPTE.
Figure 3.50: Same as Figure 3.47, but for the average of the three session scores for each subject.
Figure 3.51: Histogram of response counts for all PPTE tasks and prompts in the p151s99 study. Total number of counts was 639.
Figure 3.52: Mean PPTE response counts by subject (p151s99 study). Error bars indicate the standard deviation of the set of term counts.
Figure 3.53: Subjects’ mean PPTE response count against mean TPTE response count, with standard errors in the means (not standard deviations) indicated, for all prompts and sessions of the p151s99 study.
Figure 3.54: Mean number of response terms by session, averaged across subjects and prompts. Error bars indicate standard deviations. (Session 1 = “Session A”, etc.)
Figure 4.1: Timeline of FTE entries for study p151s99, subject 01 on task J1_FTE.
Figure 4.2: Timeline for synthetic data set generated from random distribution model, with model parameters taken to match J1_FTE data of subject p151s99-01.
Figure 4.3: Same as Figure 4.2, with a different seed to the random number generator.
Figure 4.4: Comparison of two candidate recall functions with the recall function derived for a log-normal distribution of thinking times. Curves for the candidate functions are the result of a chi-squared fit, with best-fit parameters shown.
Figure 4.5: Probability distribution function r(z) for logarithms of thinking times generated by power-law recall function, for a range of values of the parameter η and for β = 1.
Figure 4.6: Thinking time quantile plot for model-generated data, using N = 300, γ = 75, α = 15, and β = 1. The best-fit normal (Gaussian) CDF to the thinking time logarithms is shown.
Figure 4.7: Thinking time quantile plot for subject p151s99-01 on task J1_FTE, with best-fit normal (Gaussian) CDF.
Figure 4.8: Histogram of logarithms of thinking times for the model data displayed in Figure 4.6, with PDF curve for the best-fit normal distribution.
Figure 4.9: Histogram of logarithms of thinking times for the subject data displayed in Figure 4.7, with PDF curve for the best-fit normal distribution.
Figure 4.10: Timeline for model-generated data of Figure 4.6.
Figure 4.11: Timeline for subject p151s99-01 data on task J1_FTE.
Figure 4.12: Term entry rate vs. number of terms entered for model-generated data of Figure 4.6.
Figure 4.13: Term entry rate vs. number of terms entered for subject p151s99-01 on task J1_FTE.
[...]
implicit in standard test theory… is incompatible with the view rapidly emerging
from cognitive and educational psychology. Learners increase their competence
not only by simply accumulating new facts and skills, but by reconfiguring their
knowledge structures, by automating procedures and chunking information to
reduce memory loads, and by developing strategies and models that tell them
when and how facts and skills are relevant. The types of observations and the
patterns in data that reflect the ways that students think, perform, and learn
cannot be accommodated by traditional models and methods.
Furthermore, traditional problem-based exams provide little information
about why a particular student failed on the particular problems he or she did not
get right, and even less about what specific pedagogic interventions the
instructor should employ to help the student resolve their difficulties. There are
many reasons why a student might fail to solve a problem correctly. Among
them:
• Failure to interpret the problem situation as the assessor intended;
• Insufficient or incorrect physical intuition to understand what is happening in the problem situation;
• Ignorance of the necessary principle;
• Failure to recognize the correct principle;
• Conceptual mistake during application of the correct principle;
• Cognitive overload: general confusion and failure to keep track of enough information and lines of thought;
• Algebraic error during calculations;
• Numerical error or calculator keypress error;
• Error determining units or powers of ten;
• Failure to answer the precise question being asked.
If a student’s written solutions are hand-graded, it might be possible to identify
the point at which the student went awry. Even then, the reason for the mistake
— the underlying misconception, missing piece of information or insight, etc. —
can only be guessed at. This is complicated by the fact that students frequently
make “careless errors” caused by the failure to apply knowledge they have,
rather than by missing knowledge.
Expert | Novice
Store of domain-specific knowledge | Sparse knowledge set
Knowledge richly interconnected | Knowledge mostly disconnected and amorphous
Knowledge hierarchically structured | Knowledge stored chronologically
Integrated multiple representations | Poorly formed and unrelated representations
Good recall | Poor recall
Table 1.1: Summary of the main differences between experts’ and novices’
declarative knowledge characteristics, from expert-novice problem-solving
studies (Gerace, Leonard et al. 1997).
1.2.2. Research on Cognitive Modeling
A variety of knowledge models have been constructed by the cognitive
science community, for such domains as physics, computer programming, chess,
land navigation, maintenance of aircraft hydraulic systems, and electronic circuit
design (Chipman, Nichols et al. 1995). These models tend to fall into two distinct
categories: network models of declarative knowledge and rule-based models of
procedural knowledge.
Semantic networks are the classic network model of knowledge (Bara 1995),
originally developed for the analysis of natural language. A semantic network
consists of a set of nodes interconnected by labeled, oriented links. The nodes
typically represent concepts. The links represent relationships between the
nodes, with a label describing the nature of the relationship. The links are
oriented so that asymmetric relationships may be described. Figure 1.2 displays
an example of a small semantic network.
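The structure just described (concept nodes joined by labeled, oriented links) can be sketched in a few lines of Python. This is only an illustration, not a representation used in the ConMap work: the class name SemanticNetwork and its methods are hypothetical, and apart from “Newton’s laws describe force”, which the text attributes to Figure 1.2, the example links are invented.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    """Concept nodes joined by labeled, directed links (hypothetical sketch)."""
    # Each link is stored as a (source, label, target) triple.
    links: set = field(default_factory=set)

    def add_link(self, source, label, target):
        # The triple is ordered, so asymmetric relationships can be expressed.
        self.links.add((source, label, target))

    def nodes(self):
        # Every concept that appears at either end of some link.
        return {n for s, _, t in self.links for n in (s, t)}

    def outgoing(self, node):
        # Labeled links that start at the given concept.
        return sorted((label, t) for s, label, t in self.links if s == node)


net = SemanticNetwork()
net.add_link("Newton's laws", "describe", "force")  # statement quoted in the text
net.add_link("friction", "is a kind of", "force")   # invented for illustration
net.add_link("friction", "opposes", "motion")       # invented for illustration

print(net.outgoing("friction"))
```

Because the label is free text, nothing in such a structure limits how much conceptual content a single link can hide, which is the generality problem discussed next.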
The primary difficulty with semantic networks as knowledge models is that
they are tremendously general: without specific rules to limit the labels allowed
on links, almost any degree of conceptual complexity can be buried within a
creative choice of label. Consider, for example, the number and subtlety of
concepts subsumed in the simple declarative statement “Newton’s laws describe
force” from Figure 1.2. Nevertheless, semantic networks do capture the general
idea that declarative knowledge, or at least the conceptual part of it, gets its
meaning from the interrelationships between conceptual elements.
Artificial neural networks (ANNs) form a class of models superficially similar
to semantic networks, but with a very different origin and intent (McClelland
and Rumelhart 1986; Hertz, Krogh et al. 1991; Watkin and Rau 1993). Inspired by
the neurobiological functioning of the brain, ANNs were developed as an
[...]
models of larger domains, such as physics problem solving, seems to be the
numerical complexity of the calculations required for the Bayesian inference.
Much of the research surrounding such assessment methods aims to find
algorithms for reducing the complexity of the calculations to manageable levels,
and as affordable computers become ever more powerful, this limitation should
recede.
1.3. Meet the Need
It has been argued that physics education needs new assessment methods,
and that new assessments must be based on better models of physics learning
and expertise than are currently available. It is also true that the development of
better models depends upon data derived from appropriate assessment
techniques. This is the same “chicken and egg” problem as confronts physics
research: theory both depends upon and guides experiment, as depicted in
Figure 1.3.
[Figure 1.3 diagram: elements labeled “probe design”, “data”, and “model”, linked by arrows labeled “suggests”, “interprets”, “guides”, “provides”, and “validates”.]
Figure 1.3: Interrelationships between the design of an experimental probe, the
data gathered and the model constructed.
In mature research fields, focusing on one stage at a time — designing a
better experimental probe, for example, or revising a model — is possible and
generally recommended. In a newly emerging field, however, wherein
researchers are still struggling to determine what the measurable and modelable
quantities might be and how they ought to be represented, the three stages
cannot be separated so cleanly. One’s model of the system being studied, as
preliminary and vague as it might be, guides the design of an experimental
[...]
2. ConMap Design
Section 2.1 of this chapter presents the design of the ConMap tasks, the
experimental “probes” being investigated for their utility for assessing physics
students’ knowledge structures. Section 2.2 describes the ConMap studies, in
which the designed tasks were presented to physics students and task results
were gathered. Section 2.3 presents some reflections on the task and study
designs, based on the administrators’ experiences conducting the studies (not on
analysis of the resulting data).
2.1. ConMap Tasks
2.1.1. Design Objectives
As stated in Chapter 1, the intent of the ConMap project is to investigate the
utility of a particular set of proposed assessment tools — brief, computer-
administered tasks for eliciting spontaneous conceptual associations — for
probing the quality and extent of a physics student’s conceptual knowledge
structure (CKS) in an introductory physics domain.
Ultimately, an ideal assessment task would provide a complete “map” of a
student’s knowledge structure, indicating connections that ought to be present
for expert-like knowledge but were not, and connections that existed but should
not (misconceptions). With such a student-specific map, an instructor could
design specific pedagogic activities to benefit that student. More modestly and
immediately, one might hope to develop assessment measures which
characterize general qualities of a student’s knowledge structure: how extensive
it is, whether it is richly or sparsely interconnected, whether the organizational
scheme is haphazard or systematic, and so on. The ConMap tasks were designed
with both the ultimate and immediate goals in mind.
Why elicit conceptual associations? As discussed in Subsection 1.2.1, it is
believed that a major component of domain expertise in physics is the possession
of a richly interconnected, hierarchically organized network of associated
concepts. A primary goal of physics instruction is to facilitate a learner’s
development of such a conceptual “understanding”. ConMap tasks were
therefore designed to elicit, as directly as possible, information on a subject’s
possession of concepts and inter-concept associations.
Why elicit spontaneous conceptual associations? It was desired that the tasks
probe conceptual associations that were “readily accessible” to the subject, in the
belief that such associations represent the automated knowledge inherent in the
[...]
• second
• impulse-momentum theorem
• problem-solving
Statements like “energy is conserved in an elastic collision” were not
considered to be terms, but rather propositions involving multiple terms and
their relationship. “Conservation of energy”, on the other hand, would be
accepted as a term, since it serves as a name for a physics concept. In practice, the
line between single-concept terms and compound statements of relationship is
not well-defined, and subjects frequently wandered dismayingly far over it.
2.1.3. The Specific Tasks
The ConMap tasks were developed with the intention of probing the set of
terms and inter-term associations that constitute the conceptual portion of a
subject’s declarative knowledge store for a domain. Each task was intended to
elicit a somewhat different aspect of that knowledge store. The following
subsubsections describe the tasks that were developed and investigated.
2.1.3.1. Free Term Entry (FTE)
For the Free Term Entry (FTE) task, subjects are given a general topic area like
“introductory mechanics” or “the material covered in Physics 151”. They are
asked to think of terms that they associate with this topic area, spontaneously
and without strategy, and to type these terms into a dialog box (shown in
Figure 2.1) as the terms come to mind. When each term is completed, the subject
presses the “return” key on the keyboard (equivalent to clicking on the “Enter”
button in the dialog box), and the typed term disappears, leaving the typing box
empty and ready for a new term.
Figure 2.1: Term entry dialog box for the FTE task.
The data gathered consist of the list of terms in the order they were entered,
together with the time at which typing began for each term (the moment at
which the first character was typed into an empty field), and the time at which
each term was entered (the moment at which the return key or “Enter” button
was pressed). The task runs for a specified total duration, typically 20 to 45
minutes, before terminating.
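The record described above can be pictured as a small data structure. The following Python sketch is illustrative only. The names TermEntry, typing_times, and thinking_times are hypothetical, and the two derived quantities rest on assumed definitions: typing time is taken as entry time minus start time, and thinking time as start time minus the previous term's entry time (or minus the start of the task for the first term). The sample data are invented.

```python
from dataclasses import dataclass


@dataclass
class TermEntry:
    term: str
    start_time: float  # seconds from task start to the first character typed
    enter_time: float  # seconds from task start to the return key / "Enter" press


def typing_times(entries):
    # Assumed definition: time between first keystroke and entry of the term.
    return [e.enter_time - e.start_time for e in entries]


def thinking_times(entries):
    # Assumed definition: time between the previous term's entry
    # (or the start of the task) and the first keystroke of this term.
    times = []
    previous_enter = 0.0
    for e in entries:
        times.append(e.start_time - previous_enter)
        previous_enter = e.enter_time
    return times


# Invented sample data, in the order the terms were entered.
entries = [
    TermEntry("force", 2.0, 3.5),
    TermEntry("mass", 5.0, 6.0),
    TermEntry("acceleration", 6.5, 9.0),
]

print(thinking_times(entries))
print(typing_times(entries))
```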
This task was intended to explore the space of terms constituting a subject’s
active vocabulary of concept-describing terms for the topic area, without
influencing the responses by providing terms through external prompting. The
result was conceptualized as a kind of “random walk” through the space of a
subject’s active vocabulary. It was hoped that the duration of pauses between
term entries, and the grouping of term entries into clusters separated by longer
fallow periods, might reveal some information about what terms a subject
associates closely. Since the list of terms and times comprising an FTE data set
forms a one-dimensional series, and the space of conceptual knowledge elements
and their interconnections requires two dimensions to represent (for example, as
a matrix of connection strengths), it was clear from the beginning that the FTE
task can never provide a complete probe of a subject’s conceptual
interconnections. Nevertheless, it was a first attempt at exploring the space. In
addition, it was hoped that overall statistical patterns in a subject’s FTE data
might reveal global aspects of his or her knowledge and cognition, perhaps
serving as bulk measures in much the way that thermodynamic quantities like
temperature and pressure characterize global statistical aspects of a collection of
microscopic particles.
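One simple way to look for the clusters mentioned above is to split the term list wherever the pause before a term exceeds some threshold. The Python sketch below assumes that rule. The function name cluster_by_pause, the 5-second threshold, and the sample data are all hypothetical and are not taken from the ConMap analysis.

```python
def cluster_by_pause(entries, threshold):
    """Group terms into clusters separated by long pauses.

    entries: list of (term, start_time, enter_time) tuples in entry order,
             with times in seconds from the start of the task.
    threshold: a pause longer than this many seconds (measured from the
               previous term's entry to this term's first keystroke)
               starts a new cluster.
    """
    clusters = []
    previous_enter = None
    for term, start_time, enter_time in entries:
        if previous_enter is None or start_time - previous_enter > threshold:
            clusters.append([])  # long pause (or first term): open a new cluster
        clusters[-1].append(term)
        previous_enter = enter_time
    return clusters


# Invented sample data with one long fallow period before "energy".
sample = [
    ("force", 2.0, 3.5),
    ("mass", 4.0, 5.0),
    ("acceleration", 5.5, 8.0),
    ("energy", 20.0, 21.0),
    ("work", 21.5, 22.0),
]

print(cluster_by_pause(sample, threshold=5.0))
```

The result depends strongly on the chosen threshold, so any single value should be treated as a tuning parameter rather than a property of the subject.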
2.1.3.2. Term-Prompted Term Entry (TPTE)
For the Term-Prompted Term Entry (TPTE) task, subjects are given a prompt
term. They are asked to think of terms they consider related to this prompt term,
spontaneously and without strategy, and to type these terms into a dialog box
(shown in Figure 2.2) as the terms come to mind. The prompt term stays visible
throughout, and typed terms disappear from view as they are entered. Data
gathered is the same list of terms, term start times, and term entering times as for
the FTE task. The process repeats for several different prompt terms.
Several schemes for terminating a subject’s entering of response terms have
been considered. In initial trials, a subject’s entering of response terms was
terminated the first time ten seconds of inactivity was detected while the term-
entry field was empty, on the assumption that this indicated the subject was
having difficulty thinking of another relevant term to enter. The task would also
be terminated after the tenth response term entry. For later trials, the task was
terminated by the same criteria, except that the task would not terminate until at
least three terms had been entered.
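The termination rules just described can be written as a single predicate. This Python sketch is an interpretation of the prose, not the ConMap program's code: the function name should_terminate and its parameter names are hypothetical, and it assumes the ten-term cap applies regardless of the minimum.

```python
def should_terminate(terms_entered, idle_seconds, field_empty,
                     min_terms=0, max_terms=10, idle_limit=10.0):
    """Decide whether response entry for the current prompt should stop.

    terms_entered: number of response terms entered so far.
    idle_seconds:  time since the last keystroke or term entry.
    field_empty:   True if the term-entry field currently holds no text.
    min_terms:     0 for the initial trials, 3 for the later trials.
    """
    if terms_entered >= max_terms:
        return True   # stop after the tenth response term
    if terms_entered < min_terms:
        return False  # later trials: keep going until three terms are entered
    # Stop on ten seconds of inactivity, but only while the field is empty.
    return field_empty and idle_seconds >= idle_limit


# Initial trials: a long pause with an empty field ends the prompt at once.
print(should_terminate(2, 12.0, True))
# Later trials: the same pause is tolerated until three terms are in.
print(should_terminate(2, 12.0, True, min_terms=3))
```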
The TPTE task was intended to explore subjects’ conceptual associations in a
more focused and directed way than the FTE task allows, eliciting the strongest
associations a subject has with a particular prompt term. A mode of operation
envisioned (but not implemented in any studies to date) was to first give a
subject the FTE task on a topic, and then use the set of responses gathered as
TPTE prompt terms to fill out a web of connections between those terms. It was
hoped that such a procedure might allow computer-based elicitation and
construction of a “concept map” representation of a subject’s knowledge of a
topic, in a manner more spontaneous and therefore presumably more genuine
than occurs for traditional hand-drawn concept map tasks.
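The envisioned FTE-then-TPTE procedure, which the text notes was not implemented in any study, might look roughly like the Python sketch below. Everything here is hypothetical: the function names, the use of response order as a stand-in for association strength, and the sample terms.

```python
def build_association_map(fte_terms, tpte_responses):
    """Assemble directed prompt -> response links from TPTE data.

    fte_terms:      terms gathered from an FTE task, reused as TPTE prompts.
    tpte_responses: dict mapping each prompt to its ordered response list.
    Returns a dict mapping (prompt, response) to the response's rank
    (1 = first response given), used here as a rough proxy for strength.
    """
    links = {}
    for prompt in fte_terms:
        for rank, response in enumerate(tpte_responses.get(prompt, []), start=1):
            links.setdefault((prompt, response), rank)  # keep the earliest rank
    return links


def mutual_links(links):
    # Pairs of terms that each elicited the other as a response.
    return sorted({tuple(sorted(pair)) for pair in links
                   if (pair[1], pair[0]) in links})


# Invented sample data.
fte_terms = ["force", "mass", "energy"]
tpte_responses = {
    "force": ["mass", "acceleration"],
    "mass": ["force", "weight"],
    "energy": ["work"],
}

concept_links = build_association_map(fte_terms, tpte_responses)
print(mutual_links(concept_links))
```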
Figure 2.2: Term entry dialog for the TPTE task.
2.1.3.3. Problem-Prompted Term Entry (PPTE)
The Problem-Prompted Term Entry (PPTE) task is identical to the TPTE task,
except that instead of being prompted with a term, subjects are directed to read
the description of a problem or problem situation. Subjects then enter terms they
associate with the problem in a dialog box (see Figure 2.3). The process is
repeated for several prompt problems. In all studies to date, prompt problems
have been provided on paper in a ring binder, and the computer program
implementing the task has instructed students when to turn the page and read a
new problem.
The data gathered is identical to that gathered for the TPTE task.
The PPTE task was intended to explore the relationship between problem
solving and conceptual associations. By intent, most ConMap tasks target
conceptual knowledge structure and ignore other skills and knowledge types
relevant to problem-solving; the PPTE task is an exception in that it addresses the
interface between conceptual knowledge and problem-state knowledge.
[...]
usefulness of the task would be greatly increased. No such basis currently exists
in the context of the ConMap studies, so the TPJ task has not been included in any
studies yet conducted.
2.2. ConMap Studies
As part of the ongoing ConMap project, several studies have been
conducted, with varying population sizes, duration, task inclusion, and degree of
planning. Many of the “studies” were not intended to provide rigorous data for
full analysis, but rather to furnish preliminary data and experience as an aid to
the design of more reliable studies. Only two contained a large enough
population of subjects for serious analysis, and they provided most of the data
for Chapter 3.
The preliminary study was actually no more than a collection of test cases. It
consisted of the loose and informal presentation of various tasks, predominantly
the FTE, to individuals with a range of backgrounds and levels of expertise, in
a nonsystematic way and under inconsistently controlled conditions. The objective
was to test and debug tasks.
The Physics 119 Fall 1997 (p119f97) study drafted all 8 students from Physics
119/597T (introductory mechanics for prospective high-school physics teachers,
taught by Profs. William Gerace and Robert Dufresne). The study consisted of
one FTE task on “energy” given near the end of the course. As part of the course
itself, a HDCM task was given to students by the course instructors.
The Physics 152 Fall 1997 (p152f97) study selected 18 subjects for pay from a
pool of volunteers enrolled in Physics 152 (thermodynamics, electricity and
magnetism for engineers, taught by Prof. Jose Mestre). The study consisted of
one FTE task on the entire course domain, given at the end of the course.
The Physics 151/2 Summer 1998 (p15Xs98) study recruited five subjects for
pay from the students enrolled in the summer sessions of both Physics 151
(introductory mechanics for engineers) and Physics 152; one of the five recruits
did not complete the study. The study consisted of two sessions, each a battery of
multiple tasks. The first session was given between the end of Physics 151 and
the beginning of Physics 152, mostly on p151 material, with a “pre-test” FTE task
on p152 material. The second session was given at the end of Physics 152, on p152
material. An interview with subjects was conducted and recorded after each
session.
The Physics 151 Spring 1999 (p151s99) study selected sixteen subjects for pay
from a pool of volunteers taking Physics 151 (taught by Prof. Jose Mestre). The
study consisted of ten weekly sessions during the semester, each of fifteen
minutes duration, except for the last which lasted 1.5 hours. A variety of tasks
were given instructions for the HDCM task but did not actually draw a map, to
save time during the next session.
• Task 1: TPTE (food, travel, democracy, tree, acceleration, vector)
• Task 2: PPTE
Session B (starting 3/09) was intended to get some basic TPTE responses,
mostly on kinematics. A HDCM was given to serve as a basis of comparison with
an end-of-course HDCM using the same topic area, and to compare with TPTE
data.
• Task 1: TPTE (displacement, force, free-fall, energy, acceleration, graph)
• Task 2: HDCM (force)
Session C (starting 3/23) presented TPTE and PPTE tasks focused on forces
(previously covered in the course) and work and energy (being covered at that
time). One PPTE prompt problem was presented as a “problem situation”
without a question, to investigate how the presence or absence of a question
impacts subjects’ responses.
• Task 1: PPTE
• Task 2: TPTE (energy, force, inclined plane, equilibrium, reaction force,
work, normal force)
Session D (starting 3/30, the week of Exam 2) was a follow-up to Session C,
using many of the same prompts, to see how course coverage of the material and
studying for the exam impacted task results. The “problem situation” from
Session C was presented with a question, and another problem from Session C
was presented as a situation without a question.
• Task 1: PPTE
• Task 2: TPTE (conservative, inclinded plane [sic], equilibrium, reaction
force, work, spring, normal force)
Session E (starting 4/06) included as PPTE prompts two problems given on
the second course exam, to compare PPTE responses to exam performance. TPTE
prompt terms were drawn primarily from momentum ideas, which the course was
beginning to treat.
• Interlude 1: Group Interview
• Task 2: TPTE (inclined plane, conservative, rotation, vector,
displacement, energy, force, graph, spring, free-fall, friction, velocity)
• Interlude 2: Profile Questionnaire
• Task 3: PPTE
• Task 4: HDCM (force)
• Task 5: HDCM (momentum) [only for some subjects]
Most of the student contact required by this study — contacting and
scheduling subjects and administering sessions — was done by Dan Miller,
another graduate student.
2.2.3. Physics 172 Spring 1999 (p172s99) Study
The purpose of this study was to give a subset of the tasks and prompts from
the p151s99 study to subjects expected to have more highly structured knowledge, in
order to look for a signature of that structuring in the data.
Volunteers were solicited from the subset of students taking Physics 172
during Spring 1999 that had taken Physics 171 the previous semester. Physics 171
and 172 are the first two semesters of the introductory physics sequence for
physics majors; 171 covers mechanics, and 172 covers thermodynamics, waves,
fluids, and miscellaneous other topics. Physics 171 had been taught by Prof.
William Gerace, and 172 was taught by Prof. Ross Hicks. No financial
compensation was offered. Prof. Gerace chose five subjects from the pool of
volunteers, attempting to get a reasonable distribution of ability levels based on
his recollection of each student’s overall performance in 171.
This particular population of students was targeted because in the Physics
171 course, Prof. Gerace strongly and explicitly emphasized the structuring of
conceptual knowledge to students. It was hoped that this might leave an
observable signature in subjects’ ConMap task data. The study was conducted
during the middle and end of the subsequent physics course due to constraints
on the scheduling of the study, not for any specific research purpose.
Two sessions were held. The first was of about one-half hour duration,
within a week of 4/22/99. The second was held during the final week of classes
and lasted for approximately 1.5 hours.
Session A (starting 4/22) presented TPTE, PPTE, and HDCM tasks to
subjects, using a subset of the prompts given in the p151s99 study.
• Task 1: TPTE (food, democracy, “big ideas” of mechanics, acceleration,
force, inclined plane, energy, equilibrium, graph, momentum, collision,
work, friction)
• Task 2: PPTE
• Task 3: HDCM (force)
Session B (starting 5/13) was nearly identical to Session J of the p151s99
study. Some additional TPTE terms drawn from Physics 172 course material
were added for contrast, and the HDCM prompt was changed so that the two
HDCM tasks given in the p172s99 study used different prompts.
• Task 1: FTE (“the material covered in Physics 171 last fall”)
• Interlude 1: Group Interview
• Task 2: TPTE (inclined plane, conservative, rotation, vector,
displacement, energy, force, graph, spring, free-fall, friction, velocity,
wave, gravity, sound, light)
• Interlude 2: Profile Questionnaire
• Task 3: PPTE
• Task 4: HDCM (momentum)
Student contact for this study was also handled by Dan Miller.
2.3. Reflections on the Administration of Tasks
This section describes some of the difficulties encountered during
administration of the tasks. Weaknesses of the task and study designs that were
revealed during data analysis will be discussed in Chapter 3.
When administering term-entry tasks (FTE, TPTE, and PPTE), it was
occasionally necessary to remind subjects to restrict themselves to physics terms.
Subjects were sometimes inclined to include terms from chemistry, biology, or
everyday experience. A small number of non-serious “joke” terms were entered.
Some subjects included terms descriptive of the course and instructor as a whole
rather than of the subject matter, especially during FTE tasks. Clearer, more
explicit instructions with examples and counterexamples might be useful in this
regard, but a design decision was made not to provide subjects with any
When subjects are given a PPTE task, it is difficult to control how long they
spend considering the problem before beginning to enter response terms. Since
subjects’ reading speeds and the complexity of the prompt problems varied
significantly, it was difficult for an observer to estimate how contemplative
subjects might be during the reading phase. It was not intended that subjects pre-
think their responses at all, but some time was clearly necessary to “digest” the
problem. This difficulty may be unavoidable when using complex prompts for
spontaneous association tasks.
A related PPTE task complication arises when subjects wish to pause in their
entry of response terms and remind themselves of some aspect of the prompt
problem by looking back at it. Such a desire seems reasonable, given that subjects
have been instructed to keep their responses relevant to the prompt problem and
that they are unlikely to keep every detail of the problem in mind after one
reading. For the TPTE task, the prompting term is kept visible and prominent
directly above the term-entry box, and subjects are expected to glance at it
frequently to re-focus themselves. For the PPTE task, however, a subject’s
re-reading of the problem can introduce a significant and difficult-to-interpret
thinking time into the data, or even cause the task to terminate. This problem
might also be unavoidable when using a complex prompt for a spontaneous but
constrained association task.
When carrying out the HDCM task, some subjects demonstrated a
misunderstanding about how maps were supposed to be drawn. Some of the
maps drawn had more than one node containing the same term, presumably
either through forgetfulness or as a convenience for the mapmaker. Some maps
had branching links which connected more than two nodes together. Subjects
had read brief written instructions on how to draw a proper map, and had been
shown an example map for a non-physics topic. More explicit instructions and
training are apparently necessary.
Another difficulty which occurred with the HDCM task, and to a lesser
extent with term entry tasks, was that subjects sometimes entered incomplete
terms whose meaning was only clear from the context in which the fragment
appeared. For example, a concept map might have “kinetic” and “static” as
nodes connected to a node for “friction”, and might have other nodes for
“kinetic” and “potential”, connected to “energy”. Technically, this is a case of
duplicate nodes, since two nodes both contain the term “kinetic”. However, from
their context, the two nodes clearly refer to different concepts: “kinetic friction”
and “kinetic energy”. Most such term fragments could be completed with ease
by a domain expert during data analysis, but would pose a significant problem
for computerized analysis procedures and for integrated ConMap systems which
might, for example, use terms harvested from one task as prompts in another.
Better instruction and training of subjects might reduce the incidence of the
problem, but would probably not eliminate it entirely, since subjects may not
realize their term fragments are ambiguous.
For the long-duration FTE and HDCM tasks, subjects were intended to
concentrate on the task until time expired, even if that required them to search
their minds for minutes at a time to think of additional terms or map
elaborations. Sometimes, however, subjects appeared to relax and cease working
on such a task before time had run out, as if they had decided they had nothing
more to add. This is perhaps not a very serious problem, since a subject inclined
to make that decision might not have entered much more had they remained on
task. The more general issue of subject attention and distraction plagues all
studies in which subjects are required to concentrate for extended periods of
time, and may be unavoidable.
For the most part, the problems noted during administration of the tasks
were not major and did not appear to impact the data seriously enough to
prevent the preliminary analysis intended. The one exception is subjects’
exploitation of the TPTE/PPTE task termination loophole, which introduced a
nontrivial number of spurious data points into the timing data. Most of the
problems should be addressable through improved instruction and training
procedures in future studies, and the remainder are probably unavoidable
consequences of the natures of the tasks themselves.
motivation of the ConMap project was the belief that traditional, problem-based
exams serve as poor indicators of knowledge structure; exam grades are
therefore not expected to correlate more than weakly with interesting measures
from the ConMap data. While some attempts were made to compare ConMap
measures against course exam performance, strong and compelling results were
not expected.
It should be possible to construct exams or other instruments to probe
knowledge content and structure more effectively than traditional exams; for
research purposes, these instruments would not need to be constrained by
standard course requirements for practicality (amount of student or evaluator
time required, for example). Such instruments could in principle be used to
validate ConMap-based assessments, although none were designed into the
ConMap studies. Future studies should rectify this shortcoming.
The remainder of this chapter presents analysis of data from the ConMap
studies. Each of the chapter’s sections addresses one type of ConMap task. Data
from the task are described and summarized, and phenomenological analysis of
the data is presented. For the FTE, TPTE, and PPTE tasks, in-depth
investigation of some specific hypotheses is also discussed. For all tasks,
suggestions for follow-up studies are made, including recommendations for
design changes to rectify inadequacies discovered in the present study’s design.
3.1. Free Term Entry (FTE) Data Analysis
As described in Subsubsection 2.1.3.1, for a Free Term Entry (FTE) task,
subjects are given a target domain like “introductory mechanics” or “the material
covered in Physics 152”, and asked to type into a dialog box terms that they
associate with the domain, one term at a time, pressing the return key after each
term. Each term disappears when they press the return key. Subjects are asked to
enter the terms in the order they think of them, as close as possible to the time
they think of them, with minimal disruption of their train of thought.
Section 3.1 analyzes the data from the FTE component of the p151s99,
p152f97, and p172s99 studies. Subsection 3.1.1 takes a phenomenological
approach, describing observable statistical features of the data. Subsection 3.1.2
addresses the specific question of whether the amount of time subjects spend
thinking before entering a term is correlated with how related that term is to
immediately previous terms. Subsection 3.1.3 investigates whether a correlation
exists between subjects’ in-course exam scores and the frequency with which
their FTE response terms are apparently unrelated to immediately previous
terms. Subsection 3.1.4 summarizes the findings.
3.1.1. Phenomenological Description of Data
3.1.1.1. Raw Data
The raw data captured for each subject on each FTE task is a list of the terms
entered, in the order entered. Along with each term, the time at which the first
letter of the term was typed (start time), and the time at which the return key was
pressed to complete the term (enter time), are recorded. Times are determined by
the system clock of the computer presenting the task, and recorded to one
sixtieth of a second. For later analysis, the start time of the first entered term was
subtracted from these times, defining the “t = 0” point.
As an immediate data processing step, a typing time and thinking time are
calculated for each term. The typing time is the difference between the term’s
enter and start times, indicating how long the subject spent typing the term. The
thinking time is the difference between the term’s start time and the previous
term’s enter time, indicating how much time passed between the two terms
while the subject was not typing. The thinking time for the first entered term was
defined to be zero.
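The derivation of typing and thinking times described above can be sketched in a few lines of Python. This is only an illustration: the `TermEntry` record, the function name, and the sample values are hypothetical and are not taken from the ConMap software or data.

```python
from dataclasses import dataclass


@dataclass
class TermEntry:
    # Hypothetical record for one entered term (names are illustrative).
    term: str
    start: float  # time the first letter was typed [sec]
    enter: float  # time the return key was pressed [sec]


def typing_and_thinking_times(entries):
    """Return (typing_times, thinking_times) for one subject's response set."""
    typing, thinking = [], []
    previous = None
    for entry in entries:
        # Typing time: enter time minus start time of the same term.
        typing.append(entry.enter - entry.start)
        # Thinking time: start time minus the previous term's enter time.
        # The first term has no predecessor, so its thinking time is zero.
        thinking.append(0.0 if previous is None else entry.start - previous.enter)
        previous = entry
    return typing, thinking


# Made-up sample values, chosen only to make the arithmetic easy to follow.
entries = [
    TermEntry("force", 10.0, 12.0),
    TermEntry("mass", 15.0, 16.5),
    TermEntry("energy", 16.75, 19.0),
]
typing, thinking = typing_and_thinking_times(entries)
```

In this made-up example the second term has a thinking time of 3.0 seconds (15.0 minus 12.0) and a typing time of 1.5 seconds.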
Such data was collected and analyzed for each of three studies: p151s99 (16
subjects), p172s99 (5 subjects), and p152f97 (18 subjects). Data from one of the
p152f97 subjects was discarded because the subject clearly misunderstood the
task instructions and carried out the task in a way which made the data
meaningless.
3.1.1.2. Times as Random Variables
Consider a timeline to be a series of start times $\{t_1, t_2, \ldots, t_N\}$ for the N term entries.
Define the series of time differences $\{\Delta t_1, \Delta t_2, \ldots, \Delta t_N\}$ by $\Delta t_n = t_n - t_{n-1}$ and $t_0 = 0$, so
that the time difference for a term is equal to the term’s thinking time summed
with the previous term’s typing time. Define a term’s index to be 1 if it was the
first term entered in a subject’s FTE response set, 2 if it was the second entered,
etc. Figure 3.1, Figure 3.2, and Figure 3.3 show start time difference vs. term
index, thinking time vs. term index, and typing time vs. term index respectively
for the data set of an example subject (p151s99 study, subject 01 on task J1_FTE).
The start time differences and thinking times appear randomly distributed
inside an envelope that increases with term index. The typing times do not tend
to increase significantly with term index. For all three plots, short times appear
more common than longer ones.
Disregarding for now the systematic trend of increasing times with term
index, the sets of start time differences, thinking times, and typing times can each
be analyzed as a set of uncorrelated values drawn from a random distribution,
and the nature of those distributions can be explored. For the same example data
[Plot: start time difference ∆t [sec] vs. term index i; p151s99-01 on J1_FTE]
Figure 3.1: Start time differences vs. term index for study p151s99, subject 01 on
task J1_FTE.
[Plot: thinking time tThink [sec] vs. term index i]
Figure 3.2: Thinking time vs. term index for study p151s99, subject 01 on task
J1_FTE.
set as above, Figure 3.4, Figure 3.5, and Figure 3.6 show histograms of the natural
logarithms of the start time differences, thinking times, and typing times. The
natural logarithm of the times has been used rather than the times themselves
because short times are far more common than long times, and a linear scale that
included the longest times would lose detail for the short times. Although the set
[Plot: typing time tType [sec] vs. term index i]
Figure 3.3: Typing time vs. term index for study p151s99, subject 01 on task
J1_FTE.
[Plot: ∆tStart histogram, count vs. ln(∆tStart / 1 sec); p151s99-01 on J1_FTE]
Figure 3.4: Histogram of the natural logarithms of the start time differences for
study p151s99, subject 01 on task J1_FTE.
of typing times does not include as wide a range of times as does the set of
thinking times, the same logarithmic scale was used for consistency and ease of
comparison.
[Plot: thinkT histogram, count vs. ln(tThink / 1 sec); p151s99-01 on J1_FTE]
Figure 3.5: Histogram of the natural logarithms of the thinking times for study
p151s99, subject 01 on task J1_FTE.
[Plot: typeT histogram, count vs. ln(tType / 1 sec); p151s99-01 on J1_FTE]
Figure 3.6: Histogram of the natural logarithms of the typing times for study
p151s99, subject 01 on task J1_FTE.
The distributions of the logarithms of start time differences and thinking
times look generally normal, with a noticeable skew to the right. The thinking
time histogram has a pronounced spike to the left of its peak. The logarithmic
typing time distribution, on the other hand, has a slight tail to the left. The fact
that all three histograms are at least crudely normal indicates that the
distributions are approximately log-normal, and justifies the decision to look at
the distributions of the logarithms rather than of the times themselves. (A
random variable obeys a log-normal distribution if its logarithm obeys a normal,
i.e. Gaussian, distribution.)
Because we expect thinking times and typing times to be the fundamental,
approximately independent quantities indicative of subjects’ mental
machinations during a FTE task, and because start time differences are
dependent quantities calculable from thinking and typing times, the following
analysis will focus on thinking and typing times and not on start time
differences.
3.1.1.3. Thinking Time Distribution
Thinking times are interesting because they might plausibly provide
information about the cognitive process underlying a subject’s responses to the
FTE task. At the very least, a long thinking time probably indicates significant
cognitive processing of some kind. Characterization of thinking time statistics is
therefore of interest for characterizing individual subjects and for guiding
theoretical modeling efforts of cognitive structure and processing.
To the extent that the thinking and typing times in a subject’s FTE response
set are approximated by a log-normal distribution, the response set can be
characterized by the parameters necessary to fit such a distribution to the time
sets. These parameters might serve as useful overall measures of a subject’s
performance on the task. Residual differences between the actual subject
distributions and the best-fit curves might be illuminating, if divergences
between individual subjects’ patterns and the log-normal model can be given a
cognitive interpretation.
Fitting a normal distribution to the logarithms of a set of times produces the
same best-fit parameters as fitting a log-normal distribution to the times
themselves, and is computationally and conceptually easier. Also, rather than fit
a distribution’s probability density function (PDF) to a histogram of data, it is
advantageous to fit the distribution’s cumulative distribution function (CDF) to a
quantile plot of the data, so as to avoid the arbitrariness introduced by choosing
histogram bins. A quantile plot for a set of times is constructed by sorting the set
into increasing order and assigning to each time an ordinate equal to the fraction
of times in the set less than or equal to that time. A time whose quantile value is
0.5 is therefore the median of the set. If a random distribution’s PDF (properly
normalized for total number of points and bin width) should fit a measurement
set’s histogram, then its CDF should fit the corresponding quantile plot. For the
histogram of thinking times shown above in Figure 3.5, the corresponding
quantile plot is presented in Figure 3.7.
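The quantile-plot construction and CDF fit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the fit minimizes an unweighted sum of squared residuals with a simple step-halving search, and the function names are hypothetical. It needs at least two distinct times.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def quantile_points(times):
    """Pair each log-time with the fraction of values less than or equal to it."""
    logs = sorted(math.log(t) for t in times)
    n = len(logs)
    return [(x, bisect_right(logs, x) / n) for x in logs]


def sum_squared_residuals(points, mu, sigma):
    """Unweighted misfit between a normal CDF and the quantile points."""
    dist = NormalDist(mu, sigma)
    return sum((dist.cdf(x) - q) ** 2 for x, q in points)


def fit_normal_cdf(points, iterations=200):
    """Fit (mu, sigma) of a normal CDF to quantile points by pattern search.

    Assumption: plain least squares with a step-halving search, started
    from the sample mean and standard deviation of the log-times.
    """
    xs = [x for x, _ in points]
    mu, sigma = mean(xs), stdev(xs)
    best = sum_squared_residuals(points, mu, sigma)
    step = 0.5 * sigma
    for _ in range(iterations):
        improved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            trial_mu, trial_sigma = mu + d_mu, sigma + d_sigma
            if trial_sigma <= 0:
                continue  # sigma must stay positive
            err = sum_squared_residuals(points, trial_mu, trial_sigma)
            if err < best:
                mu, sigma, best, improved = trial_mu, trial_sigma, err, True
        if not improved:
            step /= 2  # refine the search once no neighbor improves the fit
    return mu, sigma, best
```

Because the fit works on the sorted values directly, no histogram bins have to be chosen, which is the advantage the text attributes to the quantile-plot approach.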
[Plot: thinkT histogram & fit PDF, count vs. log(tThink / 1 sec); p152f97_34 on FTE]
Figure 3.10: Histogram of logarithms of thinking times for subject p152f97-14 on
FTE task, with best-fit curve for normal (Gaussian) probability density function,
normalized for total number of counts.
To aggregate the data, it was necessary to assume that all the data sets are
roughly described by a normal distribution of thinking time logarithms. The
thinking time logarithms from each set could then be “standardized”, i.e. scaled
so that the best-fit normal distribution to the set is the “standard” normal
distribution with a mean of zero and a standard deviation of unity. If the best-fit
parameters of a normal distribution to a subject’s data set are µ and σ, and $\tau_i$ is
the logarithm of the ith thinking time, then the variable transformation which
standardizes that subject’s thinking time logarithms $\{\tau_i\}$ is
$x_i \equiv (\tau_i - \mu)/\sigma$. Once each subject’s data was standardized, all 38
sets from the three studies were aggregated into a larger data set.
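The standardize-then-aggregate step can be sketched as below. The function names and the input layout (one tuple of log-times and best-fit parameters per subject) are assumptions made for illustration only.

```python
def standardize(log_times, mu, sigma):
    """Apply x_i = (tau_i - mu) / sigma to one subject's thinking time logarithms."""
    return [(tau - mu) / sigma for tau in log_times]


def aggregate(subjects):
    """Pool standardized values across subjects.

    `subjects` is an iterable of (log_times, mu, sigma) tuples, where mu and
    sigma are that subject's own best-fit normal parameters (assumed layout).
    """
    pooled = []
    for log_times, mu, sigma in subjects:
        pooled.extend(standardize(log_times, mu, sigma))
    return sorted(pooled)  # sorted, ready for a quantile plot
```

Each subject is rescaled with their own µ and σ before pooling, so differences in overall speed between subjects do not broaden the aggregate distribution.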
Figure 3.11 shows a quantile plot for the resulting aggregate data. Two fits to
the data are included: a CDF fit for a normal distribution, and a CDF fit for a sum
of two normal distributions (“double normal” distribution), which will be
discussed below. The best-fit values for the normal distribution are close to µ = 0
and σ = 1; this is expected because all individual subject data sets were
standardized to those values before aggregating.
Figure 3.12 displays a histogram of the aggregated values, with PDF curves
shown for the two fits obtained from the quantile plot. Here the data is clearly
seen to have two separate peaks. This is not surprising given the spike-plus-peak
shape seen in the individual data sets. The broad main peak of the various sets
The double-normal CDF was fit to the aggregate quantile plot with an
iterative χ² procedure; the resulting best-fit parameters are indicated in
Figure 3.11, and the corresponding CDF and PDF curves are shown on the
quantile plot and histogram, respectively.
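For readers who want to experiment with a two-component model, a sum of two normal distributions can be written as a weighted mixture. The sketch below is one plausible parameterization and is an assumption: the weight `w` and the parameter names are hypothetical, and the exact functional form and fitting procedure used for the fits reported here are not reproduced.

```python
from statistics import NormalDist


def double_normal_cdf(x, w, mu1, sigma1, mu2, sigma2):
    """CDF of a two-component normal mixture; w is the weight of component 1."""
    first = NormalDist(mu1, sigma1).cdf(x)
    second = NormalDist(mu2, sigma2).cdf(x)
    return w * first + (1.0 - w) * second


def double_normal_pdf(x, w, mu1, sigma1, mu2, sigma2):
    """PDF of the same mixture, normalized to unit area."""
    first = NormalDist(mu1, sigma1).pdf(x)
    second = NormalDist(mu2, sigma2).pdf(x)
    return w * first + (1.0 - w) * second


def misfit(points, params):
    """Unweighted sum of squared residuals against quantile points (x, q)."""
    return sum((double_normal_cdf(x, *params) - q) ** 2 for x, q in points)
```

To overlay the PDF on a histogram, it would be multiplied by the total number of counts and the bin width. Setting `w = 1` recovers the single normal distribution, which makes the two models easy to compare on the same quantile points.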
[Plot: aggregated standardized thinkT histogram, count vs. standardized ln(tThink / 1 sec); all 3 studies, FTE (38 subjects, 2849 counts); curves: data, normal fit, doubleNormal fit]
Figure 3.12: Histogram for logarithms of thinking times, standardized by subject
to a normal distribution, aggregated across all subjects in the p151s99, p172s99,
and p152f97 studies. PDF curves for the normal and double-normal distributions
fit to the previous quantile plot are indicated.
On both the quantile plot and histogram, the double-normal distribution can
be seen to fit significantly better than the normal distribution, especially along
the leading edge of the data. A comparison of the χ² values for the two fits,
indicated on the quantile plot, supports this observation.
For a comparison between study populations, subject data sets within each
of the studies can be aggregated and fit via the same procedure as used above.
Figure 3.13 and Figure 3.14 show a quantile plot and histogram for aggregated
p151s99 data, with fits; Figure 3.15 and Figure 3.16 show the same for p152f97
data; and Figure 3.17 and Figure 3.18 show the same for p172s99 data.
weak enough to be of questionable significance. Such a positive correlation might
result if longer typing times correlate with terms that are more difficult to type
(perhaps due to unfamiliar spelling or approximate mathematical notation), with
a corresponding cognitive load that prevents thinking of the subsequent term. It
might also indicate general fatigue, so that slower typing accompanies the longer
pauses which occur late in the task. Another possibility is that subjects tend to
enter longer terms later in the task as, having exhausted their vocabulary of
shorter, simpler terms, they enter longer and more esoteric (perhaps compound)
terms.
p151s99      µ      σ      χ²
-01        1.07   0.59   0.328
-02        1.23   0.62   0.071
-03        1.20   0.53   0.015
-04        1.23   0.65   0.078
-05        1.42   0.61   0.041
-06        1.85   0.54   0.110
-07        1.19   0.54   0.114
-08        1.33   0.64   0.047
-09        1.05   0.63   0.035
-10        1.86   0.77   0.034
-11        1.48   0.51   0.031
-12        1.43   0.66   0.020
-13        0.98   0.62   0.012
-14        1.28   0.64   0.052
-15        1.41   0.75   0.044
-16        1.56   0.57   0.477
mean       1.35   0.62   0.094
std. dev.  0.25   0.07   0.127

p152f97      µ      σ      χ²
-01        0.95   0.67   0.029
-02        1.58   0.59   0.041
-03        1.49   0.79   0.026
-04        1.99   0.52   0.021
-11        2.54   1.09   0.166
-12        1.58   0.50   0.049
-13        2.29   0.99   0.102
-14        1.25   0.63   0.038
-15        1.45   0.72   0.028
-21        1.20   0.85   0.030
-22        1.46   0.74   0.040
-24        1.24   0.60   0.061
-25        1.77   0.45   0.053
-31        1.56   0.80   0.050
-32         *      *      *
-33        1.57   0.66   0.050
-34        2.22   0.77   0.035
-35        1.44   0.53   0.023
mean       1.62   0.70   0.050
std. dev.  0.42   0.17   0.036

p172s99      µ      σ      χ²
-01        1.05   0.63   0.044
-02        1.68   0.57   0.052
-03        1.55   0.60   0.059
-04        1.35   0.74   0.069
-05        1.55   0.56   0.053
mean       1.43   0.62   0.056
std. dev.  0.24   0.07   0.009
Table 3.2: Best-fit parameters when a normal distribution is fit to the set of
logarithms of thinking times, for the FTE task, of subjects in the p151s99, p152f97,
and p172s99 studies. (Subject p152f97-32 misinterpreted the task instructions in a
way that made his/her data worthless.)
Further analysis of the existing data might suggest one or another of these
possibilities, but the issue of how the mechanical aspects of term entry interact
with the cognitive aspects is general and important enough to warrant a carefully
designed study of its own. Such a study might have subjects perform an FTE
verbally, with timing data extracted after the fact from an audio recording, or in
writing, with timing data extracted from a videotape. Comparisons with a typed,
computer-mediated FTE as used in this study might clarify the impact of typing
on task performance. Independent measures of subjects’ facility at typing would
be another useful data source.
[Plot: thinking time for term i [sec] vs. typing time for term (i - 1) [sec], logarithmic axes; p151s99-01 on J1_FTE]
Figure 3.20: Successor scatterplot for thinking time against previous typing time,
for subject p151s99-01 on task J1_FTE.
3.1.1.5. Temporal Correlations
The previous sections have examined FTE start time differences, thinking
times, and typing times as if they were uncorrelated numbers drawn from a
Clustering
Although the timelines appear to show clustering, it is not obvious that this
clustering isn't an illusion of statistical fluctuations. Here, we use the term
clustering to mean a tendency for terms to come in runs separated by short times
that differs statistically from what would be expected for uncorrelated random
variables. Any series of numbers drawn from a random distribution weighted
towards small values (like a log-normal distribution) will be punctuated by
occasional large values, which, when interpreted as gap lengths in a timeline,
would give the appearance of “clustering” of the intervening shorter values.
Such clustering is no more meaningful than the various runs of consecutive
“heads” that occur during repeated tosses of a biased coin. Clustering of term
entries in a timeline, if statistically significant, indicates that the time differences
in the series are not uncorrelated, but that short time differences tend to come
close together. In other words, the data displays statistical behavior not
modelable by a sequence of uncorrelated values drawn from a random
distribution.
To determine whether the apparent clustering does in fact describe a
significant feature of the data, the series of time differences must be checked for
correlations. A simple way to do this is to construct a successor correlation plot, a
scatterplot of ∆tn vs. ∆tn–1 for all terms in a FTE response set. If time differences
are in fact uncorrelated, the points on the plot should show no correlation. If, on
the other hand, short time differences tend to come in clusters, the data points
should fall along a diagonal line with positive slope. Figure 3.27 displays
successor correlation plots for subjects in the p152f97 study.
Figure 3.27 and similar plots for all other subjects in the p151s99, p172s99,
and p152f97 studies show no obvious correlation, which suggests that the
apparent clustering visible in the timelines can be described as statistical
fluctuations in an uncorrelated random variable. Note that this does not imply
that the thinking times in a FTE data set are truly random in origin, or that
response terms are in fact unrelated to each other, or that the apparent clusters
are without meaning; it merely means that the set of thinking times has the
statistical properties of a sequence of uncorrelated random variables. It remains
quite possible that a detailed cognitive model of the processes elicited by the task
could explain the observed data without invoking random distributions.
This “no correlation” result does not even imply that there is no statistical
difference between the observed data and the pattern expected for an
uncorrelated random variable, merely that this test is not sensitive to any
difference that might exist. Other tests might be worth investigating.
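One simple numerical complement to visual inspection of a successor correlation plot is a correlation coefficient computed over the same pairs of successive values. The sketch below is an illustration only and is not a test used in this study. Using the Pearson coefficient, taking logarithms first, and skipping pairs that contain a non-positive time are all assumptions, and the function names are hypothetical.

```python
import math


def successor_pairs(values):
    """Pairs (previous, current), i.e. the points of a successor plot."""
    return list(zip(values[:-1], values[1:]))


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def successor_correlation(times):
    """Correlation between ln(previous time) and ln(current time).

    Pairs containing a non-positive time (such as the zero thinking time
    assigned to the first term) are skipped, since their log is undefined.
    """
    log_pairs = [
        (math.log(prev), math.log(cur))
        for prev, cur in successor_pairs(times)
        if prev > 0 and cur > 0
    ]
    return pearson(log_pairs)
```

A value near zero would be consistent with the "no obvious correlation" reading of the plots, while a clearly positive value would suggest that short times tend to follow short times.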
Perhaps the most significant conclusion to be drawn from this result is that it
appears unrealistic to seek a purely statistical criterion by which to identify
“clusters” in the response list for use in further analysis. If we wish to define
clusters, for example to test the hypothesis that clusters contain sequences of
terms which are related in the topic domain, an external rule must be imposed.
An example of such a rule might be “a cluster is defined as a sequence of terms
separated by thinking times of less than $\tau_c$, preceded and followed by thinking
times greater than $\tau_c$,” which depends on a choice of $\tau_c$.
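To make the example rule concrete, the following sketch splits a response list into clusters for a chosen cutoff. It is an illustration under stated assumptions: the function name and the parallel-list data layout are hypothetical, and a thinking time exactly equal to the cutoff is treated as a cluster boundary, a case the rule as stated leaves open.

```python
def split_into_clusters(terms, thinking_times, tau_c):
    """Group terms into clusters using a thinking-time cutoff tau_c.

    thinking_times[i] is the thinking time that preceded terms[i].
    A term starts a new cluster when its thinking time is >= tau_c
    (treating equality as a boundary is an assumption).
    """
    clusters = []
    for term, dt in zip(terms, thinking_times):
        if not clusters or dt >= tau_c:
            clusters.append([term])  # long pause: open a new cluster
        else:
            clusters[-1].append(term)  # short pause: extend current cluster
    return clusters
```

The result depends entirely on the choice of cutoff, which is why such a rule has to be imposed from outside rather than derived from the timing data alone.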
[Plot omitted: thinking time for term i [sec] vs. thinking time for term (i-1) [sec], logarithmic axes from 1 to 100; panel title: p151s99-01 on J1_FTE]
Figure 3.27: Successor correlation plot for thinking times, for subject p151s99-01
on task J1_FTE.
3.1.2. FTE Thinking Times vs. Term Relatedness
In Subsection 3.1.1, analysis of FTE data focused on timing information and
ignored the meanings of the terms entered by subjects. This section will attempt
[...]
notion of whether terms are related, but have difficulty explicitly identifying
their criteria. In addition, experts seem to use contextual information in their
judgments, inferring what the subject was thinking while he/she entered a series
of terms, and deciding whether a term is a jump in that context.
The following list attempts to specify some of the criteria used to decide
whether a pair of terms was “relatively related”:
• They were both within a sufficiently small topic area (e.g. collisions,
graphs, angular momentum);
• They were analogous elements of a set or list (e.g. kinds of forces, units
of measure, key principles of mechanics);
• One was a subclass or special case of the other (e.g. “force” and “spring
[force]”, “motion” and “rotation”);
• One was a situation or problem type in which the other figures
significantly (e.g. “falling objects” and “gravity”, “collision” and
“impulse”);
• They were mathematically related (e.g. “work” and “impulse”,
“velocity” and “position”);
• One was a feature of the other (e.g. “slope” and “graph”, “force” and
“free-body diagram”).
This is not a complete list, but it illustrates the kinds of relationships considered.
Note that a very important question has been ignored so far: related to whom?
The original hypothesis, based on experts’ introspection on their own experience
while performing a FTE, was that long thinking times correlate with terms
unrelated to immediately preceding terms according to their own knowledge
structure. When an expert analyzing the data examines a subject’s list of
responses and identifies terms as jumps or non-jumps, however, the judgment of
relatedness is made according to the expert’s understanding of the domain, not
the subject’s. So, even if the hypothesis is completely correct and thinking times
correlate perfectly with jumps, analysis by an expert would not show a perfect
correlation unless the expert and subject were in complete agreement about what
terms are and are not strongly related.
We assume, however, that an expert with experience teaching the domain
material can make judgments based on a structure that is reasonably close to
what earnest students, or at least the more apt ones, possess. With that
assumption, the operational hypothesis to test is that long thinking times will
correlate noisily but significantly with jumps as perceived by an expert.
In fact, if the original hypothesis is correct and thinking times reveal what
are and are not jumps to the subject, then this task could provide a mechanism
for comparing parts of a subject’s conceptual knowledge structure to an expert’s.
If a subject entered a term after a short thinking time but the term appears to be a
jump to an expert, then perhaps the subject has attached importance to a link
which ought not to be so important; this might indicate a misconception. The
converse case seems less informative: if a subject enters a term that an expert
considers related but does so after a long thinking time, it is not clear whether
the subject does not in fact associate the term with its predecessors very strongly,
or whether he/she considered several other terms and rejected them (perhaps
because they were entered earlier in the task), or whether he/she was simply
distracted for a span of time.
But first, a correlation between thinking times and term relatedness must be
established.
3.1.2.2. Distributions of Jump and Non-Jump Thinking Times
For each of the 16 subjects in the p151s99 study, an expert in introductory
mechanics with experience teaching the subject (the author) reviewed the list of
response terms for the task J1_FTE, and classified each term as a jump or non-
jump as explained above. The set of thinking times for the subject’s task
performance was then divided into a subset containing thinking times for jumps
and a subset containing thinking times for non-jumps. Figure 3.28 shows
histograms of these two subsets for subject p151s99-01, superimposed on the
same axes. For comparison, Figure 3.29 shows the two distributions as stacked
histograms, revealing the histogram for the set of all thinking times. In keeping
with Subsection 3.1.1, the natural logarithms of the thinking times have been
used rather than the times themselves.
Figure 3.30 displays one of the noisier of such histogram comparisons, for a
subject whose data contains relatively few terms. While some of the data sets are
too noisy to identify a clear peak for both histograms, for all but one of the 16
subjects, the mean and median of the jump distribution are clearly larger than the
mean and median of the non-jump distribution. The one exception is subject
p151s99-14, whose data set contains atypically few points, resulting in atypically
sparse, noisy histograms with similar means and medians.
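The comparison of the two subsets can be sketched as follows. This is a minimal illustration with hypothetical names and data layout (one thinking time and one expert label per term); it only reproduces the kind of mean and median comparison described above, using natural logarithms of the thinking times.

```python
import math
from statistics import mean, median


def split_log_times(thinking_times, is_jump):
    """Return (jump_logs, non_jump_logs): ln(thinking time) split by expert label."""
    jump_logs = [math.log(t) for t, j in zip(thinking_times, is_jump) if j]
    non_jump_logs = [math.log(t) for t, j in zip(thinking_times, is_jump) if not j]
    return jump_logs, non_jump_logs


def summarize(thinking_times, is_jump):
    """Mean and median of ln(thinking time) for each subset."""
    jump_logs, non_jump_logs = split_log_times(thinking_times, is_jump)
    return {
        "jump": (mean(jump_logs), median(jump_logs)),
        "non_jump": (mean(non_jump_logs), median(non_jump_logs)),
    }
```

For a subject with very few jumps or non-jumps, these summary values rest on only a handful of points and should be read with corresponding caution.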
The general pattern is clear: for any given subject, the thinking times
associated with jumps are generally larger than the thinking times associated
with non-jumps, but the two distributions overlap significantly. There are
typically more non-jumps than jumps, although the ratio varies by subject. For
[...]
2. Choose a threshold time which produces the same jump rate as an
expert’s categorizations;
3. Choose a threshold time which maximizes the success rate of the
resulting predictions.
Methods 1 and 3 are in fact equivalent if there are enough data points so that
the data’s discreteness is not an issue. This can be understood by looking at
Figure 3.28 or Figure 3.29 and considering a vertical line drawn at a horizontal
coordinate where the thinking time is equal to the chosen threshold time. The
total number of events to the left of that line due to both distributions is the
predicted number of non-jumps, while the total number to the right is the
predicted number of jumps. Moving that line to the right (i.e. increasing the
threshold time) increases the number of predicted non-jumps. Every time the line
passes a thinking time corresponding to a term entry while moving to the right,
the predicted classification of that term changes from incorrect to correct if the
term is part of the non-jump histogram, increasing the success rate; if the term is
part of the jump histogram, the success rate is decreased. Assuming the
distribution for jumps peaks farther to the right than the distribution for non-
jumps, the maximal success rate must therefore occur at the point at which the
two distribution curves (approximated by histograms) cross.
With discrete data rather than idealized continuous distributions, multiple
crossing points are possible, in which case the success rate has multiple local
maxima; the largest should be chosen. There may exist multiple maxima of equal
height, in which case a rule must be defined to resolve the ambiguity.
Figure 3.31 and Figure 3.32 show plots of success rate vs. threshold time for
the two example subjects of Figure 3.28 and Figure 3.30. The effect of discreteness
for small data sets is clearly visible: the first subject entered 174 terms, and the
second entered 67.
Table 3.3 shows optimal threshold times and the corresponding maximized
success rates for each subject as determined by method 3, calculated numerically
from the data rather than from histograms to avoid binning effects. For a given
subject, if the maximum success rate value occurred for multiple values of the
threshold time, the reported threshold time value is the logarithmic mean of
those values.
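A numerical search of the kind described for method 3 might look like the sketch below. Several details are assumptions rather than the procedure actually used: the candidate thresholds are taken to be the observed thinking times, a term is predicted to be a jump when its thinking time strictly exceeds the threshold, and the "logarithmic mean" of tied thresholds is interpreted as the exponential of the mean of their logarithms. Function names are hypothetical.

```python
import math


def success_rate(thinking_times, is_jump, threshold):
    """Fraction of terms whose prediction (jump iff time > threshold)
    agrees with the expert's label."""
    hits = sum((t > threshold) == j for t, j in zip(thinking_times, is_jump))
    return hits / len(thinking_times)


def best_threshold(thinking_times, is_jump):
    """Return (threshold, success_rate) maximizing the success rate.

    Candidates are the observed thinking times, so the case where every
    term is predicted to be a jump is not examined (a simplification).
    Ties are resolved by averaging the tied thresholds on a log scale.
    """
    candidates = sorted(set(thinking_times))
    rates = [success_rate(thinking_times, is_jump, c) for c in candidates]
    best = max(rates)
    tied = [c for c, r in zip(candidates, rates) if r == best]
    log_mean = math.exp(sum(math.log(c) for c in tied) / len(tied))
    return log_mean, best
```

Working directly from the individual thinking times, rather than from binned histograms, avoids any dependence on the choice of bin width.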
When interpreting the success rates, consider that if the threshold-time
prediction and the expert assignment are perfectly correlated, the success rate
will be 1; if they are completely uncorrelated, it will have a statistical expectation
value of $f_p f_j + (1 - f_p)(1 - f_j)$, where $f_p$ is the jump rate according to the
threshold-time prediction, and $f_j$ is the jump rate according to the expert’s categorization
judgments. The table includes columns for the jump rate according to the expert
[...]
subject      cutoff   ln(cutoff  max      judged  pred.   uncor.   error
             time     time)      success  jump    jump    success  rate
             [sec]               rate     rate    rate    rate     ratio
p151s99-01   20.67    3.03       0.85     0.19    0.08    0.76     0.63
p151s99-02   14.03    2.64       0.79     0.30    0.20    0.62     0.56
p151s99-03    7.00    1.95       0.79     0.40    0.44    0.51     0.43
p151s99-04   18.74    2.93       0.72     0.42    0.23    0.54     0.61
p151s99-05    3.86    1.35       0.70     0.54    0.60    0.51     0.61
p151s99-06    7.00    1.95       0.87     0.27    0.26    0.61     0.33
p151s99-07   10.53    2.35       0.79     0.43    0.27    0.53     0.45
p151s99-08    2.54    0.93       0.79     0.46    0.63    0.49     0.42
p151s99-09   13.22    2.58       0.77     0.33    0.33    0.56     0.52
p151s99-10   19.12    2.95       0.86     0.24    0.08    0.71     0.49
p151s99-11    4.07    1.40       0.83     0.46    0.48    0.50     0.33
p151s99-12   11.99    2.48       0.94     0.25    0.22    0.64     0.17
p151s99-13   14.13    2.65       0.84     0.32    0.24    0.59     0.40
p151s99-14   22.94    3.13       0.77     0.28    0.05    0.70     0.77
p151s99-15    3.82    1.34       0.70     0.46    0.48    0.50     0.60
p151s99-16    8.39    2.13       0.79     0.37    0.32    0.55     0.45
Table 3.3: Selected threshold times and corresponding success rates for
comparison of predicted and expert-judged “jump” vs. “non-jump” term
categorization, for p151s99 study, task J1_FTE; with comparison to success rate
expected if prediction and expert judgment are uncorrelated (see text).
Define the error rate of a prediction to be the success rate subtracted from
one; that is, the fraction of terms that were mispredicted. The final column shows
the ratio of the error rate of the prediction to the error rate expected for
uncorrelated predictions; values less than one indicate a smaller error rate (better
prediction), while values greater than one indicate a higher error rate (poorer
prediction). The average of that ratio across subjects is 0.49, indicating that the
threshold-time prediction method employed in this section produces about half
the errors that would be obtained by a random coin-toss with bias equal to the
number in the “predicted jump rate” column.
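The two derived columns of Table 3.3 can be recomputed from the others with the short sketch below (function names are hypothetical). The uncorrelated success rate is the chance that two independent classifications agree: both say "jump" with probability f_p * f_j, and both say "non-jump" with probability (1 - f_p) * (1 - f_j). Because the inputs here are the rounded table entries, agreement with the tabulated values is only expected to the two decimal places shown.

```python
def uncorrelated_success_rate(f_p, f_j):
    """Expected agreement between two independent classifications with
    jump rates f_p (prediction) and f_j (expert judgment)."""
    return f_p * f_j + (1 - f_p) * (1 - f_j)


def error_rate_ratio(success, f_p, f_j):
    """Ratio of the observed error rate to the error rate expected for
    uncorrelated predictions; values below one indicate better prediction."""
    return (1 - success) / (1 - uncorrelated_success_rate(f_p, f_j))


# First row of Table 3.3 (p151s99-01): success 0.85, judged 0.19, predicted 0.08
baseline = uncorrelated_success_rate(f_p=0.08, f_j=0.19)
ratio = error_rate_ratio(success=0.85, f_p=0.08, f_j=0.19)
print(round(baseline, 2), round(ratio, 2))  # 0.76 0.63
```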
Whether the listed success rates are considered adequate depends on the use
one intends for the resulting predictions. For an ideal case where a subject’s
distribution of thinking times fell into two distinct peaks, and where an expert
judged most of the terms comprising the first peak to be non-jumps and most in
the second peak to be jumps, identifying the few jumps in the first peak and the
few non-jumps in the second peak would likely be of value for pedagogic and
research purposes. For such a case, the threshold method described above would
suffice. But for a case like that displayed in Figure 3.28, the threshold method
seriously overpredicts the number of non-jumps. If the threshold is selected by
methods 1 or 3, almost all terms are predicted to be non-jumps. As a result, the
majority of jumps are mispredicted as non-jumps.
As discussed at the end of Subsubsection 3.1.2.1 above, jumps with the
timing signature of non-jumps are likely to be of more cognitive and pedagogic
interest than the converse case. The threshold method tends to overpredict such
events, reducing their usefulness. Threshold-determination method 2, requiring
the jump rate to be the same for predictions and expert judgments, would force
the threshold line left of the histograms’ crossing point on a case like that of
Figure 3.28, reducing the number of falsely predicted non-jumps at the expense
of very sharply increasing the number of falsely predicted jumps. This might be
of benefit to a cognitive or pedagogic analysis. Without a specific analysis in
mind, further discussion is not fruitful.
3.1.2.4. Incorporating Elapsed Task Time in Jump Predictions
Figure 3.33 shows a plot of thinking time vs. start time for an example
subject. Each data point represents one term-entry event, and the horizontal axis
indicates the start time of the event (the time elapsed in the task when the term
was entered). Data point markers indicate whether each term was classified as a
non-jump (cross) or jump (circle) by the expert judge.
Examining such plots for all subjects in the p151s99 study reveals some
general trends:
1. Thinking times are scattered within an envelope that increases as the
task progresses (i.e. as start time increases), in agreement with the
discussion on decreasing term entry density in Subsection 3.1.1.5.
2. The density of jumps relative to non-jumps is higher in the later part of
the task than in the earlier part.
3. Overall, jumps have larger thinking times than non-jumps, in
agreement with the previous section’s findings.
In this representation, the threshold time prediction method of the previous
section corresponds to drawing a horizontal line through the plot, and predicting
that all points above the line correspond to jumps and all points below the line
correspond to non-jumps. The fact that no such line cleanly divides the jump
points from the non-jump points is consistent with the fact that the two
histograms of Figure 3.28 overlap.
It is likely that a non-horizontal line, or even some kind of parameterized
curve, might be more successful at partitioning the jumps from the non-jumps.
This is equivalent to modifying the threshold-time prediction method to use a
threshold that varies with elapsed task time (start time). Although success rates
for such a method have not been calculated, examining graphs like Figure 3.33
for all subjects in the p151s99 study suggests that for some subjects it would be
significantly more successful, while for others (including the example subject
shown above) the improvement would be minor. Again, whether such methods
are useful depends on the purpose one has for the results.
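A threshold that varies with elapsed task time could be sketched as below. This is purely illustrative: no such calculation was carried out in the study, the straight-line form and the parameter names `intercept` and `slope` are assumptions, and no fitting procedure is implied.

```python
def predict_jumps_linear(start_times, thinking_times, intercept, slope):
    """Predict a jump when the thinking time exceeds a threshold that
    grows linearly with start time: intercept + slope * start_time.
    With slope = 0 this reduces to the fixed-threshold method."""
    return [
        dt > intercept + slope * t0
        for t0, dt in zip(start_times, thinking_times)
    ]


def agreement_rate(predicted, judged):
    """Fraction of terms where the prediction matches the expert's label."""
    hits = sum(p == j for p, j in zip(predicted, judged))
    return hits / len(judged)
```

Setting the slope to zero recovers the horizontal-line case, so the fixed threshold can serve as a baseline when comparing agreement rates for a given subject.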
[Plot omitted: thinking time [sec] (vertical axis, 0 to 100) vs. start time (horizontal axis, 0 to 1500); panel title: p151s99-01 on J1_FTE; legend: non-jump, jump]
Figure 3.33: Term thinking time vs. term start time (relative to beginning of task)
for subject p151s99-01 on task J1_FTE. Symbols indicate expert’s classification of
terms (see text).
3.1.2.5. Suggestions for Further Research
The line of inquiry discussed here in Section 3.1.2 could be pursued in
several ways. One would be to reduce the noise introduced by the expert’s
judgment of which terms should be categorized as jumps or non-jumps. A
simple improvement would be to have a panel of experts make the judgments,
rather than one expert. Explicitly identifying criteria for the experts to apply
should aid consistency of judgment.
Going a step further in this direction, a “reference proximity matrix” could
be constructed, with each cell containing a numerical value representing the
proximity or “relatedness” of a pair of terms. Constructing such a matrix would
be a laborious task, perhaps achieved by subjecting several experts to a “term
proximity judgment” (TPJ) task (cf. Section 2.1.3.6). Once the matrix exists, a rule
could be defined which categorizes terms as jumps or non-jumps according to a
numerical criterion based on that term’s proximity values to a specified number
of preceding terms. As a coincidental benefit, comparing a matrix-based
[...]
[Plot omitted: jump rate (vertical axis, 0.30 to 0.60) vs. exam 2 score (horizontal axis, 20 to 90); panel title: p152f97; annotation: r = -0.650 (0.12)]
Figure 3.34: Subject FTE jump rate vs. Exam 2 score for p152f97 study.
During the identification of “jumps”, we noticed a general trend: more jumps
seemed to occur in the second half of the FTE response sequence than in the first.
This agrees with subjects’ testimony when interviewed: toward the end of an
FTE task, subjects generally experience a greater sense of “hunting around” in
their memories for terms that they haven't yet entered, whereas in the beginning
they enter terms almost continuously.
This suggests that the first half of the FTE response sequence may be a better
indicator of structure than the later part. We therefore repeated the comparison
of r-values with exam scores, using only the first half of each subject’s FTE
response sequence to calculate a jump rate.
For the p152f97 study, Table 3.4 shows r-values for the correlation between
subjects’ jump rates and their various exam scores. Table 3.5 shows results for the
same calculation for the p151s99 study’s J1_FTE task. Table 3.6 shows the same
for the p172s99 study’s B1_FTE task.
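The quantities being compared can be sketched as follows. This is a minimal illustration with hypothetical names: the jump rate is taken to be the fraction of a subject's terms classified as jumps, and for an odd number of responses the "first half" is rounded down, which is an assumption.

```python
import math


def jump_rate(is_jump, first_half_only=False):
    """Fraction of terms classified as jumps, optionally restricted to
    the first half of the response sequence (rounded down)."""
    labels = is_jump[: len(is_jump) // 2] if first_half_only else is_jump
    return sum(labels) / len(labels)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient (no handling of zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Given one list of expert labels per subject and the matching exam scores, `pearson_r([jump_rate(labels) for labels in all_labels], exam_scores)` would yield one r-value of the kind tabulated, where `all_labels` and `exam_scores` are placeholder names for the study data.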
                     Exam 1   Exam 2   Exam 3   Final
all responses        -0.489   -0.650   -0.059   -0.665
first 1/2 responses  -0.244   -0.586   -0.587   -0.506
Table 3.4: Pearson’s r-value for correlation between subject jump rates and exam
scores in p152f97 FTE task. |r| > 0.12 for statistical significance with an 18-point
sample.
[...]
sequel course Physics 172 during the intervening semester, which might have
impacted their knowledge structure and recall, perhaps “diluting” it with the
addition of new links.
3.1.3.4. Suggestions for Future Research
The correlation between jump rate and exam performance seems strong
enough to warrant further study. Improving the procedure for identifying jumps,
as discussed in Section 3.1.2.5, would be of benefit.
A better indicator of conceptual domain expertise than course exam scores is
crucial. The study should be repeated, presenting subjects with a “problem-
solving task” along with a FTE task. The new task should require subjects to
solve carefully-crafted problems designed to test conceptual understanding of
the domain material. This should remove a tremendous amount of noise and
confounding from the correlation being studied and allow a more reasonable
assessment of the hypothesis that FTE jump rate correlates with expertise.
3.1.4. Summary of FTE Findings
To summarize the findings of Section 3.1: the timing information contained
in a subject’s FTE response list data can be separated into a set of thinking times
which describe the approximate amount of time the subject spent thinking about
each term, and a set of typing times which describe the approximate amount of
time the subject spent typing each term. For an entire response list, the set of
thinking times approximately follows a log-normal distribution, although for
most subjects there is a narrow, tall spike superimposed on the leading edge of
the generally Gaussian peak when the distribution is viewed on a logarithmic
scale. The typing times do not display this peak. When individual subjects’ sets
of thinking time logarithms were rescaled to a common mean and width and
then aggregated together, the resulting aggregate set displayed a clear two-
peaked shape which was fit well by a linear combination of two Gaussian peaks.
Individual data sets were in general too noisy to fit well with this five-parameter
curve, however.
In checking the hypothesis that at least some subjects thought about their
next terms while typing a term, it was found that there was no correlation
between a term entry event’s thinking time and the previous event’s typing time.
If significant thinking occurred during typing, one might expect to see an inverse
correlation. A few subjects showed a slight tendency for longer thinking times to
follow longer typing times, a phenomenon which has various plausible
explanations.
Subjects’ response lists showed a general pattern of decreasing density,
meaning that the number of terms entered per unit time, suitably averaged,
[...]
tabulated. Table 3.8 displays these quantities for each subject’s map from the
H2_HDCM task. Similar tables were constructed for the other three HDCM tasks
of the study (not shown here).
subject      #nodes  #links  ratio  level counts
p151s99-01   21      34      1.62   {9, 10, 2}
p151s99-02   18      29      1.61   {9, 8, 1}
p151s99-03   16      22      1.38   {5, 6, 4, 1}
p151s99-04   23      35      1.52   {5, 10, 5, 1, 1, 1}
p151s99-05   16      19      1.19   {7, 6, 3}
p151s99-06   31      37      1.19   {4, 9, 6, 4, 5, 3}
p151s99-07   40      73      1.83   {6, 14, 18, 2}
p151s99-08   37      56      1.51   {8, 11, 10, 8}
p151s99-09   29      52      1.79   {9, 13, 7}
p151s99-10    9      10      1.11   {6, 3}
p151s99-11   20      23      1.15   {4, 10, 5, 1}
p151s99-12   32      42      1.31   {12, 10, 5, 5}
p151s99-13   24      45      1.88   {10, 12, 1, 1}
p151s99-14   25      39      1.56   {4, 8, 6, 6, 1}
p151s99-15   25      33      1.32   {6, 11, 7, 1}
p151s99-16   29      47      1.62   {5, 11, 8, 5}
Table 3.8: HDCM Statistics for subjects’ maps from task H2_HDCM. See text for
column definitions.
A few subjects misunderstood the task instructions and drew maps with
invalid constructs. Two kinds of invalid construct were encountered: duplicate
nodes and branching links. In order to analyze these maps and generate the
quantitative data required, an “equivalent” valid map construct was created to
replace each invalid construct, and analysis proceeded with the valid constructs.
A duplicate node occurred when the subject put more than one node
containing the same term on a map. To create an equivalent valid construct, this
was corrected by treating all duplicate nodes as if they were topologically one
node. Thus, all duplicate versions would have the same level, determined by the
level of the one nearest to the prompt term node.
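The duplicate-node correction can be sketched as below. This is an illustration under assumptions rather than the analysis code used: the map is represented as a list of links between term strings, a node's level is taken to be the number of links on the shortest path from the prompt term, and the prompt itself is excluded from the level counts. Identifying nodes by their term merges duplicates automatically, and a breadth-first search assigns each merged node the level of its copy nearest the prompt.

```python
from collections import Counter, deque


def node_levels(links, prompt):
    """Return (levels, level_counts) for a map given as (term_a, term_b) links.

    Nodes are keyed by term, so duplicate nodes collapse into one.
    level_counts[k] is the number of nodes at level k + 1.
    """
    adjacency = {}
    for a, b in links:
        if a == b:
            continue  # a link between two copies of the same term vanishes
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    levels = {prompt: 0}
    queue = deque([prompt])
    while queue:  # breadth-first search yields shortest link distances
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in levels:
                levels[neighbor] = levels[node] + 1
                queue.append(neighbor)

    counts = Counter(lvl for term, lvl in levels.items() if term != prompt)
    return levels, [counts[k] for k in sorted(counts)]
```

Nodes that are not connected to the prompt term by any path receive no level in this sketch; how such nodes should be treated would need a separate rule.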
Branching links occurred when a subject drew a line that had branches or
intersections, so that it connected more than two nodes together. Determining an
equivalent valid construct required a subjective judgment to be made about the
subject’s intentions when drawing the branching link, which were not always
obvious. For example, if a link from node A forked to connect to nodes B and C,
should that be replaced by three valid links connecting all three pairs of nodes, or
only links from A to each of B and C? The decisions made during analysis in