Glossary of Assessment Terms

Core Assessment and Evaluation Terms

Terminology from the National Council on Measurement in Education (NCME)

(Selected terms and their definitions are provided below. See the full glossary for the complete list.)

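  • Ability parameter - In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.
  • Ability testing - The use of tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.
  • Accessibility - The degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured.
  • Achievement test - A test that measures the extent of knowledge or skill a test taker has attained in a content domain in which the test taker has received instruction.
  • Assessment - Any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs; a process designed to systematically measure or evaluate the characteristics or performance of individuals, programs, or other entities, for purposes of drawing inferences; sometimes used synonymously with test.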
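  • Assessment literacy - Knowledge about testing that supports valid interpretations of test scores for their intended purposes, such as knowledge about test development practices, test score interpretations, threats to valid score interpretations, score reliability and precision, test administration, and use.
  • Authentic assessment - An assessment containing items that are judged to measure the ability to apply and use knowledge in real-world contexts.
  • Achievement levels/proficiency levels - Descriptions of a test taker's level of competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum, often labeled from “basic” to “advanced” or from “novice” to “expert,” that constitute broad ranges for classifying performance.
  • Benchmark assessments - Assessments administered in educational settings at specified times during a curriculum sequence to evaluate students' knowledge and skills relative to an explicit set of longer-term learning goals. See interim assessments.
  • Bias - 1. In test fairness, construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers and consequently affect the reliability/precision and validity of interpretations and uses of their test scores. 2. In statistics or measurement, systematic error in a test score. See predictive bias, construct underrepresentation, construct irrelevance, fairness.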
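  • Certification - A process by which individuals are recognized (or certified) as having demonstrated some level of knowledge and skill in some domain. See licensing, credentialing.
  • Classical test theory - A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker and an independent random error component.
  • Construct - The concept or characteristic that a test is designed to measure.
  • Construct-irrelevant variance - Variance in test-taker scores that is attributable to extraneous factors that distort the meaning of the scores and thereby decrease the validity of the proposed interpretation.
  • Convergent evidence - Evidence based on the relationship between test scores and other measures of the same or related construct.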
  • Credentialing - Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.
  • Criterion-referenced score interpretation - The meaning of a test score for an individual or of an average score for a defined group, indicating the individual's or group's level of performance in relation to some defined criterion domain. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations. (Contrast with norm-referenced score interpretation.)
  • Cut score - A specified point on a score scale, such that scores at or above that point are reported, interpreted, or acted upon differently from scores below that point.
  • Derived score - A score on a scale to which raw scores are converted to enhance their interpretation. Examples are percentile ranks, standard scores, and grade-equivalent scores.
  • Empirical evidence - Evidence based on some form of data, as opposed to evidence based on logic or theory.
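  • Error of measurement - The difference between an observed score and the corresponding true score. See standard error of measurement, systematic error, random error, true score.
  • Evaluation - The process of gathering information to make a judgment about the quality or worth of some program or performance. The term is also used to refer to the judgment itself, as in “My evaluation of his work is . . . .”
  • Extraneous variance - The variability in test scores that occurs among individuals because of differences in factors unrelated to the purpose of the test. For example, a science test that requires mathematics skills and reading ability beyond what its content domain specifies will have two sources of extraneous variance. In this case, students' science scores might differ not only because of differences in their science achievement but also because of differences in their (extraneous) mathematics and reading abilities. (See also construct irrelevance.)
  • Formative assessment - An assessment process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning, with the goal of improving students' achievement of intended instructional outcomes.
  • Gain score - In testing, the difference between two scores obtained by a test taker on the same test, or on two equated tests, taken on different occasions, often before and after some treatment.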
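  • Generalizability theory - A methodological framework for evaluating reliability/precision in which various sources of error variance are estimated through the application of the statistical techniques of analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied.
  • High-stakes test - A test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing. Contrast with low-stakes test.
  • Interrater agreement/consistency - The level of consistency with which two or more judges rate the work or performance of test takers. See interrater reliability. Interrater reliability: consistency in the rank ordering of ratings across raters. See interrater agreement.
  • Intrarater reliability - The degree of agreement among repetitions of a single rater in scoring test takers' responses. Inconsistencies in the scoring process that result from influences internal to the rater, rather than from true differences in test takers' performances, lead to low intrarater reliability.
  • Low-stakes test - A test used to provide results that have only minor or indirect consequences for individuals, programs, or institutions involved in the testing. Contrast with high-stakes test.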
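  • Mastery test - A test designed to indicate whether a test taker has attained a prescribed level of competence, or mastery, in a domain. See cut score, computer-based mastery test.
  • Moderator variable - A variable that affects the direction or strength of the relationship between two other variables.
  • Norm-referenced score interpretation - A score interpretation based on a comparison of a test taker's performance with the distribution of performance in a specified reference population. Contrast with criterion-referenced score interpretation.
  • Objective test - A test containing items that can be scored without any personal interpretation (subjectivity) required on the part of the scorer. Tests that contain multiple-choice, true-false, and matching items are examples.
  • Performance assessments - Assessments in which test takers actually demonstrate the skills the test is intended to measure by doing tasks that require those skills.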
  • Performance standards - Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance level labels (e.g., “basic,” “proficient,” “advanced”), statements of what test takers at different performance levels know and can do, and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut scores, performance level, performance level descriptor.
  • Random error - A nonsystematic error; a component of test scores that appears to have no relationship to other variables.
  • Raw score - The score on a test that is often calculated by counting the number of correct answers, but more generally a sum or other combination of item scores.
  • Reliability/precision - The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.
  • Reliability coefficient - A unit-free indicator that reflects the degree to which scores are free of random measurement error. See generalizability theory.
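  • Response bias - A test taker's tendency to respond in a particular way or style to items on a test (e.g., acquiescence, choice of socially desirable options, choice of “true” on a true-false test) that yields systematic, construct-irrelevant error in test scores.
  • Scoring rubric - The established criteria, including rules, principles, and illustrations, used in scoring constructed responses to individual tasks and clusters of tasks.
  • Standard error of measurement - The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data generally cannot be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
  • Standard setting - The process, often judgment based, of setting cut scores using a structured procedure that seeks to determine cut scores that define different levels of performance as specified by performance levels and performance level descriptors.
  • Standardization - 1. In test administration, maintaining a consistent testing environment and conducting tests according to detailed rules and specifications, so that testing conditions are the same for all test takers on the same and multiple occasions. 2. In test development, establishing norms based on the test performance of a representative sample of individuals from the population with which the test is intended to be used.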
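  • Summative assessment - The assessment of a test taker's knowledge and skills, typically carried out at the completion of a program of learning, such as the end of an instructional unit.
  • True score - In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of strictly parallel forms of the same test.
  • Validation - The process through which the validity of the proposed interpretation of test scores for their intended uses is investigated.
  • Validity - The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence is needed for each interpretation.
  • Weighted scores/scoring - A method of scoring a test in which the number of points awarded for a correct (or diagnostically relevant) response differs across items. In some cases, the scoring formula awards more points for one response to an item than for another response.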
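The classical test theory entries (observed score as true score plus independent random error, reliability coefficient, standard error of measurement) fit together arithmetically. The Python sketch below is illustrative only: the function names are hypothetical, and the group-data formula SEM = SD × √(1 − reliability) is a standard classical-test-theory estimate that this glossary does not itself state.

```python
import math


def reliability_from_variances(true_var: float, error_var: float) -> float:
    """Reliability as the share of observed-score variance that is true-score variance.

    Under classical test theory, observed = true + error with independent error,
    so Var(observed) = Var(true) + Var(error).
    """
    return true_var / (true_var + error_var)


def standard_error_of_measurement(observed_sd: float, reliability: float) -> float:
    """Group-data estimate of the SEM: observed SD times sqrt(1 - reliability)."""
    return observed_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate band around an observed score, assuming normally distributed error."""
    return observed - z * sem, observed + z * sem


if __name__ == "__main__":
    # Hypothetical group: true-score variance 64, error variance 16 (observed variance 80).
    rel = reliability_from_variances(64.0, 16.0)
    sem = standard_error_of_measurement(math.sqrt(80.0), rel)
    print(f"reliability={rel:.2f}, SEM={sem:.2f}")
    print(score_band(50.0, sem))
```

With these hypothetical numbers the reliability is 0.80 and the SEM estimated from group data recovers the error standard deviation of 4 score points, which is the consistency the classical model implies.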
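The derived score and norm-referenced score interpretation entries describe converting a raw score into a position within a reference group. The sketch below assumes a small hypothetical reference sample and uses common conventions: the z-score, the linear T-score (mean 50, SD 10), and a mid-point percentile rank that counts half of any tied scores. Other percentile-rank definitions exist, and operational programs typically rely on large norming samples.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def z_score(raw: float, reference: list[float]) -> float:
    """Standard score: distance from the reference mean in reference SD units."""
    return (raw - mean(reference)) / pstdev(reference)


def t_score(raw: float, reference: list[float]) -> float:
    """Linear T-score: the z-score rescaled to mean 50, SD 10."""
    return 50.0 + 10.0 * z_score(raw, reference)


def percentile_rank(raw: float, reference: list[float]) -> float:
    """Percent of the reference group scoring below `raw`, counting half of any ties."""
    ordered = sorted(reference)
    below = bisect_left(ordered, raw)
    ties = bisect_right(ordered, raw) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


if __name__ == "__main__":
    norm_group = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical raw scores: mean 5, SD 2
    for raw in (4, 9):
        print(raw, z_score(raw, norm_group), t_score(raw, norm_group),
              percentile_rank(raw, norm_group))
```

The same raw score can therefore be reported on several derived scales; each one only has meaning relative to the reference population it was computed from.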
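To make the ability parameter entry concrete, the sketch below uses the Rasch (one-parameter logistic) IRT model. This is one of several IRT models and is an assumption here, not something the glossary specifies. Item difficulties are hypothetical, and theta is estimated by maximum likelihood with Newton-Raphson steps. The estimate is infinite, and therefore rejected, for all-correct or all-incorrect response patterns.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_raw_score(theta: float, difficulties: list[float]) -> float:
    """Expected number-correct score at ability theta (an IRT counterpart of true score)."""
    return sum(rasch_probability(theta, b) for b in difficulties)


def estimate_theta(responses: list[int], difficulties: list[float],
                   iterations: int = 50, tol: float = 1e-10) -> float:
    """Maximum-likelihood ability estimate for 0/1 responses with known difficulties."""
    if all(responses) or not any(responses):
        raise ValueError("MLE is infinite for all-correct or all-incorrect patterns")
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))  # d logL / d theta
        information = sum(p * (1.0 - p) for p in probs)          # -d2 logL / d theta2
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta


if __name__ == "__main__":
    items = [-1.0, 0.0, 1.0]  # hypothetical item difficulties in logits
    theta_hat = estimate_theta([1, 1, 0], items)
    print(round(theta_hat, 3), round(expected_raw_score(theta_hat, items), 3))
```

At the estimate, the expected raw score equals the observed raw score (2 of 3 here), so under this model any response pattern with the same raw score yields the same theta.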
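The interrater entries distinguish agreement (raters giving the same score) from reliability (raters ordering test takers the same way). The sketch below contrasts exact agreement, Cohen's kappa (a chance-corrected agreement index that the glossary does not name and that is chosen here only for illustration), and a Spearman rank correlation as one way to quantify consistency in rank ordering. The ratings and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def exact_agreement(a: list[int], b: list[int]) -> float:
    """Proportion of test takers given identical scores by both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Agreement corrected for the agreement expected from each rater's score distribution."""
    n = len(a)
    observed = exact_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def average_ranks(values: list[int]) -> list[float]:
    """1-based ranks, giving tied values the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(a: list[int], b: list[int]) -> float:
    """Pearson correlation of the two raters' rank orderings."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    denom = sqrt(sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb))
    if denom == 0:
        raise ValueError("a rater with constant scores has no rank ordering")
    return cov / denom


if __name__ == "__main__":
    rater_1 = [1, 2, 3, 4, 2, 3]
    rater_2 = [2, 3, 4, 5, 3, 4]  # always one point more lenient than rater_1
    print(exact_agreement(rater_1, rater_2))  # no exact matches
    print(cohens_kappa(rater_1, rater_2))     # negative: below chance-level agreement
    print(spearman_rho(rater_1, rater_2))     # identical rank ordering
```

A consistently lenient rater can thus show poor agreement yet high consistency in rank ordering, which illustrates why the two concepts are defined separately.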
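The weighted scoring, cut score, and performance standards entries can be combined into a small scoring pipeline. In this hypothetical sketch, a scoring key assigns points per response option, so items (and responses within an item) can carry different weights. Ascending cut scores then map the total to performance-level labels, with a score exactly at a cut placed in the higher level, as the cut score definition requires. The key, cut scores, and labels are invented for illustration; in practice cut scores come from a standard-setting process.

```python
from bisect import bisect_right


def weighted_score(responses: list[str], scoring_key: list[dict[str, float]]) -> float:
    """Sum the points the key awards for each selected option (unlisted options earn 0)."""
    return sum(key.get(choice, 0.0) for choice, key in zip(responses, scoring_key))


def performance_level(score: float, cut_scores: list[float], labels: list[str]) -> str:
    """Return the label for `score`; scores at or above a cut fall in the higher level."""
    if len(labels) != len(cut_scores) + 1:
        raise ValueError("need exactly one more label than cut scores")
    if list(cut_scores) != sorted(cut_scores):
        raise ValueError("cut scores must be in ascending order")
    return labels[bisect_right(cut_scores, score)]


if __name__ == "__main__":
    # Hypothetical key: item 2 gives partial credit for option "C".
    key = [{"A": 1.0}, {"B": 2.0, "C": 1.0}, {"D": 3.0}]
    cuts = [2.0, 5.0]
    levels = ["Basic", "Proficient", "Advanced"]
    for answers in (["A", "C", "B"], ["A", "B", "D"]):
        total = weighted_score(answers, key)
        print(answers, total, performance_level(total, cuts, levels))
```

Using `bisect_right` is what places a score equal to a cut in the upper category; `bisect_left` would instead treat the cut as the top of the lower category.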
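The generalizability theory entry mentions estimating sources of error variance with analysis of variance. A minimal sketch for the simplest case assumes a fully crossed persons-by-items (p × i) design with one score per cell. It estimates the person, item, and residual variance components from the ANOVA mean squares, then forms a relative generalizability coefficient and an absolute (phi) coefficient for a test of a chosen number of items. The design, the function names, and the convention of truncating negative estimates at zero are assumptions for illustration.

```python
def g_study_p_by_i(scores: list[list[float]]) -> dict[str, float]:
    """Variance components for a crossed persons x items design (rows = persons)."""
    n_p, n_i = len(scores), len(scores[0])
    if n_p < 2 or n_i < 2:
        raise ValueError("need at least two persons and two items")
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Sums of squares for persons, items, and the residual (p x i interaction + error).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations; truncate negative estimates at zero.
    return {
        "person": max((ms_p - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_p, 0.0),
        "residual": max(ms_res, 0.0),
    }


def g_coefficient(components: dict[str, float], n_items: int) -> float:
    """Relative coefficient: person variance over person plus averaged residual variance."""
    p = components["person"]
    return p / (p + components["residual"] / n_items)


def phi_coefficient(components: dict[str, float], n_items: int) -> float:
    """Absolute coefficient: item variance also counts as error."""
    p = components["person"]
    return p / (p + (components["item"] + components["residual"]) / n_items)


if __name__ == "__main__":
    data = [[1, 3], [4, 4], [5, 7]]  # hypothetical scores: 3 persons x 2 items
    comps = g_study_p_by_i(data)
    print(comps)
    for n in (2, 4, 8):
        print(n, round(g_coefficient(comps, n), 3), round(phi_coefficient(comps, n), 3))
```

Varying `n_items` shows how averaging over more items shrinks the error terms, which is the kind of decision-study question generalizability theory is often used to answer.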