While Vol. IV focused on variable gauge performance, this installment of “The War on Error” presents the study of attribute gauges. Requiring the judgment of human appraisers adds a layer of nuance to attribute assessment. In fact, although we refer to attribute gauges, an assessment may involve no instrument at all, relying exclusively on the human senses. Thus, analysis of attribute gauges may be less intuitive or straightforward than that of their variable counterparts.
Conducting attribute gauge studies is similar to variable gauge R&R studies. The key difference is in data collection – rather than a continuum of numeric values, attributes are evaluated with respect to a small number of discrete categories. Categorization can be as simple as pass/fail; it may also involve grading a feature relative to a “stepped” scale. The scale could contain several gradations of color, transparency, or other visual characteristic. It could also be graded according to subjective assessments of fit or other performance characteristic.
Before the detailed discussion of attribute gauge studies begins, we should clarify why subjective assessments are used. The most obvious reason is that no variable measurement method or apparatus exists to evaluate a feature of interest. However, there are some variable measurements that are replaced by attribute gauges for convenience. Variable gauges that are prohibitively expensive, operate at insufficient rates, or are otherwise impractical in a production setting are often supplanted by less sophisticated attribute gauges. Sophisticated equipment may be used to create assessment “masters” or to validate subjective assessments, while simple tools are used to maintain productivity. Periodic verification ensures that quality is not sacrificed to achieve desired production volumes.
Although direct calculations of attribute gauge repeatability and reproducibility may not be practical without the use of a software package, the assessments described below are analogous. The “proxy” values calculated provide sufficient insight to develop confidence in the measurement system and to identify opportunities to improve its performance.
Like variable gauge R&R studies, evaluation of attribute gauges can take various forms. The presentation below is not comprehensive, but an introduction to common techniques. Both types of analysis require similar groundwork to be effective; readers may want to review the “Preparation for a GRR Study” section in Vol. IV before continuing.
Attribute Short Study
The “short method” of attribute gauge R&R study requires two appraisers to evaluate each of (typically) 30 parts twice. This presentation uses a binary decision – “accept” or “reject” – to demonstrate the study method, though more categories could be used. Exhibit 1 presents one possible format of a data collection and computation form. Use of the form is described below, referencing the number bubbles in Exhibit 1.
1: Each appraiser’s evaluations are recorded as they are completed, in random order, in the columns labeled “Trial 1” and “Trial 2.” Simple notation, such as “A” for “accept” and “R” for “reject” is recommended to simplify data collection and keep the form neat and legible.
2: Decisions for each part are compared to determine the consistency of each appraiser’s evaluations. If an appraiser reached the same conclusion both times s/he evaluated a part, a “1” is entered in that appraiser’s consistency column in the row corresponding to that part. If different conclusions were reached, a “0” is recorded. Below the trial data entries, the number of consistent evaluation pairs is tallied. The proportion of parts receiving consistent evaluations is then calculated and displayed as a percentage.
3: The standard evaluation result is recorded for each part. The standard can be determined via measurement equipment, “expert” evaluation, or other trusted method. The standard must be unknown to the appraisers during the study.
4: The first two “Agreement” columns record where each appraiser’s evaluation matches the standard (enter “1”). A part’s evaluation can only be scored a “1” for agreeing with the standard if the appraiser reached the same conclusion in both trials. Put another way, for A=Std = 1, A=A must equal 1; if A=A = 0, A=Std is automatically ”0.” Column results are totaled and percentages calculated.
5: Place a “1” in the “A/B” Agreement column for each part with consistent appraiser evaluations (A=A = 1 and B=B = 1) that match each other. If either appraiser is inconsistent (A=A = 0 or B=B = 0), or the two appraisers’ evaluations do not match, enter “0” in this column. Total the column and calculate the percentage of matching evaluations.
6: The final column records the instances when both appraisers were consistent, in agreement with each other, and in agreement with the standard. For each part that obtained these results, a “1” is entered in this column; all others receive a “0.” Again, total the column results and calculate the percentage of parts that meet the criteria.
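The tallies described in steps 1 – 6 can be sketched in code. The following is a minimal illustration; the function name and data layout are my own, not taken from the Exhibit 1 form, and a binary accept/reject study with two trials per appraiser is assumed:

```python
# Sketch of the short-study tallies described in steps 1-6; data layout and
# names are illustrative, not taken from the Exhibit 1 form.
def short_study(a_trials, b_trials, standard):
    """a_trials, b_trials: one (trial 1, trial 2) pair of 'A'/'R' calls per part;
    standard: the known 'A'/'R' result for each part."""
    n = len(standard)
    a_aa = [int(t1 == t2) for t1, t2 in a_trials]   # step 2: A=A consistency
    b_bb = [int(t1 == t2) for t1, t2 in b_trials]   #         B=B consistency
    # step 4: agreement with the standard requires consistency first
    a_std = [int(a_aa[i] and a_trials[i][0] == standard[i]) for i in range(n)]
    b_std = [int(b_bb[i] and b_trials[i][0] == standard[i]) for i in range(n)]
    # step 5: both appraisers consistent and matching each other
    ab = [int(a_aa[i] and b_bb[i] and a_trials[i][0] == b_trials[i][0])
          for i in range(n)]
    # step 6: consistent, matching each other, and matching the standard
    total = [int(ab[i] and a_trials[i][0] == standard[i]) for i in range(n)]

    def pct(col):
        return 100.0 * sum(col) / n

    return {"A=A": pct(a_aa), "B=B": pct(b_bb), "A=Std": pct(a_std),
            "B=Std": pct(b_std), "A=B": pct(ab), "A=B=Std": pct(total)}
```

With 30 parts, the six percentages returned reproduce the tallies at the bottom of the form.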
The percentage values calculated in Exhibit 1 are proxy values; the objectives are inverted compared to variable gauge R&R studies. That is, higher values are desirable in attribute gauge short studies.
The consistency values (A=A, B=B) are analogous to repeatability in variable gauge studies. Appraisers’ results could be combined into a single value to more closely parallel a variable study; however, this is not a standard practice. Caution must be exercised in its use; an explicit explanation of its calculation and interpretation must accompany any citation to prevent confusion or misuse.
A/B Agreement (A=B) is analogous to variable gauge reproducibility; it is an indication of how well-developed the measurement system is. Better-developed attribute systems will produce more matching results among appraisers, just as is the case with variable systems.
The composite value in the attribute study, analogous to variable system R&R, is Total Agreement (A=B=Std). This value reflects the measurement system’s ability to consistently obtain the “correct” result over time when multiple appraisers are employed.
While the calculations and interpretations are quite different from a variable R&R study, the insight gained from an attribute short study is quite similar. The results will aid the identification of improvement opportunities, whether in appraiser training, refining instructions, clarifying acceptance standards, or equipment upgrades. The attribute short study is an excellent starting point for evaluating systems that, historically, had not received sufficient attention to instill confidence in users and customers.
Effectiveness and Error Rates
Perhaps even shorter than the short study described above, measurement system effectiveness can be calculated to provide shallow, but rapid, insight into system performance. Measurement system effectiveness is defined as:
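In the conventional form (as presented in the AIAG MSA manual listed in the references):

```latex
\text{Effectiveness} = \frac{\text{number of correct decisions}}{\text{total opportunities for a decision}}
```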
Effectiveness assessments are typically accompanied by calculations of miss rates and false alarm rates. Although all of these values are defined in terms of a measurement system, they are calculated per appraiser.
An appraiser’s miss rate represents his/her Type II error and is calculated as follows:
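In the conventional form, a “miss” being the acceptance of a nonconforming part:

```latex
\text{Miss Rate} = \frac{\text{number of misses}}{\text{number of opportunities for a miss}}
```

where the opportunities for a miss are the appraiser’s evaluations of parts known to be nonconforming.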
The false alarm rate represents an appraiser’s Type I error; it is calculated as follows:
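In the conventional form, a “false alarm” being the rejection of a conforming part:

```latex
\text{False Alarm Rate} = \frac{\text{number of false alarms}}{\text{number of opportunities for a false alarm}}
```

where the opportunities for a false alarm are the appraiser’s evaluations of parts known to be conforming.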
Appraisers’ performance on each metric is compared to threshold values – and to the other appraisers’ results – to assess overall measurement system performance. One set of thresholds used for this assessment is presented in Exhibit 2. These guidelines are not statistically derived; they are empirical results. As such, users may choose to modify these thresholds to suit their needs.
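A minimal sketch of the three per-appraiser calculations, assuming the conventional definitions given above (function name and data layout are illustrative):

```python
# Sketch of the per-appraiser metrics, assuming the conventional definitions:
# effectiveness = correct decisions / total decisions; a "miss" accepts a
# nonconforming part (Type II); a "false alarm" rejects a conforming part (Type I).
def appraiser_metrics(decisions, standard):
    """decisions: one 'A' (accept) or 'R' (reject) per evaluation;
    standard: the matching true dispositions."""
    correct = sum(d == s for d, s in zip(decisions, standard))
    on_bad = [d for d, s in zip(decisions, standard) if s == "R"]   # chances to miss
    on_good = [d for d, s in zip(decisions, standard) if s == "A"]  # chances to false-alarm
    effectiveness = correct / len(decisions)
    miss_rate = on_bad.count("A") / len(on_bad)
    false_alarm_rate = on_good.count("R") / len(on_good)
    return effectiveness, miss_rate, false_alarm_rate
```

Running this once per appraiser supports both the independent and aggregate reviews described below.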
Identification of improvement opportunities requires review of each appraiser’s results independently and in the aggregate. For example, low effectiveness may indicate that an appraiser requires remedial training. However, if all appraisers demonstrate low effectiveness, the cause may lie deeper in the measurement system itself. This type of discovery is only possible when both levels of review are conducted. More sophisticated investigations may be required to identify specific issues and opportunities.
Cohen’s Kappa
Cohen’s Kappa, often called simply “kappa,” is a measure of agreement between two appraisers of attributes. It accounts for agreement due to chance to assess the true consistency of evaluations among appraisers. A data summary table is presented in Exhibit 3 for the case of two appraisers, three categories, and one trial. The notation used is as follows:
Total agreement between appraisers, including that due to chance, is
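Assuming the Exhibit 3 notation in which a_ii denotes the count in the ith diagonal cell (both appraisers chose category i), this can be written

```latex
p_a = \frac{\sum_i a_{ii}}{n}
```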
where n = ∑Rows = ∑Cols = the total number of evaluation comparisons. In order to subtract the agreement due to chance from total agreement, the agreement due to chance for each categorization is calculated and summed:
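With R_i and C_i denoting the row and column totals for the ith category (an assumed notation, consistent with ∑Rows and ∑Cols above):

```latex
c_i = \frac{R_i \, C_i}{n}, \qquad P_i = \frac{c_i}{n} = \frac{R_i \, C_i}{n^2}
```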
where Pi is the proportion of agreements and ci is the number of agreements due to chance in the ith category. Therefore, ∑Pi is the proportion of agreements in the entire study due to chance, or pε. Likewise, ∑ci is the total number of agreements in the entire study that are due to chance, or nε.
To find kappa, use
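the standard formulation:

```latex
\kappa = \frac{p_a - p_\varepsilon}{1 - p_\varepsilon} = \frac{n_a - n_\varepsilon}{n - n_\varepsilon}
```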
where pa and na are the proportion and number, respectively, of appraiser agreements. To validate the kappa calculation, confirm that κ ≤ pa ≤ 1. Also, 0 ≤ κ ≤ 1 is a typical requirement. A kappa value of 1 indicates “perfect” agreement, or reproducibility, between appraisers, while κ = 0 indicates no agreement whatsoever beyond that due to chance. Discussion of negative values of kappa, allowed in some software, is beyond the scope of this presentation.
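A minimal sketch of the kappa calculation for two appraisers, any number of categories, and one trial (function name and data layout are illustrative):

```python
# Sketch of Cohen's kappa for two appraisers and one trial per part.
def cohens_kappa(a_calls, b_calls):
    """a_calls, b_calls: parallel lists of category labels, one pair per part."""
    n = len(a_calls)
    p_a = sum(a == b for a, b in zip(a_calls, b_calls)) / n  # observed agreement
    # chance agreement: sum over categories of (row proportion) x (column proportion)
    cats = set(a_calls) | set(b_calls)
    p_e = sum((a_calls.count(c) / n) * (b_calls.count(c) / n) for c in cats)
    return (p_a - p_e) / (1 - p_e)
```

Note that p_a here always satisfies κ ≤ p_a ≤ 1, the validation check described above.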
Acceptance criteria within the 0 ≤ κ ≤ 1 range vary by source. Minimum acceptability is typically placed in the 0.70 – 0.75 range, while κ > 0.90 is desirable. If you prefer percentage notation, κ > 90% is your ultimate goal. Irrespective of specific threshold values, a higher value of kappa indicates a more consistent measurement system. Note, however, that Cohen’s Kappa makes no reference to standards; therefore, evaluation of a measurement system by this method is incomplete.
Analytic Method
Earning its shorthand title of “long method” of attribute gauge R&R study, the analytic method uses a non-fixed number of samples, known reference values, probability plots, and statistical lookup tables. A quintessential example of this technique’s application is the validation of an accept/reject apparatus (e.g. Go/NoGo plug gauge) used in production because it is faster and more robust than a more precise instrument (e.g. bore gauge).
Data collection begins with the precise measurement of eight samples to obtain reference values for each. Each part is then evaluated twenty times (m = 20) with the attribute gauge; the number of times each sample is accepted is recorded. Results for the eight samples must meet the following criteria:
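As the analytic method is presented in the AIAG MSA manual (listed in the references), the acceptance counts, a, typically must satisfy:

```latex
a_{\text{smallest}} = 0, \qquad a_{\text{largest}} = 20, \qquad 1 \le a \le 19 \ \text{for the six remaining samples}
```

If these criteria are not met, additional samples are measured until they are.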
From the probability calculations, a Gauge Performance Curve (GPC) is generated; the format shown in Exhibit 4 may be convenient for presentation. However, the preferred option, for purposes of calculation, is to create the GPC on normal probability paper, as shown in Exhibit 5. The eight (or more) data points are plotted and a best-fit line drawn through the data. The reference values plotted in Exhibits 4 and 5 are deviations from nominal, an acceptable alternative to the actual measurement value.
Measurement system bias can now be determined from the GPC as follows:
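In the AIAG form, for a gauge evaluating the lower tolerance limit L, with the prescribed probability of acceptance P_a = 0.5:

```latex
\text{Bias} = L - X_t
```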
where Xt is the reference value at the prescribed probability of acceptance.
Measurement system repeatability is calculated as follows:
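In the standard form, using reference values read from the GPC line at two acceptance probabilities:

```latex
\text{Repeatability} = \frac{X_t\!\left(P_a = 0.995\right) - X_t\!\left(P_a = 0.005\right)}{1.08}
```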
where 1.08 is an adjustment factor used when m = 20.
Significance of the bias is evaluated by calculating the t-statistic,
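in a common form (per the AIAG MSA manual):

```latex
t = \frac{31.3 \times \left|\text{Bias}\right|}{\text{Repeatability}}
```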
and comparing it to t0.025,df. For this case, df = m - 1 = 19 and t0.025,19 = 2.093, as found in the lookup table in Exhibit 6. If t > 2.093, the measurement system exhibits significant bias; potential corrective actions should be considered.
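The analytic method can be sketched in code as follows. This illustration fits the best-fit line by least squares on the normal-score scale rather than by hand on probability paper, and it assumes the AIAG conventions described above (a lower tolerance limit, the 1.08 adjustment for m = 20, and the 31.3 factor in the t-statistic); the function name and parameters are my own:

```python
# Sketch of the analytic ("long") method: fit the GPC as a straight line on the
# normal-score scale, then derive bias, repeatability, and the t-statistic.
# A least-squares fit stands in for a hand-drawn line on probability paper.
from statistics import NormalDist

def gpc_fit(ref_values, accept_counts, m=20, limit=0.0):
    """ref_values: reference value (deviation from nominal) for each sample;
    accept_counts: times each sample was accepted out of m trials;
    limit: the tolerance limit being gauged (an assumption of this sketch)."""
    nd = NormalDist()
    # Convert acceptance proportions to normal scores; a = 0 or a = m cannot be plotted.
    pts = [(x, nd.inv_cdf(a / m)) for x, a in zip(ref_values, accept_counts)
           if 0 < a < m]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    mz = sum(z for _, z in pts) / n
    slope = (sum((x - mx) * (z - mz) for x, z in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    intercept = mz - slope * mx

    def x_at(p):  # reference value at acceptance probability p, from the fitted line
        return (nd.inv_cdf(p) - intercept) / slope

    bias = limit - x_at(0.5)                               # limit minus Xt at Pa = 0.5
    repeatability = abs(x_at(0.995) - x_at(0.005)) / 1.08  # 1.08 adjustment for m = 20
    t_stat = 31.3 * abs(bias) / repeatability              # compare to t(0.025, m-1) = 2.093
    return bias, repeatability, t_stat
```

The fitted line also serves the predictive role described below: `x_at` can be inverted to estimate the probability of acceptance for any reference value near the expected range of variation.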
Like the previous methods described, the analytic method does not provide R&R results in the same way that a variable gauge study does. It does, however, provide powerful insight into attribute gauge performance. One advantage of the long study is the predictive ability of the GPC. The best-fit line provides an estimate of the probability of acceptance of a part with any reference value in or near the expected range of variation. From this, a risk profile can be generated, focusing improvement efforts on projects with the greatest expected value.
Other methods of attribute gauge performance assessment are available, including variations and extensions of those presented here. The techniques described are appropriate for new analysts, or for measurement systems that have been subject to no previous assessment, and can serve as stepping stones to more sophisticated investigations as experience is gained and “low-hanging fruit” is harvested.
JayWink Solutions is available to assist you and your organization with quality and operational challenges. Contact us for an independent review of your situation and action plan proposal.
For a directory of “The War on Error” volumes on “The Third Degree,” see “Vol. I: Welcome to the Army.”
[Link] “Measurement Systems Analysis,” 3rd ed. Automotive Industry Action Group, 2002.
[Link] “Conducting a Gage R&R.” Jorge G. Tavera Sainz; Six Sigma Forum Magazine, February 2013.
[Link] “Introduction to the Gage R & R.” Wikilean.
[Link] “Attribute Gage R&R.” Samuel E. Windsor; Six Sigma Forum Magazine, August 2003.
[Link] “Cohen's Kappa.” Real Statistics, 2020.
[Link] “Ensuring R&R.” Scott Force; Quality Progress, January 2020.
[Link] “Measurement system analysis with attribute data.” Keith M. Bower; KeepingTAB #35 (Minitab Newsletter), February 2002.
[Link] Creating Quality. William J. Kolarik; McGraw-Hill, Inc., 1995.
Jody W. Phelps, MSc, PMP®, MBA
JayWink Solutions, LLC
If you'd like to contribute to this blog, please email email@example.com with your suggestions.
© JayWink Solutions, LLC