Examining the Fairness of Language Test Across Gender with IRT-based Differential Item and Test Functioning Methods

Burhanettin Ozdemir, Abdulrahman Hadi Alshamrani

Abstract


Test fairness is an important indicator of the validity of test results. Fairness and equity require ensuring that background characteristics of test-takers, such as ethnicity and gender, do not affect their test scores. Differential item functioning (DIF) methods are commonly used to detect potentially biased items that lead to unfair assessment of test-takers who have the same ability levels but come from different cultural, social, demographic, and linguistic backgrounds. This study aims to detect potentially biased items across gender and to examine their effect on test scores, in order to ensure the fairness of results for each domain and for the entire test. The item response theory (IRT) based Lord’s chi-square DIF method at the item level and the Mantel-Haenszel/Liu-Agresti differential test functioning (DTF) method at the test level were applied to the English Placement Test (EPT) administered to high school graduates by the National Center for Assessment. The results show that six items of the EPT exhibit DIF: two belong to the reading comprehension domain and four to the structure domain, while no item in the compositional analysis domain shows DIF. These results indicate a content-specific DIF effect. Additionally, two items exhibit uniform DIF, one favoring male students and the other favoring female students. The small to moderate DTF effects associated with the sub-domains and the entire test imply that the DIF effects cancel each other out, assuring the fairness of results at the test level. However, the items with substantially high DIF values need to be examined by content experts to determine the possible causes of DIF, so that gender bias and unfair test outcomes can be avoided. We also suggest further studies to investigate the reasons behind content-specific DIF effects in language tests.
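
For readers less familiar with the two statistics named above: Lord’s (1980) chi-square tests whether an item’s IRT parameter estimates differ between the reference and focal groups once the estimates are linked to a common scale, while the Mantel-Haenszel/Liu-Agresti DTF approach aggregates item-level DIF effects into a test-level variance estimate (Camilli & Penfield, 1997; Penfield & Algina, 2006). A minimal sketch of the item-level statistic in Python follows; the function and all numeric values are illustrative assumptions, not estimates from the EPT study.

```python
# Minimal sketch of Lord's chi-square DIF statistic for one 2PL item,
# assuming group-wise parameter estimates (discrimination a, difficulty b)
# and their covariance matrices have already been linked to a common scale.
# All values below are hypothetical, not taken from the EPT data.
import numpy as np
from scipy.stats import chi2

def lords_chi_square(params_ref, params_focal, cov_ref, cov_focal):
    """Lord's chi-square: v' (Sigma_R + Sigma_F)^{-1} v, where v is the
    vector of differences between the groups' parameter estimates."""
    v = np.asarray(params_ref, float) - np.asarray(params_focal, float)
    sigma = np.asarray(cov_ref, float) + np.asarray(cov_focal, float)
    stat = float(v @ np.linalg.inv(sigma) @ v)
    p_value = chi2.sf(stat, df=len(v))  # df = number of item parameters (2 for a 2PL)
    return stat, p_value

stat, p = lords_chi_square(
    params_ref=[1.10, 0.25],    # e.g., male group (a, b)
    params_focal=[1.05, 0.60],  # e.g., female group (a, b)
    cov_ref=[[0.010, 0.001], [0.001, 0.020]],
    cov_focal=[[0.012, 0.001], [0.001, 0.022]],
)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")  # flag the item for DIF when p < .05
```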

https://doi.org/10.26803/ijlter.19.6.2


Keywords


test fairness and validity; gender bias; language testing; differential item functioning; differential test functioning


References


Anastasi, A., & Urbina, S. (1996). Psychological testing. Upper Saddle River, NJ: Prentice Hall.

Aryadoust, V., & Zhang, L. (2016). Fitting the mixed Rasch model to a reading comprehension test: Exploring individual difference profiles in L2 reading. Language Testing, 33(4), 529-553. doi:10.1177/0265532215594640

Borsboom, D. (2006). When does measurement invariance matter? Medical Care, 44(11), 176-181. doi:10.1097/01.mlr.0000245143.08679.cc

Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 221-256). Westport: American Council on Education & Praeger Publishers.

Camilli, G., & Penfield, D. (1997). Variance estimation for differential test functioning based on Mantel-Haenszel statistics. Journal of Educational Measurement, 34(2), 123-139. doi:10.1111/j.1745-3984.1997.tb00510.x

Camilli, G., & Shepard, L. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.

Chubbuck, K., Curley, W. E., & King, T. C. (2016). Who’s on first? Gender differences in performance on the SAT test on critical reading items with sports and science content (Report No. RR-16-26). Princeton, NJ: Educational Testing Service.

Clauser, B., & Mazor, K. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44. doi:10.1111/j.1745-3992.1998.tb00619.x

Donovan, M. A., Drasgow, F., & Probst, T. M. (2000). Does computerizing paper-and-pencil job attitude scales make a difference? New IRT analyses offer insight. Journal of Applied Psychology, 85(2), 305-313. doi:10.1037/0021-9010.85.2.305

Drabinová, A., & Martinková, P. (2016). Detection of differential item functioning with non-linear regression: Non-IRT approach accounting for guessing. Retrieved from http://hdl.handle.net/11104/0259498

Ellis, B. B., & Mead, A. D. (2000). Assessment of the measurement equivalence of a Spanish translation of the 16PF questionnaire. Educational and Psychological Measurement, 60(5), 787-807. doi:10.1177/00131640021970781

Ellis, B., & Raju, N. (2003). Test and item bias: What they are, what they aren’t, and how to measure them. In J. E. Wall & G. R. Walz (Eds.), Measuring up: Assessment issues for teachers, counselors, and administrators (pp. 89-98). Greensboro, NC: CAPS.

Ercikan, K., Arim, R., Law, D., Domene, J., Gagnon F., & Lacroix S. (2010). Application of think aloud protocols for examining and confirming sources of differential item functioning identified by expert reviews. Educational Measurement: Issues and Practice, 29(2), 24-35. doi:10.1111/j.1745-3992.2010.00173.x

Education & Training Evaluation Commission. (2020). Language Test. Retrieved from https://etec.gov.sa/en/productsandservices/Qiyas/lingual/Pages/default.aspx

Evers, A., Muñiz, J., Hagemeister, C., Høstmælingen, A., Lindley, P., Sjöberg, A., & Bartram, D. (2013). Assessing the quality of tests: Revision of the EFPA review model. Psicothema, 25(3), 283-291. doi:10.7334/psicothema2013.97

Federer, M. R., Nehm, R. H., & Pearl, D. K. (2016). Examining gender differences in written assessment tasks in biology: a case study of evolutionary explanations. CBE—Life Sciences Education, 15(1), ar2. doi:10.1187/cbe.14-01-0018

Ferne, T., & Rupp, A. A. (2007). A synthesis of 15 years of research on DIF in language testing: Methodological advances, challenges, and recommendations. Language Assessment Quarterly, 4(2), 113–148. doi:10.1080/15434300701375923

Flora, D., Curran, P., Hussong, A., & Edwards, M. (2008). Incorporating measurement nonequivalence in a cross-study latent growth curve analysis. Structural Equation Modeling, 15(4), 676-704. doi:10.1080/10705510802339080

Gierl, M., Bisanz, J., Bisanz, G., Boughton, K., & Khaliq, S. (2001). Illustrating the utility of differential bundle functioning analyses to identify and interpret group differences on achievement tests. Educational Measurement: Issues and Practice, 20(2), 26-36. doi:10.1111/j.1745-3992.2001.tb00060.x

Hambleton, R. K. (2006). Good practices for identifying differential item functioning. Medical Care, 44(11), 182-188. doi:10.1097/01.mlr.0000245443.86671.c4

Hambleton, R. K., Clauser, B. E., Mazor, K. M., & Jones, R. W. (1993). Advances in the detection of differentially functioning test items. European Journal of Psychological Assessment, 9, 1-18. Retrieved from https://eric.ed.gov/?id=ED356264

Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison of IRT and Mantel-Haenszel methods. Applied Measurement in Education, 2(4), 313-334. doi:10.1207/s15324818ame0204_4

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.

Hernández, A., Tomás, I., Ferreres, A., & Lloret, S. (2015). Tercera evaluación de test editados en España [Third evaluation of tests published in Spain]. Papeles del Psicólogo, 36(1), 1-8. Retrieved from http://www.papelesdelpsicologo.es/pdf/2484.pdf

Hope, D., Adamson, K., McManus, I. C., Chris, L., & Elder, A. (2018). Using differential item functioning to evaluate potential bias in a high stakes postgraduate knowledge based assessment. BMC Medical Education, 18, 1-7. doi:10.1186/s12909-018-1143-0

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55. doi:10.1080/10705519909540118

Hunter, C. (2014). A simulation study comparing two methods of evaluating differential test functioning (DTF): DFIT and the Mantel-Haenszel/Liu-Agresti variance (Doctoral dissertation). Georgia State University, Atlanta, GA, United States.

Jang, E. E., & Roussos, L. (2009). Integrative analytic approach to detecting and interpreting L2 vocabulary DIF. International Journal of Testing, 9(3), 238–259. doi:10.1080/15305050903107022

Lai, J. S., Teresi, J., & Gershon, R. (2005). Procedures for the analysis of differential item functioning (DIF) for small sample sizes. Evaluation & the Health Professions, 28(3), 283-294. doi:10.1177/0163278705278276

Lin, J., & Wu, F. (2003). Differential performance by gender in foreign language testing [Poster presentation]. The annual meeting of the National Council on Measurement in Education, Chicago, United States.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Luo, Y., & Al-Harbi, K. (2016). The utility of the bifactor method for unidimensionality assessment when other methods disagree: An empirical illustration. Sage Open, 6(4), 1-7. doi:10.1177/2158244016674513

Magis, D., & Facon, B. (2012). Angoff’s Delta method revisited: improving the DIF detection under small samples. British Journal of Mathematical and Statistical Psychology, 65(2), 302-321. doi:10.1111/j.2044-8317.2011.02025.x

Martinková, P., Drabinová, A., Liaw, Y., Sanders, E. A., McFarland, J. L., & Price, R. M. (2017). Checking equity: Why differential item functioning analysis should be a routine part of developing conceptual assessments. CBE—Life Sciences Education, 16(2), 1-13. doi:10.1187/cbe.16-10-0307

Millsap, R. E. (2006). Comments on methods for the investigation of measurement bias in the Mini-Mental State Examination. Medical Care, 44(11), 171-175. doi:10.1097/01.mlr.0000245441.76388.ff

Millsap, R., & Everson, H. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17(4), 297-334. doi:10.1177/014662169301700401

Nandakumar, R. (1993). Simultaneous DIF amplification and cancellation: Shealy-Stout’s test for DIF. Journal of Educational Measurement, 30(4), 293-311. doi:10.1111/j.1745-3984.1993.tb00428.x

Pae, T. (2012). Causes of gender DIF on an EFL language test: A multiple data analysis over nine years. Language Testing, 29(4), 533–554. doi:10.1177/0265532211434027

Pae, T., & Park, G. (2006). Examining the relationship between differential item functioning and differential test functioning. Language Testing, 23(4), 475-496.

Penfield, R. D. (2005). DIFAS: Differential item functioning analysis system. Applied Psychological Measurement, 29(2), 150-151. doi:10.1177/0146621603260686

Penfield, R. (2013). DIFAS 5.0: Differential item functioning analysis system. User’s manual. Retrieved from https://soe.uncg.edu/wp-content/uploads/2015/12/DIFASManual_V5.pdf

Penfield, R., & Algina, J. (2006). A generalized DIF effect variance estimator for measuring unsigned differential test functioning in mixed format tests. Journal of Educational Measurement, 43(4), 295-312. doi:10.1111/j.1745-3984.2006.00018.x

Penfield, R., & Lee, O. (2010). Test-based accountability: potential benefits and pitfalls of science assessment with student diversity. Journal of Research in Science Teaching, 47(1), 6-24. doi:10.1002/tea.20307

Raju, N., & Ellis, B. (2003). Differential item and test functioning. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behaviour in organizations: Advances in measurement and data analysis (pp. 156-188). San Francisco: Jossey-Bass.

Raju, N., van der Linden, W., & Fleer, P. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353-368. doi:10.1177/014662169501900405

Roznowski, M., & Reith, J. (1999). Examining the measurement quality of tests containing differentially functioning items: Do biased items result in poor measurement? Educational and Psychological Measurement, 59(2), 248-269. doi:10.1177/00131649921969839

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/dif from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. doi:10.1007/BF02294572

Shepard, L. A., Camilli, G., & Williams, A. F. (1985). Validity of approximation techniques for detecting item bias. Journal of Educational Measurement, 22(2), 77-105. doi:10.1111/j.1745-3984.1985.tb01050.x

Siegel, M. A. (2007). Striving for equitable classroom assessments for linguistic minorities: Strategies for and effects of revising life science items. Journal of Research in Science Teaching, 44(6), 864–881. doi:10.1002/tea.20176

Stage, C. (2005). Socialgruppsskillnader i resultat på högskoleprovet [Social group differences in scores on the Swedish Scholastic Assessment Test]. (Report No. BVM 11:2005). Umeå: Umeå University.

Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17(3), 323-340. doi:10.1191/026553200678030346

Wedman, J. (2018). Reasons for gender-related differential item functioning in a college admissions test. Scandinavian Journal of Educational Research, 62(6), 959-970. doi:10.1080/00313831.2017.1402365

Wiberg, M. (2006). Gender differences in the Swedish driving-license test. Journal of Safety Research, 37(3), 285-291. doi:10.1016/j.jsr.2006.02.005

Zhu, X., & Aryadoust, V. (2020). An investigation of mother tongue differential item functioning in a high-stakes computerized academic reading test. Computer Assisted Language Learning, 33, 1-24. doi:10.1080/09588221.2019.1704788

Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests. Language Testing, 20(2), 136-147. doi:10.1191/0265532203lt248oa

Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223–233. doi:10.1080/15434300701375832



