Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies.

Academic Article


  • BACKGROUND: Mapping job titles to standardised occupation classification (SOC) codes is an important step in identifying occupational risk factors in epidemiological studies. Because manual coding is time-consuming and has moderate reliability, we developed an algorithm called SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiologic Research) to assign SOC-2010 codes based on free-text job description components.
    METHODS: Job title and task-based classifiers were developed by comparing job descriptions to multiple sources linking job and task descriptions to SOC codes. An industry-based classifier was developed based on the SOC prevalence within an industry. These classifiers were used in a logistic model trained using 14 983 jobs with expert-assigned SOC codes to obtain empirical weights for an algorithm that scored each SOC/job description pair. We assigned the highest-scoring SOC code to each job. SOCcer was validated in two occupational data sources by comparing SOC codes obtained from SOCcer to expert-assigned SOC codes, and by comparing lead exposure estimates obtained by linking each set of SOC codes to a job-exposure matrix.
    RESULTS: For 11 991 case-control study jobs, SOCcer-assigned codes agreed with 44.5% and 76.3% of manually assigned codes at the 6-digit and 2-digit level, respectively. Agreement increased with the score, providing a mechanism to identify assignments needing review. Good agreement was observed between lead estimates based on SOCcer and manual SOC assignments (κ 0.6–0.8). Poorer performance was observed for inspection job descriptions, which included abbreviations and worksite-specific terminology.
    CONCLUSIONS: Although some manual coding will remain necessary, using SOCcer may improve the efficiency of incorporating occupation into large-scale epidemiological studies.
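The scoring and agreement steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not SOCcer's implementation: the weights, intercept, feature names, classifier outputs, and example SOC codes are all hypothetical, since the fitted coefficients are not given in this record.

```python
import math

# Hypothetical logistic weights for the three classifier outputs;
# the values fitted in the study are not reported here.
WEIGHTS = {"title": 2.0, "task": 1.0, "industry": 0.5}
INTERCEPT = -3.0


def score(features):
    """Logistic score for one SOC/job description pair."""
    z = INTERCEPT + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def assign_soc(candidates):
    """Return (code, score) for the highest-scoring candidate SOC code.

    candidates maps a SOC code to its classifier outputs for one job.
    """
    best = max(candidates, key=lambda code: score(candidates[code]))
    return best, score(candidates[best])


def agreement(assigned, expert, digits):
    """Fraction of jobs whose codes match on the first `digits` digits."""
    def prefix(code):
        return code.replace("-", "")[:digits]

    matches = sum(prefix(a) == prefix(e) for a, e in zip(assigned, expert))
    return matches / len(expert)


# Illustrative classifier outputs for a single job description.
candidates = {
    "47-2111": {"title": 0.9, "task": 0.6, "industry": 0.4},
    "49-9051": {"title": 0.3, "task": 0.5, "industry": 0.4},
}
best_code, best_score = assign_soc(candidates)

# Illustrative agreement at the 6-digit and 2-digit levels.
assigned = ["47-2111", "47-2061"]
expert = ["47-2111", "47-2152"]
six_digit = agreement(assigned, expert, 6)
two_digit = agreement(assigned, expert, 2)
```

Because the score is returned alongside the code, low-scoring assignments could be flagged for manual review, in line with the observation that agreement increased with the score.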
  • Authors

  • Russ, Daniel E
  • Ho, Kwan-Yuet
  • Colt, Joanne S
  • Armenti, Karla
  • Baris, Dalsu
  • Chow, Wong-Ho
  • Davis, Faith
  • Johnson, Alison
  • Purdue, Mark P
  • Karagas, Margaret R
  • Schwartz, Kendra
  • Schwenn, Molly
  • Silverman, Debra T
  • Johnson, Calvin A
  • Friesen, Melissa C
  • Status

  • Publication Date: June 2016
  • Keywords

  • Algorithms
  • Carcinoma, Renal Cell
  • Case-Control Studies
  • Computers and information technology < Methodology
  • Epidemiologic Methods
  • Epidemiologic Studies
  • Humans
  • Industry
  • Job Description
  • Logistic Models
  • Natural Language Processing
  • Occupations
  • Reproducibility of Results
  • Software
  • United States
  • United States Occupational Safety and Health Administration
  • speciality
  • Digital Object Identifier (doi)

  • Start Page: 417
  • End Page: 424
  • Volume: 73
  • Issue: 6