Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports

  • Tixier A
  • Hallowell M
  • Rajagopalan B
 et al. 

In the United States, as in many other countries, construction workers are more likely to be injured on the job than workers in any other industry. This poor safety performance causes large human and financial losses and has motivated extensive research. Unfortunately, safety improvement in construction has slowed over the last decade, and traditional safety programs have reached saturation. Yet major construction companies and federal agencies possess a wealth of empirical knowledge in the form of large databases of digital construction injury reports. This knowledge could be used to better understand, predict, and prevent construction accidents. However, because no clear methodology exists and manual large-scale content analysis is costly, these valuable data have yet to be extracted and leveraged. Recently, researchers have proposed a framework that allows meaningful empirical data to be extracted from accident reports, but the resource limitations inherent to manual content analysis remain. The present study tested the proposition that manual content analysis of injury reports can be eliminated using natural language processing (NLP). This paper describes (1) the overall strategy and methodology used in developing the system, and specifically how key challenges with decoding unstructured reports were overcome; (2) how the system was built through an iterative process of coding and testing against manual content analysis results from a team of seven independent analysts; and (3) the implications and potential uses of the data extracted. The results indicate that the NLP system can quickly and automatically scan unstructured injury reports for 101 attributes and outcomes with over 95% accuracy.
The main contribution of this research is to enable any organization to quickly obtain a large and highly reliable structured data set of attributes and outcomes from its databases of unstructured accident reports. Such structured data are a prerequisite for applying statistical modeling techniques, which in turn allow new safety knowledge to be extracted and, ultimately, safety management to be improved.
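
To make the idea of scanning free-text reports for attributes and outcomes concrete, the following is a minimal sketch in Python. It is not the authors' system (the keywords suggest the original was written in R, and its rules are not given here). The attribute names, keyword patterns, sample report, and the helper functions `scan_report` and `accuracy` are all hypothetical and chosen only for illustration. The sketch flags each attribute as present or absent using keyword patterns, then compares the result with a manually coded answer, mirroring the kind of validation against human analysts described above.

```python
import re

# Hypothetical lexicons: the names and keyword patterns below are
# illustrative assumptions, not the paper's 101 attributes or its rules.
ATTRIBUTE_PATTERNS = {
    "ladder": r"\bladders?\b",
    "hammer": r"\bhammers?\b",
    "working_at_height": r"\b(?:scaffold\w*|roof\w*|elevated)\b",
}
OUTCOME_PATTERNS = {
    "fracture": r"\b(?:fractur\w*|broken?)\b",
    "laceration": r"\b(?:lacerat\w*|cut)\b",
}


def scan_report(text, patterns):
    """Return a present/absent flag for every entry in `patterns`."""
    lowered = text.lower()
    return {name: bool(re.search(pat, lowered)) for name, pat in patterns.items()}


def accuracy(predicted, manual):
    """Share of labels on which the automated scan agrees with manual coding."""
    agree = sum(predicted[name] == manual[name] for name in manual)
    return agree / len(manual)


# Invented sample report, used only to exercise the sketch.
report = "Worker fell from a ladder while on the roof and fractured his wrist."
attrs = scan_report(report, ATTRIBUTE_PATTERNS)
outcomes = scan_report(report, OUTCOME_PATTERNS)

# Hypothetical manual coding of the same report by a human analyst.
manual_attrs = {"ladder": True, "hammer": False, "working_at_height": True}
agreement = accuracy(attrs, manual_attrs)
```

A real system would need far richer rules (negation, synonyms, misspellings, context), which is presumably where the iterative coding and testing described in the abstract comes in. The output of such a scan is a structured table of binary flags per report, which is the form statistical models typically require.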

Author-supplied keywords

  • Accident
  • Attribute
  • Automated content analysis
  • Injury
  • Knowledge extraction
  • Natural language processing
  • R
  • Risk
  • Safety
  • Text mining


Authors

  • Antoine J.P. Tixier

  • Matthew R. Hallowell

  • Balaji Rajagopalan

  • Dean Bowman
