Research report: Building a wide reach corpus for secure parser development

Tim Allison; Wayne Burke; Valentino Constantinou; Edwin Goh; Chris Mattmann; Anastasija Mensikova; Philip Southam; Ryan Stonebraker; Virisha Timmaraju

Conference Proceedings, Open Access

Proceedings - 2020 IEEE Symposium on Security and Privacy Workshops, SPW 2020 (2020) 318-326

DOI: 10.1109/SPW50608.2020.00066

Abstract

Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the Language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we detail several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for LangSec parser development.

Citation (APA)

Allison, T., Burke, W., Constantinou, V., Goh, E., Mattmann, C., Mensikova, A., … Timmaraju, V. (2020). Research report: Building a wide reach corpus for secure parser development. In Proceedings - 2020 IEEE Symposium on Security and Privacy Workshops, SPW 2020 (pp. 318–326). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SPW50608.2020.00066
