Auditing Section Research Summaries Space

A Database of Auditing Research - Building Bridges with Practice


research summary

    research summary posted October 20, 2015 by Jennifer M Mueller-Phillips, tagged 06.0 Risk and Risk Management, Including Fraud Risk, 06.01 Fraud Risk Assessment, 06.02 Fraud Risk Models 
    Accounting Variables, Deception, and a Bag of Words: Assessing the Tools of Fraud Detection.
    Practical Implications:

    This paper presents a fraud-detection tool based on textual analysis of the MD&A sections of public companies’ annual and quarterly reports. The tool correctly classifies reports as truthful or fraudulent more than 82% of the time. Compared with other fraud-detection approaches documented in prior literature, it has the highest predictive power for both annual and quarterly reports. Using the tool to analyze a sequence of a company’s reports further increases prediction accuracy. The paper provides insights for regulators and practitioners designing fraud-detection tools. Because the tool is “trained” on the AAER database, one limitation is that it may fail to detect fraudulent reports if the SEC fails to discover certain types of fraud or is biased in selecting firms to investigate.


    Purda, L. and D. Skillicorn. 2015. Accounting Variables, Deception, and a Bag of Words: Assessing the Tools of Fraud Detection. Contemporary Accounting Research 32 (3): 1193–1223.

    corporate fraud, financial disclosure, textual analysis
    Purpose of the Study:

    Academics, audit firms, and regulators have developed many tools to detect accounting fraud in the U.S. This paper aims to demonstrate that changes in the writing and presentation style of the management discussion and analysis (MD&A) section, as captured by a data-generated language tool, have high predictive power for fraud. The study also compares the effectiveness of various fraud-detection tools, including financial, language-based, and nonfinancial tools, and analyzes the correlations among them. The findings will help investors, regulators, and practitioners select effective tools and use them optimally.

    Design/Method/Approach:

    The authors adopt a bag-of-words methodology to develop the fraud-detection tool. Unlike other language-based tools built with the same methodology, this tool does not rely on a list of predictive words identified ex ante. Rather, the authors let the data generate the bag of words. First, they extract a sample of annual and quarterly reports for the period 1999 to 2006 from the EDGAR database. Second, they use the SEC’s AAER bulletins to identify which reports are fraudulent. The truthful and fraudulent reports together form a database that is used to “train” the tool to identify the subtle relationships between words in the MD&A sections and fraudulent reporting. Third, the authors use a decision tree approach to create a list of the top 200 words ranked by their ability to identify fraudulent reports. Based on this list, they build a model that calculates the probability of truthful reporting for each report.
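
    The idea of letting labelled data generate the word list can be sketched roughly as follows. This is a simplified illustration, not the authors’ implementation: it ranks each word by the information gain of a single split on whether the word appears in a report, which is the kind of criterion a decision tree uses at one node. The function names, the toy MD&A snippets, and their labels are all invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, word):
    """Entropy reduction from splitting reports on presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    remainder = (len(with_word) / n) * entropy(with_word) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


def rank_words(texts, labels, top_k=200):
    """Return the top_k words by information gain (ties broken alphabetically)."""
    docs = [set(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, w), w) for w in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top_k]]


# Invented toy snippets; 1 = fraudulent, 0 = truthful (hypothetical labels).
texts = [
    "revenue growth was strong despite acquisition charges",
    "acquisition related adjustments increased revenue",
    "sales declined due to weaker demand",
    "operating costs declined and sales were stable",
]
labels = [1, 1, 0, 0]
top_words = rank_words(texts, labels, top_k=3)
```

    In this toy data the highest-ranked words are simply those that appear only in one class. A real application would need a far larger labelled sample and a held-out test set before any ranking could be trusted.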

    Findings:

    • The Receiver Operating Characteristic (ROC) area is a statistic ranging from 0 to 1 that assesses a model’s overall ability to correctly distinguish truthful from fraudulent reports. The ROC area of this tool is 0.89, which is significantly higher than the 0.5 benchmark (the value expected from random guessing) and also higher than the ROC areas of alternative fraud-detection tools.
    • The F-score from Dechow et al. (2011) is the second-best fraud-detection tool in terms of ROC area. The authors find that the F-score, a financial-based tool, can be used as a complement to their language-based tool.
    • The authors find that their tool has an advantage in predicting fraudulent interim reports. Through time-series analysis, they find a decline in the probability of truthful reporting in the two quarters preceding the fraud. They also find that including the change in probability significantly increases the model’s predictive power.
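
    For readers unfamiliar with the ROC area, the minimal sketch below computes it as the share of (fraudulent, truthful) report pairs in which the fraudulent report receives the higher fraud score, with ties counted as half. It also shows one plausible way to derive a quarter-over-quarter change in the probability of truthful reporting. The scores, labels, and function names are hypothetical and are not taken from the paper.

```python
def roc_area(fraud_scores, labels):
    """ROC area via pairwise comparison of fraudulent (1) and truthful (0) reports."""
    pos = [s for s, y in zip(fraud_scores, labels) if y == 1]
    neg = [s for s, y in zip(fraud_scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one report of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # fraudulent report ranked above truthful one
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def probability_changes(truth_probs):
    """Change in probability of truthful reporting between consecutive quarters."""
    return [later - earlier for earlier, later in zip(truth_probs, truth_probs[1:])]


# Hypothetical fraud scores for five reports (higher = more suspicious).
fraud_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
labels = [1, 1, 0, 0, 0]
area = roc_area(fraud_scores, labels)

# Hypothetical probabilities of truthful reporting over four consecutive quarters.
quarterly_truth_probs = [0.80, 0.75, 0.60, 0.50]
changes = probability_changes(quarterly_truth_probs)
```

    A model that assigns every report the same score gets an area of 0.5 under this definition, which is why 0.5 serves as the benchmark.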
    Risk & Risk Management - Including Fraud Risk
    Fraud Risk Assessment, Fraud Risk Models