Speaking the language of business intelligence with an Australian accent

Monday, October 26, 2009

Un-cooking the books with Benford’s Law

So, if you take a set of real life numeric data (e.g. sales figures, customer sat scores, baseball game attendance figures) stripped the first digit off each number and counted those up what would the distribution of numbers be? i.e  how many 7s would there be? How many 2s? Would there be a pattern to the distribution? The answer is actually yes. It’s known to many number crunchers as Benford’s Law.

Here's the Wikipedia definition: Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty.

Basically the law means that in many real-life (i.e. non-made-up or random) data the distribution of the first digit in a series of numbers will often look very similar to the graph below (also from Wikipedia).

Benford GraphMy CPA brother-in-law refers to “running the Benfords” when on auditing gigs in order to perform a quick acid test on pertinent sets of numbers to see if there is something that warrants further investigation. If the distribution doesn’t look similar to the graph above then he looks a little closer; maybe someone’s been cooking the books.

As DBAs, BI pros (and maybe some former accountants) we have plenty of real world data at our fingertips. Why not test Mr. Benford out to see if our data conforms. Benford’s law states that the data has to be real-life so let’s test it by applying the theory to three different data sets:

  • Randomly generated numbers
  • Made up numbers (AdventureWorksDW2008 sales figures)
  • Real data from a real life data source

 

 

 

 

Random Numbers

I’ve created a simple script to create set of randomly generated numbers and perform a basic Benford analysis of the results. Here’s what it does:

  1. Creates and populates a table variable with 65,536 random numbers
  2. Populates CTE with the random numbers and a separate column holding the first digit of each number

    First Digit
  3. Counts the instances of each digit and returns the results in a table including a simple histogram column for numbers 1-9 (excluding any 0 digit values sometimes returned by my simplistic random number algorithm)

USE tempdb;
GO

DECLARE @RandomNumbers TABLE
(
    RandomNumber INT
);

-- Use Itzik Ben Gan's technique to quickly generate 65536 records
WITH 
  n5(x) AS (SELECT 1 UNION SELECT 0),
  n4(x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
  n3(x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
  n2(x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
  n1(x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x)

-- Create 65536 random numbers
INSERT INTO @RandomNumbers
SELECT
    ABS(CHECKSUM(NEWID())) % 100000 AS RandomNumber
FROM n1;

-- CTE containing each random number and its first digit
WITH BenfordTest (FirstDigit, RandomNumber) AS
(
    SELECT
         SUBSTRING(CAST(RandomNumber AS VARCHAR(MAX)), 1,1) AS FirstDigit
        ,RandomNumber
    FROM @RandomNumbers
)

-- Count how many instances of each number there is from 1-9
SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
WHERE FirstDigit > 0 -- exclude any zero digit records
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;

And here are the results. Because these are random numbers you can see that the PctDistribution column, and the Excel graph I created is pretty much uniformly distributed between about 10.8% and 11.5% for all leading digits.

SQLResultsRandom  GraphRandom

 

Made Up Numbers (AdventureWorksDW2008)

Now we all know that AdventureWorks data is not real, but is there any “realness” to it at all? Probably not. We’ll test against the SalesAmount column in the FactInternetSales table.

BTW, analysis like this is a great excuse to use the TABLESAMPLE clause in order to avoid querying the entire table.

USE AdventureWorksDW2008;
GO

WITH BenfordTest (FirstDigit) AS
(
    SELECT
         SUBSTRING(CAST(SalesAmount AS VARCHAR(MAX)), 1,1) AS FirstDigit
    FROM FactInternetSales
    TABLESAMPLE (20 PERCENT)
)

SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;

OK, I guess that proves just how real AdventureWorks data really is. This provides some pretty good evidence that AW numbers were not generated randomly and were almost certainly not based on real sales figures. On the other hand maybe the regional cycling gear reps are fudging the numbers a bit….
 
 SQLResultsAW GraphAW

Real Numbers

Here are results of a query I ran on some real data I have available to me. Trust me, it is real data but, naturally, I can’t share it. The distribution looks about right, too. Kinda cool, eh?

  SQLResultsRealData GraphRealData

You can try the sample code out yourself on some data that you have access to.

Here’s the code stub:

WITH BenfordTest (FirstDigit) AS
(
    SELECT
         SUBSTRING(CAST(<NumberFieldName> AS VARCHAR(MAX)), 1,1) AS FirstDigit
    FROM <SourceTableName>
    TABLESAMPLE (20 PERCENT)
)

SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC

The complete code from this post can be downloaded here.

2 comments:

Alex Kerin said...

Nice, I had forgotten about Benford's Law. If you don't mind I'm going to write about your blog post over on blog.datadrivenconsulting.com

Nick Barclay said...

Thanks Alex. Please go right ahead, reference whatever you want :)

Cheers,
Nick