Un-cooking the books with Benford’s Law
So, if you take a set of real life numeric data (e.g. sales figures, customer sat scores, baseball game attendance figures) stripped the first digit off each number and counted those up what would the distribution of numbers be? i.e how many 7s would there be? How many 2s? Would there be a pattern to the distribution? The answer is actually yes. It’s known to many number crunchers as Benford’s Law.
Here's the Wikipedia definition: Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty.
Basically the law means that in many real-life (i.e. non-made-up or random) data the distribution of the first digit in a series of numbers will often look very similar to the graph below (also from Wikipedia).
My CPA brother-in-law refers to “running the Benfords” when on auditing gigs in order to perform a quick acid test on pertinent sets of numbers to see if there is something that warrants further investigation. If the distribution doesn’t look similar to the graph above then he looks a little closer; maybe someone’s been cooking the books.
As DBAs, BI pros (and maybe some former accountants) we have plenty of real world data at our fingertips. Why not test Mr. Benford out to see if our data conforms. Benford’s law states that the data has to be real-life so let’s test it by applying the theory to three different data sets:
- Randomly generated numbers
- Made up numbers (AdventureWorksDW2008 sales figures)
- Real data from a real life data source
Random Numbers
I’ve created a simple script to create set of randomly generated numbers and perform a basic Benford analysis of the results. Here’s what it does:
- Creates and populates a table variable with 65,536 random numbers
- Populates CTE with the random numbers and a separate column holding the first digit of each number
- Counts the instances of each digit and returns the results in a table including a simple histogram column for numbers 1-9 (excluding any 0 digit values sometimes returned by my simplistic random number algorithm)
USE tempdb;
GODECLARE @RandomNumbers TABLE
(
RandomNumber INT
);-- Use Itzik Ben Gan's technique to quickly generate 65536 records
WITH
n5(x) AS (SELECT 1 UNION SELECT 0),
n4(x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
n3(x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
n2(x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
n1(x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x)-- Create 65536 random numbers
INSERT INTO @RandomNumbers
SELECT
ABS(CHECKSUM(NEWID())) % 100000 AS RandomNumber
FROM n1;-- CTE containing each random number and its first digit
WITH BenfordTest (FirstDigit, RandomNumber) AS
(
SELECT
SUBSTRING(CAST(RandomNumber AS VARCHAR(MAX)), 1,1) AS FirstDigit
,RandomNumber
FROM @RandomNumbers
)-- Count how many instances of each number there is from 1-9
SELECT
FirstDigit
,COUNT(*) AS InstanceCount
,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
(SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
WHERE FirstDigit > 0 -- exclude any zero digit records
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;
And here are the results. Because these are random numbers you can see that the PctDistribution column, and the Excel graph I created is pretty much uniformly distributed between about 10.8% and 11.5% for all leading digits.
Made Up Numbers (AdventureWorksDW2008)
Now we all know that AdventureWorks data is not real, but is there any “realness” to it at all? Probably not. We’ll test against the SalesAmount column in the FactInternetSales table.
BTW, analysis like this is a great excuse to use the TABLESAMPLE clause in order to avoid querying the entire table.
USE AdventureWorksDW2008;
GOWITH BenfordTest (FirstDigit) AS
(
SELECT
SUBSTRING(CAST(SalesAmount AS VARCHAR(MAX)), 1,1) AS FirstDigit
FROM FactInternetSales
TABLESAMPLE (20 PERCENT)
)SELECT
FirstDigit
,COUNT(*) AS InstanceCount
,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
(SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;
Real Numbers
Here are results of a query I ran on some real data I have available to me. Trust me, it is real data but, naturally, I can’t share it. The distribution looks about right, too. Kinda cool, eh?
You can try the sample code out yourself on some data that you have access to.
Here’s the code stub:
WITH BenfordTest (FirstDigit) AS
(
SELECT
SUBSTRING(CAST(<NumberFieldName> AS VARCHAR(MAX)), 1,1) AS FirstDigit
FROM <SourceTableName>
TABLESAMPLE (20 PERCENT)
)SELECT
FirstDigit
,COUNT(*) AS InstanceCount
,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
(SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC
The complete code from this post can be downloaded here.
2 comments:
Nice, I had forgotten about Benford's Law. If you don't mind I'm going to write about your blog post over on blog.datadrivenconsulting.com
Thanks Alex. Please go right ahead, reference whatever you want :)
Cheers,
Nick
Post a Comment