Monday, October 26, 2009

Un-cooking the books with Benford’s Law

So, if you take a set of real life numeric data (e.g. sales figures, customer sat scores, baseball game attendance figures) stripped the first digit off each number and counted those up what would the distribution of numbers be? i.e  how many 7s would there be? How many 2s? Would there be a pattern to the distribution? The answer is actually yes. It’s known to many number crunchers as Benford’s Law.

Here's the Wikipedia definition: Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty.

Basically the law means that in many real-life (i.e. non-made-up or random) data the distribution of the first digit in a series of numbers will often look very similar to the graph below (also from Wikipedia).

Benford GraphMy CPA brother-in-law refers to “running the Benfords” when on auditing gigs in order to perform a quick acid test on pertinent sets of numbers to see if there is something that warrants further investigation. If the distribution doesn’t look similar to the graph above then he looks a little closer; maybe someone’s been cooking the books.

As DBAs, BI pros (and maybe some former accountants) we have plenty of real world data at our fingertips. Why not test Mr. Benford out to see if our data conforms. Benford’s law states that the data has to be real-life so let’s test it by applying the theory to three different data sets:

  • Randomly generated numbers
  • Made up numbers (AdventureWorksDW2008 sales figures)
  • Real data from a real life data source

 

 

 

 

Random Numbers

I’ve created a simple script to create set of randomly generated numbers and perform a basic Benford analysis of the results. Here’s what it does:

  1. Creates and populates a table variable with 65,536 random numbers
  2. Populates CTE with the random numbers and a separate column holding the first digit of each number

    First Digit
  3. Counts the instances of each digit and returns the results in a table including a simple histogram column for numbers 1-9 (excluding any 0 digit values sometimes returned by my simplistic random number algorithm)

USE tempdb;
GO

DECLARE @RandomNumbers TABLE
(
    RandomNumber INT
);

-- Use Itzik Ben Gan's technique to quickly generate 65536 records
WITH 
  n5(x) AS (SELECT 1 UNION SELECT 0),
  n4(x) AS (SELECT 1 FROM n5 CROSS JOIN n5 AS x),
  n3(x) AS (SELECT 1 FROM n4 CROSS JOIN n4 AS x),
  n2(x) AS (SELECT 1 FROM n3 CROSS JOIN n3 AS x),
  n1(x) AS (SELECT 1 FROM n2 CROSS JOIN n2 AS x)

-- Create 65536 random numbers
INSERT INTO @RandomNumbers
SELECT
    ABS(CHECKSUM(NEWID())) % 100000 AS RandomNumber
FROM n1;

-- CTE containing each random number and its first digit
WITH BenfordTest (FirstDigit, RandomNumber) AS
(
    SELECT
         SUBSTRING(CAST(RandomNumber AS VARCHAR(MAX)), 1,1) AS FirstDigit
        ,RandomNumber
    FROM @RandomNumbers
)

-- Count how many instances of each number there is from 1-9
SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
WHERE FirstDigit > 0 -- exclude any zero digit records
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;

And here are the results. Because these are random numbers you can see that the PctDistribution column, and the Excel graph I created is pretty much uniformly distributed between about 10.8% and 11.5% for all leading digits.

SQLResultsRandom  GraphRandom

 

Made Up Numbers (AdventureWorksDW2008)

Now we all know that AdventureWorks data is not real, but is there any “realness” to it at all? Probably not. We’ll test against the SalesAmount column in the FactInternetSales table.

BTW, analysis like this is a great excuse to use the TABLESAMPLE clause in order to avoid querying the entire table.

USE AdventureWorksDW2008;
GO

WITH BenfordTest (FirstDigit) AS
(
    SELECT
         SUBSTRING(CAST(SalesAmount AS VARCHAR(MAX)), 1,1) AS FirstDigit
    FROM FactInternetSales
    TABLESAMPLE (20 PERCENT)
)

SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC;

OK, I guess that proves just how real AdventureWorks data really is. This provides some pretty good evidence that AW numbers were not generated randomly and were almost certainly not based on real sales figures. On the other hand maybe the regional cycling gear reps are fudging the numbers a bit….
 
 SQLResultsAW GraphAW

Real Numbers

Here are results of a query I ran on some real data I have available to me. Trust me, it is real data but, naturally, I can’t share it. The distribution looks about right, too. Kinda cool, eh?

  SQLResultsRealData GraphRealData

You can try the sample code out yourself on some data that you have access to.

Here’s the code stub:

WITH BenfordTest (FirstDigit) AS
(
    SELECT
         SUBSTRING(CAST(<NumberFieldName> AS VARCHAR(MAX)), 1,1) AS FirstDigit
    FROM <SourceTableName>
    TABLESAMPLE (20 PERCENT)
)

SELECT
     FirstDigit
    ,COUNT(*) AS InstanceCount
    ,REPLICATE('|', 100. * (CAST(COUNT(*) AS NUMERIC(5,0)) / b2.Total)) AS PctDistribution
FROM BenfordTest b1,
    (SELECT CAST(COUNT(*) AS Numeric(5,0)) Total FROM BenfordTest) b2
GROUP BY FirstDigit, b2.Total
ORDER BY FirstDigit ASC

The complete code from this post can be downloaded here.

Wednesday, October 21, 2009

Conditionally Hiding Axes for Trellis Displays

Over the last year MVP Tim Kent has put out a series of really useful posts showing how various data visualizations can be created using SSRS.

The latest post on Trellis displays got me thinking on how I could tweak a few of the settings in Tim’s very useful sample report just a bit more.

In order to show more sample data, I changed the top axis of Tim’s sample report to show sales Bike subcategories because as we all know AdventureWorks sells waaaay more bikes than anything else. Below is a shot of the original report after that change.

TimKentTrellis

I made a few more tweaks and changes and came up with the report below

Trellis

The main ink-saving tip is to conditionally hide / show labels on the X and Y axes based on the items at the top left and bottom left of the trellis. In this case it is Road Bikes and Northeast. All that is needed here is a small amount of extra MDX to ORDER and RANK members in both the Region and Subcategory sets to provide the right meta data required to perform the conditional hide / show. Here is the MDX for the report showing the ordering and ranking of the appropriate sets.

WITH
  SET [SalesOrderedSubcategories] AS
    Order
    (
      [Product].[Product Categories].[Category].[Bikes].Children
     ,[Measures].[Sales Amount]
     ,BDESC
    )
  SET [SalesOrderedRegions] AS
    Order
    (
      [Sales Territory].[Sales Territory].[Region].MEMBERS
     ,[Measures].[Sales Amount]
     ,BDESC
    )
  MEMBER [Measures].[SubcategoryRank] AS
    Rank
    (
      [Product].[Product Categories].CurrentMember
     ,[SalesOrderedSubcategories]
    )
  MEMBER [Measures].[RegionRank] AS
    Rank
    (
      [Sales Territory].[Sales Territory].CurrentMember
     ,[SalesOrderedRegions]
    )
SELECT
  {
    [Measures].[Sales Amount]
   ,[Measures].[SubcategoryRank]
   ,[Measures].[RegionRank]
  } ON COLUMNS
,(
    [SalesOrderedSubcategories]
   ,[SalesOrderedRegions]
   ,[Date].[Calendar Quarter of Year].[Calendar Quarter of Year].MEMBERS
  ) ON ROWS
FROM [Adventure Works];
Set the SORT property of each group to its respective GroupNameRank calculated member, this way we can be sure that in our example the top ranked Subcategory (Road Bikes) will be the left-most item and the lowest ranked Region (Southwest) will be the bottom-most item in the trellis.

Sorting 

The real trick here is to use an expression to conditionally hide / show the axis labels so that we only see the X-axis labels at the bottom of the trellis and Y-axis labels on the left side of the trellis.

AxisLabels

Y-axis uses an expression that only shows the axis for the MIN ranked member for Subcategory

=IIF(
    Fields!SubcategoryRank.Value = MIN(Fields!SubcategoryRank.Value, "Trellis")
    , false
    , true
)

and X-axis uses the MAX ranked member for Region

=IIF(
    Fields!RegionRank.Value = MAX(Fields!RegionRank.Value, "Trellis")
    , false
    , true
)

It also helps to check the Hide first and last labels along this axis for the Y-axis as the zero value tends the throw off the alignment with the other charts where the X-axis is hidden.

HideLablels

The sample .rdl file can be downloaded here.

Friday, October 16, 2009

Windows Problem Steps Recorder

Been meaning to write about this one for some time. This year’s TechEd USA keynote was centered around Windows Server 2008 R2. The speakers demoed plenty of good features but the one that stood out (and easily got the most spontaneous applause) was the Problem Steps Recorder. It does exactly as its name suggests: it records problem steps.

The PSR is a really simple app that is baked into both Windows Server 2008 R2 and Windows 7. It’s actually not that easy to find unless you know what you’re looking for – the Start menu doesn’t even list the app by its name.

StartMenu 

The UI couldn’t be simpler as seen below.

image

The user opens the app, presses Start Record, performs the steps that reproduce a problem they’re experiencing (including their own comments if needed), then presses Stop Record. PSR then asks where they want to save the results. The output produced is a zip archive containing a single MHT file. The user then emails the zip archive to the helpdesk or whoever is trying to help them.

The generated MHT file is a navigable, screenshot-by-screenshot, annotated document of each step the user performed while recording. Below is an example step explanation with screenshot.

image

Here is a text summary of each step without screenshots that is found at the bottom of each MHT file.

image

OK, so why does PSR have to be just a helpdesk tool? Like many others in the audience I thought of this as a potential quick & dirty documentation tool. Admittedly out of the box the text produced describes every step as a “Problem” but once you look past that this could be a great little time saver in a number of situations.

Here’s an example output file that walks through creating an SSRS project and adding a new report in BIDS.

Data Dude Error TSD03006 – Explicit Database Reference

Problem

You’ve just created a new Data Dude project and imported your DB schema, or you’ve just synchronized schemas with an existing project having added several new views / UDFs / SPROCs etc. When you try to build your project you find that there are tons of TSD03006 errors that are stopping you. Why?

TSD03006: View: [dbo].[vFactResellerSales] contains an unresolved reference to an object. Either the object does not exist or the reference is ambiguous because it could refer to any of the following objects: [dbo].[DimProduct].[s]::[ProductKey], [dbo].[FactResellerSales].[ProductKey] or [dbo].[FactResellerSales].[s]::[ProductKey]

TSD03006

Solution

Ensure any explicit database name references do not exist in your code. If you need to explicitly reference a database do it by using variables.

Removing the explicit DB references from the code should make the TDS03006 errors disappear. Admittedly you should be cleaning up any explicit DB references within your code, but when whipping up a new script in the early hours of the morning this may be something that you’d miss. And suddenly getting a whole heap of build-blocking errors at 3am can be incredibly annoying. 

Sometimes explicit DB references can creep into code without you catching it. For example when creating views from large tables I often start by scripting out a SELECT statement within SSMS. The code that is generated by SSMS includes an explicit reference to the database.

SELECT [ProductKey]
       ,[OrderDateKey]
       ,[DueDateKey]
       ,[ShipDateKey]
       ,[ResellerKey]
       -- Shortened for brevity

       ,[CustomerPONumber]
FROM   [AdventureWorksDW2008].[dbo].[FactResellerSales]

It’s the explicit [AdventureWorksDW2008] reference that causes the TSD03006 errors. If you add to the SSMS generated code and join other tables but keep even one explicit DB reference you will continue to receive TSD03006 errors for each field in the view definition.

Data Dude is indeed bringing a valid problem to our attention but I think there should be a more elegant way to communicate the situation. One view containing 37 fields with 1 explicit DB reference returns 37 errors. In the end the fix is easy, but figuring out the fix takes more time than it should based on the content and volume of the error messages.

BTW make sure that you’re using Data Dude GDR R2 version 9.1.40413.00. I found an MS support document that detailed a similar issue pertaining to using Server and Database aliases in referenced projects that is fixed in this release. This issue is similar but different in that it’s not a direct problem with aliases or referenced projects, just careless coding on my part.

Thursday, October 15, 2009

Cloud-based Tabular Data Sources

I realize there are many out there who are sick of the term “the cloud”. Larry Ellison's rant on this topic is great.

Nonetheless I got my Azure invitation yesterday and for no other reason other than it’s geeky, tried to access it from PPS. As I’d hoped, it was simple (as was connecting using SSMS). I set up a sample DB in my allotted condensed water vapor storage area and created a tabular data source using the appropriate connection string. Easy!

ConnectionString

…and here’s the data in it

DataSourcePreview

Hopefully cloud-based SSAS is not far away.

Windows 7 and PerformancePoint M&A Setup Gotcha

For those who have, or are about to, upgrade to Windows 7 and reinstall a local instance of PPS for the purposes of demos / experimentation / learning etc. here are a couple of quick tips on getting things up and running.

  1. Before installing the M&A Server be sure to read this post and ensure all your pre-reqs are in place.
  2. Install the latest hotfix (I understand that PPS SP3 coming out in a couple of days)

UPDATE: SP3 is now available so skip step 2 above and install SP3 instead

x86: http://www.microsoft.com/downloads/details.aspx?FamilyID=90c596a5-aca4-4ded-9072-facf834bc0c6&displaylang=en

x64: http://www.microsoft.com/downloads/details.aspx?FamilyID=3ad75ae5-d2cd-4953-87cf-5f74d79804c6&displaylang=en

And now the Win7 twist. Once you have installed, configured the M&A server and fired up Dashboard Designer (DD) you receive the following message when trying to create a data source:

The requested item cannot be found. Verify that it exists and that you have access permissions.

OK, so this is an application pool account problem? Nope. The problem is you’re not an administrator on this Monitoring Server yet. So you go into the administrative section in DD and add your account. When you to the options section and try to administer the server you get this error:

You do not have Administrator privileges on the Monitoring Server ‘http://localhost:40000/WebService/PMService.axmx’. Contact your Monitoring Server Administrator.

And herein lies the problem. You’re not a monitoring server administrator yet, but how can you make yourself one if you can’t get into the admin screen?

The solution:

You need explicitly run Dashboard Designer as an administrator (“but I AM an administrator!”). By default you’re not running this application as an admin. This is related to the UAC settings in Win7. Sure you can alter (turn off) the UAC settings but you may not want to, or be allowed to, in some cases.

How do you run DD as an admin? The Start menu item that is created for DD is only a link to the ClickOnce launch URL so you won’t find the appropriate “Run as administrator” option on the context menu if you SHIFT + Right-click on it. In order to run DD as an administrator find the DD executable (PSCBuilder.exe) in the file system. On a default install it will be located in %Program Files\Microsoft Office PerformancePoint Server\3.0\Monitoring\PPSMonitoring_1\DesignerInstall\3.0

Right-click the executable and “Run as administrator” from there.

RunAsAdmin

In this administrator instance of Dashboard Designer go into the Server Options section and add yourself as an administrator.

You should be good to go from there.

Sunday, October 11, 2009

Book Review: Expert Cube Development with Analysis Services 2008

Full disclosure: the authors of this book provided me with a free review copy of this book.

ExpertCubeDevelopmentSSAS2008

As the title suggests Expert Cube Development with Microsoft SQL Server 2008 Analysis Services is not a book for SSAS beginners. This book fills a need that has been out there for a while when it comes to Analysis Services publications: a concentrated volume focused on enhancing the knowledge of the experienced SSAS pro. From the outset the authors assume the reader already has experience with the product, cover a few ground rules and get right down to business.

The amount of real world SSAS implementation experience shared between Alberto, Marco & Chris shines through indicating just how much work they’ve collectively done with Analysis Services. Many technical books have advanced sections or whole chapters dedicated to more advanced development techniques and tips. Being pitched as an expert book enables this level of content to pretty much fill the entire publication.

There is a lot of goodness jammed into the book’s 320-ish pages. You can sense that the authors tried hard to fit as many tips, tricks and techniques into each chapter as possible without bloating the text. They do not waste page space explaining the simple stuff because, if you’re reading this book, you should know the simple stuff. Each chapter remains concise and tells you what you need to know and where to go if you want to find out more by means of links to blog posts, white papers and other books as well as downloadable sample code.

My only criticism is a somewhat superficial one and is probably directed more at the book’s editor than its authors. There were no reference numbers and caption text underneath any of the screenshots, tables and figures at all. The non-textual items seemed naked without them and this made the end product seem a little less polished. As a reader I prefer it when the text points specifically to “Figure 1.2” instead of “the screenshot”. On some occasions the text didn’t even make direct reference to the item that appeared on the page with it, the relationship was implied by proximity.

If you have not worked much with SSAS yet then this is not a book you should be starting with. If, however, you’ve been working with the product and want to ensure you’re squeezing every last bit of performance out of your OLAP databases, this is a book you’ll want to read cover to cover. Even the most seasoned SSAS experts will come across material or techniques they did not know of or had forgotten about.