Saturday, January 30, 2010

A cool SQL trace analyzer

My primary business is load testing and performance characterization but occasionally I am asked to analyze production instances that are behaving badly. This is a lot easier if you have load tested the application beforehand and know it's characteristics. Recently I was asked to look into a poorly performing application that my client's client implemented and then augmented with a custom web service that allows other devices to interface to the system. I'm being a little vague here to protect my client's privacy, but application details are not important to the story. So part of the application I know well, and part is a complete mystery. The web services connect directly to the database that I know well.

Fast forward to my first live session on the production instance, and it is apparent that this system is behaving differently than those I have tested. In my prior experiences with the application, a multi-tiered architecture, the database server was lightly used. On this instance, a 3 year old 8 core DB server was averaging 70-80% CPU busy which was an order of magnitude greater than what I would expect. The other tiers were relatively lightly loaded, so I focused on the database server (MS SQL Server 2005).

SQL server reports for top queries by total CPU and top query by average CPU showed one stored procedure using cursor processing was responsible for the majority of the CPU utilization. The stored procedure was modified to use more set processing and a few weeks later was QA'd and put into production. The result was disappointing. CPU utilization was barely changed. Whats going on here? Performance reports show that the stored procedure is now consuming 1/3 of the CPU resources it used to, but it is still the top consumer of CPU and CPU is still averaging about 70% busy.

One possible explanation is that demand so far exceeded capacity that by tuning the stored procedure I increased the throughput of the the system but the primary bottleneck remained DB server CPU. I have no way to measure throughput or overall system response times, so this will remain a theory at best. Still, I have a nagging feeling that I am missing something important. I decide to take a deeper look by generating a trace of activity using SQL profiler. This is a production system so I start by taking a short snapshot during a slow period of activity. There are 100s of spids (database sessions) active and I am flooded with trace information.

After sorting the data by CPU used or logical I/Os performed, I keep coming back to stored procedure we have already optimized. On whim, I started Googling looking for ideas or tools to help analyze the mountain of trace data before me. I stumbled onto a tool, Trace Analyzer from DBSophic that was available for free. OK, well the price is right so lets give it a try.

I downloaded the tool and pointed it at my trace files (about 50 MB worth of trace data) and a few minutes later the answer was staring me in the face. A query that never showed up on any of the reports bubbled to the top of the list.



After sorting by CPU I see that 95% of all CPU time is due to one query. This can not be a coincidence, but why didn't I notice this before? Both the top query by average CPU and top query by TOTAL CPU reports never mentioned this query. The reason why is pretty simple. While there were tens of thousands of these queries, they were similar, not identical. The selected columns , tables queried, and the the columns used in the where criteria were identical. The values used in the where criteria were all different. While the queries
SELECT someColumns FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = 'ABC' AND ROUTE_DATE = {TS '2010-01-28 00:00:00'} AND ROUTE_ID = 'DEF' AND INTERNAL_STOP_ID = 1
and
SELECT someColumns FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = 'GHI' AND ROUTE_DATE = {TS '2010-01-28 00:00:00'} AND ROUTE_ID = 'JKL' AND INTERNAL_STOP_ID = 2

are viewed as separate queries by the SQL Server's built in performance reports, the trace analyzer from DBSophic was able to recognize that this was indeed the same query and this was the key to understanding what was happening on this database instance. They are both instances of the query

SELECT someColums FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = value AND ROUTE_DATE = value AND ROUTE_ID = AND INTERNAL_STOP_ID = value
The database server is still overloaded but with the help of my client I have identified the source of these queries and I now understand why this particular instance of the application is behaving much differently from what we have tested in the lab. The next step is to determine if there is a way to reduce the demand by either issuing the query less (maybe some caching at the web service) or making it more efficient by tuning. If neither of these are possible then the problem will need to be addressed by increasing CPU capacity at the database server.

This tool helped me find patterns in trace data that would have otherwise gone unnoticed. Thanks DBSophic! Great tool.

(P.S. - I have no relationship with DBSophic other than being a satisfied user of their tool)

Monday, January 4, 2010

Response time statistics using SQL

I typically use Excel for statistical analysis of response times. Occasionally, for very large data sets, I use SQL. Here is some handy SQL to calculate min, max, average, std deviation, and 95th percentiles from response time data. The response time data for this example came from a test where I had established steady state periods of activity at 300, 600, 900, and 1200 virtual users.

This method doesn't do any sort of interpolation and therefore requires at least 100 samples to give a reasonable result for percentiles. The base table of timers consists of the columns timername (type varchar), activeusers (number of active users at the time the timer was recorded, type int), and the 'elapsed time' of the event timed as a floating point number.

---------------------------------------------------------------------------------------------

select
    timername as 'Timer Name',
    activeusers as 'Active Users',
    count(elapsedtime) as 'Count',
    avg(elapsedtime) as 'Average',
    stdev(elapsedtime) as 'StdDev',
    min(elapsedtime) as 'Min',
    max(elapsedtime) as 'Max',
    (select max(elapsedtime) from
        (select top 95 percent elapsedtime
         from timers b
             where b.timername = a.timername and b.activeusers = a.activeusers
             order by b.elapsedtime asc) as elapsedtime) as '95th Percentile'
from timers a
    where a.activeusers in (300,600,900,1200)
    group by a.timername, a.activeusers
    order by a.timername asc, a.activeusers asc

---------------------------------------------------------------------------------------------

After executing the query, I cut and paste the output to Excel.




Graphing count vs VUs gives show how throughput varies with load. Here is a graph of home page throughput.



And a graph of 95th percentile homepage response time vs VUs which indicates some sort of bottleneck above 900 VUs.