Friday, November 26, 2010

Pro tip: When searching for a needle in a haystack, use strong magnets.

I was recently hired to diagnose a performance problem for a client having a server farm in Florida and remote users throughout the continental US and Alaska. Their application is used to plan and manage routes for delivery vehicles, is graphically intense, and has a number of compute intensive background processes for routing and mapping. It also places a moderate load on the database server. One would expect some occasional UI lagging from a graphically intense app given that the Citrix farm is located a few thousand miles away, but they were also experiencing five to ten minute delays when saving the states of their routing sessions. The Citrix servers were located on the same LAN as the database server, and other than presentation graphics, not much else was being sent across the WAN. One of the biggest challanges was that the system would perform normally for days at a time until out of the blue, usually late in the evening, users in California would report 5-10 minute delays when saving large amounts of data.

After giving up trying to react fast enough to catch the fault in progress I was asked to do a comprehensive audit on their cluster to find the problem. As part of the audit, I wanted to capture and analyze SQL traces and O/S performance counters. Since the problem was very intermittent, I decided to collect data 24 hours a day and hope that I got lucky. The client was very nervous about collecting SQL trace data for long periods of time. I decided to set up SQL trace collections each night between 8pm and 11pm. For OS counters, I started collecting perfmon stats for all objects with a 5 minute interval with the intent of letting it run 7X24 until I captured at least one major delay.

As luck would have it, a long delay occurred the second night of SQL tracing. Great! Now at least I have something to analyze. Lets see now, perf counters for 8 servers, all objects, 24 hours of data. Wow, what was I thinking? I collected the data in .csv files and Excel is complaining it can't load all the data. Here I turned to a tool from Microsoft call PAL. It analyzes perfmon data, applies tribal knowledge to thresholds, and produces a report with commentary and graphs. The reports include derived values (usually ratios) that are further scrutinized along with links to articles giving background information related to the commentary. PAL warrants a separate post which I will do later, but for now it is a great new addition to my tool kit and you can read more about it here and  here. The reports are in HTML or XML format. You can view a sample here (which I converted to PDF format for this post.)

The OS counters indicated that at the time of the delay there was a great deal of I/O on, and relatively poor service times form, the database data disks. The other servers showed no significant spike in utilization. I was also able to glean from the PAL report that it was a low number of SPIDs within the database causing the spike in I/O.

Armed with this information, I decided to ignore the other 7 servers and focus on the database server. I turn to the SQL trace data I collected and found 1.5 GB of data. I've previously blogged about a cool SQL trace analyzer from DBSophic, but I'd only analyzed small traces ( less than 100MB) of data, and it took hours to do that. Coincidently, a week prior this I had received an offer from the CTO of DBSophic to evaluate the  next generation SQL trace analyzer; Qure Workload Analyzer (QWA). The tease was "Calling QWA a successor to Trace Analyzer is like calling an F-35 stealth fighter a successor to Wright’s first airplane, well - both can fly…". Given that the old tool would have taken days to process 1.5GB of trace data, I'm kind of hoping that this tools has performance a lot closer to an F35 than the Wright Flyer.

Normally, I'd provide a link here, but I was given a trial version. I'll edit this post when it becomes generally available or you can check DBSophic's web site yourself to inquire about or download QWA. I launch QWA and read in a few trace files. Its fast, a lot faster than I remember. I didn't do any comprehensive performance tests of the last version, but I do remember the run times were not linear with volume. In plain english, if you doubled the amount of data, it took a lot longer than twice the time to process. I ran a few quick tests on both versions using small amounts of data (50MB or so) and found the new tool was 4 to 5 times faster. I was able to process 1.5GB of trace data (thats 3 hours of SQL trace data spread over 319 trace files) in just under 25 minutes. A very nice improvement! I could not have used the old tool to process this much data, so my first observation is that the tool IS, metaphorically speaking,  a lot closer to an F35 in performance. Without it, this project may have crashed before getting off the runway!

The old version of Trace Analyzer was able to group similar queries and aggregate useful statistics like run time, CPU utilization, and I/O rates. QWA takes analysis to a new level. New features include multiple grouping options, smart filters, improved performance, and the ability to compare workloads (with normalization and weighting options... which I haven't tried yet). Ok, sorry if I am starting to sound like a salesman, but I am really impressed with the ability and usability of this tool.

I was able to use QWA to sift through the data and I quickly found a delete statement that took 5 minutes and two queries all of which were being executed around the same time the slow down was reported. Finding these three statements was the key to identifying the problem.  I don't want to think about how long this would have taken without QWA. The delete statement was coming from the user doing the save. Both queries looked to be related to ad hoc reports. I ran the first query in parallel with the Save Session function that was running so slow, and it had no effect. I ran the second query and the Save Session operation did not complete until after the query completed. Finally! After months of searching, I can finally reproduce the problem.

The query was a five way join over a large amount of data and took about five minutes to complete. The query was taking out share locks in a number of tables that the Save function was trying to modify, hence the blocking. The odd thing was that SQL server showed the blocked SPID as suspended, not blocked (and did not show up on the blocked transaction report delivered with SQL server management studio.) I'm still scratching my head about that, but we have at least found the root cause of the problem which was a consequence of the default strategy that SQL server uses to insure read consistency. There is a good description of read consistency and isolation levels here.

My sincere thanks to the folks at DBSophic. Great people and great tools!

(P.S. - I'll may update this post with pics and some examples QWA's features once the tool becomes generally available.)

Saturday, January 30, 2010

A cool SQL trace analyzer

My primary business is load testing and performance characterization but occasionally I am asked to analyze production instances that are behaving badly. This is a lot easier if you have load tested the application beforehand and know it's characteristics. Recently I was asked to look into a poorly performing application that my client's client implemented and then augmented with a custom web service that allows other devices to interface to the system. I'm being a little vague here to protect my client's privacy, but application details are not important to the story. So part of the application I know well, and part is a complete mystery. The web services connect directly to the database that I know well.

Fast forward to my first live session on the production instance, and it is apparent that this system is behaving differently than those I have tested. In my prior experiences with the application, a multi-tiered architecture, the database server was lightly used. On this instance, a 3 year old 8 core DB server was averaging 70-80% CPU busy which was an order of magnitude greater than what I would expect. The other tiers were relatively lightly loaded, so I focused on the database server (MS SQL Server 2005).

SQL server reports for top queries by total CPU and top query by average CPU showed one stored procedure using cursor processing was responsible for the majority of the CPU utilization. The stored procedure was modified to use more set processing and a few weeks later was QA'd and put into production. The result was disappointing. CPU utilization was barely changed. Whats going on here? Performance reports show that the stored procedure is now consuming 1/3 of the CPU resources it used to, but it is still the top consumer of CPU and CPU is still averaging about 70% busy.

One possible explanation is that demand so far exceeded capacity that by tuning the stored procedure I increased the throughput of the the system but the primary bottleneck remained DB server CPU. I have no way to measure throughput or overall system response times, so this will remain a theory at best. Still, I have a nagging feeling that I am missing something important. I decide to take a deeper look by generating a trace of activity using SQL profiler. This is a production system so I start by taking a short snapshot during a slow period of activity. There are 100s of spids (database sessions) active and I am flooded with trace information.

After sorting the data by CPU used or logical I/Os performed, I keep coming back to stored procedure we have already optimized. On whim, I started Googling looking for ideas or tools to help analyze the mountain of trace data before me. I stumbled onto a tool, Trace Analyzer from DBSophic that was available for free. OK, well the price is right so lets give it a try.

I downloaded the tool and pointed it at my trace files (about 50 MB worth of trace data) and a few minutes later the answer was staring me in the face. A query that never showed up on any of the reports bubbled to the top of the list.

After sorting by CPU I see that 95% of all CPU time is due to one query. This can not be a coincidence, but why didn't I notice this before? Both the top query by average CPU and top query by TOTAL CPU reports never mentioned this query. The reason why is pretty simple. While there were tens of thousands of these queries, they were similar, not identical. The selected columns , tables queried, and the the columns used in the where criteria were identical. The values used in the where criteria were all different. While the queries

are viewed as separate queries by the SQL Server's built in performance reports, the trace analyzer from DBSophic was able to recognize that this was indeed the same query and this was the key to understanding what was happening on this database instance. They are both instances of the query

The database server is still overloaded but with the help of my client I have identified the source of these queries and I now understand why this particular instance of the application is behaving much differently from what we have tested in the lab. The next step is to determine if there is a way to reduce the demand by either issuing the query less (maybe some caching at the web service) or making it more efficient by tuning. If neither of these are possible then the problem will need to be addressed by increasing CPU capacity at the database server.

This tool helped me find patterns in trace data that would have otherwise gone unnoticed. Thanks DBSophic! Great tool.

(P.S. - I have no relationship with DBSophic other than being a satisfied user of their tool)

Monday, January 4, 2010

Response time statistics using SQL

I typically use Excel for statistical analysis of response times. Occasionally, for very large data sets, I use SQL. Here is some handy SQL to calculate min, max, average, std deviation, and 95th percentiles from response time data. The response time data for this example came from a test where I had established steady state periods of activity at 300, 600, 900, and 1200 virtual users.

This method doesn't do any sort of interpolation and therefore requires at least 100 samples to give a reasonable result for percentiles. The base table of timers consists of the columns timername (type varchar), activeusers (number of active users at the time the timer was recorded, type int), and the 'elapsed time' of the event timed as a floating point number.


    timername as 'Timer Name',
    activeusers as 'Active Users',
    count(elapsedtime) as 'Count',
    avg(elapsedtime) as 'Average',
    stdev(elapsedtime) as 'StdDev',
    min(elapsedtime) as 'Min',
    max(elapsedtime) as 'Max',
    (select max(elapsedtime) from
        (select top 95 percent elapsedtime
         from timers b
             where b.timername = a.timername and b.activeusers = a.activeusers
             order by b.elapsedtime asc) as elapsedtime) as '95th Percentile'
from timers a
    where a.activeusers in (300,600,900,1200)
    group by a.timername, a.activeusers
    order by a.timername asc, a.activeusers asc


After executing the query, I cut and paste the output to Excel.

Graphing count vs VUs gives show how throughput varies with load. Here is a graph of home page throughput.

And a graph of 95th percentile homepage response time vs VUs which indicates some sort of bottleneck above 900 VUs.