Friday, November 26, 2010

Pro tip: When searching for a needle in a haystack, use strong magnets.

I was recently hired to diagnose a performance problem for a client with a server farm in Florida and remote users throughout the continental US and Alaska. Their application is used to plan and manage routes for delivery vehicles, is graphically intense, and has a number of compute-intensive background processes for routing and mapping. It also places a moderate load on the database server. One would expect occasional UI lag from a graphically intense app given that the Citrix farm is located a few thousand miles away, but they were also experiencing five- to ten-minute delays when saving the states of their routing sessions. The Citrix servers were on the same LAN as the database server, and other than presentation graphics, not much else was being sent across the WAN. One of the biggest challenges was that the system would perform normally for days at a time until, out of the blue, usually late in the evening, users in California would report 5-10 minute delays when saving large amounts of data.

After giving up on trying to react fast enough to catch the fault in progress, I was asked to do a comprehensive audit of their cluster to find the problem. As part of the audit, I wanted to capture and analyze SQL traces and OS performance counters. Since the problem was very intermittent, I decided to collect data 24 hours a day and hope that I got lucky. The client was very nervous about collecting SQL trace data for long periods of time, so as a compromise I set up SQL trace collections each night between 8pm and 11pm. For OS counters, I started collecting perfmon stats for all objects at a five-minute interval, with the intent of letting it run 24x7 until I captured at least one major delay.
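For anyone setting up a similar collection window, a server-side trace with a built-in stop time looks roughly like the sketch below. It is a minimal outline, not the script I actually ran: the file path, stop time, and event/column selections are illustrative.

-- Minimal sketch of a nightly server-side trace (file path and stop time are illustrative)
DECLARE @TraceID int;
DECLARE @maxfilesize bigint;
DECLARE @stoptime datetime;
DECLARE @on bit;
SET @maxfilesize = 512;                -- roll over to a new file every 512 MB
SET @stoptime = '20101116 23:00';      -- trace shuts itself off at 11pm
SET @on = 1;

-- Create the trace with file rollover enabled (@options = 2)
EXEC sp_trace_create @TraceID OUTPUT, 2, N'E:\Traces\nightly_trace', @maxfilesize, @stoptime;

-- Capture completed batches and RPCs with the columns needed for later analysis
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;   -- SQL:BatchCompleted - TextData
EXEC sp_trace_setevent @TraceID, 12, 12, @on;   -- SQL:BatchCompleted - SPID
EXEC sp_trace_setevent @TraceID, 12, 13, @on;   -- SQL:BatchCompleted - Duration
EXEC sp_trace_setevent @TraceID, 12, 16, @on;   -- SQL:BatchCompleted - Reads
EXEC sp_trace_setevent @TraceID, 12, 17, @on;   -- SQL:BatchCompleted - Writes
EXEC sp_trace_setevent @TraceID, 12, 18, @on;   -- SQL:BatchCompleted - CPU
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;   -- RPC:Completed      - TextData
EXEC sp_trace_setevent @TraceID, 10, 13, @on;   -- RPC:Completed      - Duration

-- Start the trace; it stops on its own at @stoptime
EXEC sp_trace_setstatus @TraceID, 1;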

As luck would have it, a long delay occurred the second night of SQL tracing. Great! Now at least I had something to analyze. Let's see now: perf counters for 8 servers, all objects, 24 hours of data. Wow, what was I thinking? I had collected the data in .csv files, and Excel complained that it couldn't load all the data. Here I turned to a tool from Microsoft called PAL. It analyzes perfmon data, applies tribal knowledge to thresholds, and produces a report with commentary and graphs. The reports include derived values (usually ratios) that are further scrutinized, along with links to articles giving background information related to the commentary. PAL warrants a separate post, which I will do later, but for now it is a great new addition to my toolkit, and you can read more about it here and here. The reports are in HTML or XML format. You can view a sample here (which I converted to PDF format for this post).

The OS counters indicated that at the time of the delay there was a great deal of I/O on, and relatively poor service times from, the database data disks. The other servers showed no significant spike in utilization. I was also able to glean from the PAL report that a small number of SPIDs within the database were causing the spike in I/O.
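If you want to confirm that kind of finding live rather than after the fact, the session-level I/O columns in the DMVs will show it. The query below is just an illustrative ad hoc check, not something produced by PAL.

-- Illustrative ad hoc check: which user sessions are driving the most I/O right now?
SELECT TOP (10)
       s.session_id,
       s.login_name,
       s.host_name,
       s.reads,                 -- reads issued over the session's lifetime
       s.writes,                -- writes issued over the session's lifetime
       s.logical_reads,
       r.command,
       r.wait_type
FROM sys.dm_exec_sessions AS s
LEFT JOIN sys.dm_exec_requests AS r
       ON r.session_id = s.session_id
WHERE s.is_user_process = 1
ORDER BY s.reads + s.writes DESC;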

Armed with this information, I decided to ignore the other 7 servers and focus on the database server. I turned to the SQL trace data I had collected and found 1.5 GB of it. I've previously blogged about a cool SQL trace analyzer from DBSophic, but I'd only analyzed small traces (less than 100 MB of data), and it took hours to do that. Coincidentally, a week prior to this I had received an offer from the CTO of DBSophic to evaluate their next-generation SQL trace analyzer: Qure Workload Analyzer (QWA). The tease was "Calling QWA a successor to Trace Analyzer is like calling an F-35 stealth fighter a successor to Wright’s first airplane, well - both can fly…". Given that the old tool would have taken days to process 1.5 GB of trace data, I was kind of hoping that this tool performs a lot closer to an F-35 than the Wright Flyer.

Normally, I'd provide a link here, but I was given a trial version. I'll edit this post when it becomes generally available, or you can check DBSophic's web site yourself to inquire about or download QWA. I launched QWA and read in a few trace files. It's fast, a lot faster than I remember. I didn't do any comprehensive performance tests of the last version, but I do remember the run times were not linear with volume. In plain English, if you doubled the amount of data, it took a lot longer than twice the time to process. I ran a few quick tests on both versions using small amounts of data (50 MB or so) and found the new tool was 4 to 5 times faster. I was able to process 1.5 GB of trace data (that's 3 hours of SQL trace data spread over 319 trace files) in just under 25 minutes. A very nice improvement! I could not have used the old tool to process this much data, so my first observation is that the tool is, metaphorically speaking, a lot closer to an F-35 in performance. Without it, this project may have crashed before getting off the runway!

The old version of Trace Analyzer was able to group similar queries and aggregate useful statistics like run time, CPU utilization, and I/O rates. QWA takes analysis to a new level. New features include multiple grouping options, smart filters, improved performance, and the ability to compare workloads (with normalization and weighting options... which I haven't tried yet). OK, sorry if I'm starting to sound like a salesman, but I am really impressed with the capability and usability of this tool.

I was able to use QWA to sift through the data, and I quickly found a delete statement that took 5 minutes, plus two queries, all of which were being executed around the same time the slowdown was reported. Finding these three statements was the key to identifying the problem. I don't want to think about how long this would have taken without QWA. The delete statement was coming from the user doing the save. Both queries looked to be related to ad hoc reports. I ran the first query in parallel with the Save Session function that was running so slowly, and it had no effect. I ran the second query, and the Save Session operation did not complete until after the query completed. Finally! After months of searching, I could finally reproduce the problem.
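While reproducing it, the blocking relationship is easy to see straight from the DMVs. Something along these lines (an illustrative ad hoc query, run while the save is hung) shows who is waiting on whom and on what resource:

-- Illustrative check while the save is hung: who is blocked, by whom, and on what?
SELECT r.session_id,
       r.blocking_session_id,
       r.status,                -- the blocked SPID shows up here as 'suspended'
       r.wait_type,             -- typically an LCK_M_* wait when stuck behind a lock
       r.wait_time,
       r.wait_resource,
       t.text AS current_statement
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;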

The query was a five-way join over a large amount of data and took about five minutes to complete. The query was taking out shared locks on a number of tables that the Save function was trying to modify, hence the blocking. The odd thing was that SQL Server showed the blocked SPID as suspended, not blocked (and it did not show up on the blocked transaction report delivered with SQL Server Management Studio). I'm still scratching my head about that, but we have at least found the root cause of the problem, which was a consequence of the default strategy that SQL Server uses to ensure read consistency. There is a good description of read consistency and isolation levels here.
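For what it's worth, SQL Server's row-versioning options are the usual way to take this class of reader-blocks-writer contention off the table, at the cost of extra tempdb usage. The statements below are a sketch against a placeholder database name (RoutingDB), not a change we made on the spot; whether either option fits depends on the application's semantics.

-- "RoutingDB" is a placeholder database name for illustration.
-- Option 1: read committed snapshot - readers see row versions instead of taking shared locks
-- (database-wide setting; needs exclusive access to the database to switch on)
ALTER DATABASE RoutingDB SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Option 2: snapshot isolation, opted into per session by the long-running reports
ALTER DATABASE RoutingDB SET ALLOW_SNAPSHOT_ISOLATION ON;
-- ...then, on the reporting connection:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;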

My sincere thanks to the folks at DBSophic. Great people and great tools!

(P.S. - I may update this post with pics and some examples of QWA's features once the tool becomes generally available.)