Friday, November 26, 2010

Pro tip: When searching for a needle in a haystack, use strong magnets.

I was recently hired to diagnose a performance problem for a client with a server farm in Florida and remote users throughout the continental US and Alaska. Their application is used to plan and manage routes for delivery vehicles, is graphically intense, and has a number of compute-intensive background processes for routing and mapping. It also places a moderate load on the database server. One would expect occasional UI lag from a graphically intense app given that the Citrix farm is located a few thousand miles away, but they were also experiencing five- to ten-minute delays when saving the states of their routing sessions. The Citrix servers were on the same LAN as the database server, and other than presentation graphics, not much else was being sent across the WAN. One of the biggest challenges was that the system would perform normally for days at a time until, out of the blue, usually late in the evening, users in California would report five- to ten-minute delays when saving large amounts of data.

After giving up on trying to react fast enough to catch the fault in progress, I was asked to do a comprehensive audit of their cluster to find the problem. As part of the audit, I wanted to capture and analyze SQL traces and OS performance counters, and since the problem was so intermittent, I decided to collect data around the clock and hope I got lucky. The client was nervous about the overhead of collecting SQL trace data for long periods, so as a compromise I set up SQL trace collections each night between 8 p.m. and 11 p.m. For the OS counters, I collected perfmon stats for all objects at a 5-minute interval, with the intent of letting it run 24x7 until I captured at least one major delay.
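
For the nightly collections, something along the lines of the following server-side trace keeps the overhead well below what the Profiler GUI costs. This is a sketch only; the file path, event list, and duration filter are illustrative choices of mine, not the client's actual configuration.

```sql
-- Sketch of a nightly server-side trace (illustrative path, events, and filter).
DECLARE @TraceID int;
DECLARE @maxfilesize bigint;
DECLARE @minduration bigint;
SET @maxfilesize = 512;      -- MB per rollover file
SET @minduration = 100000;   -- 100 ms, expressed in microseconds

-- Create the trace with file rollover (option 2); files land at E:\traces\
EXEC sp_trace_create @TraceID OUTPUT, 2, N'E:\traces\nightly', @maxfilesize;

-- Capture SQL:BatchCompleted (event 12) with the columns we care about
EXEC sp_trace_setevent @TraceID, 12, 1,  1;   -- TextData
EXEC sp_trace_setevent @TraceID, 12, 12, 1;   -- SPID
EXEC sp_trace_setevent @TraceID, 12, 13, 1;   -- Duration
EXEC sp_trace_setevent @TraceID, 12, 14, 1;   -- StartTime
EXEC sp_trace_setevent @TraceID, 12, 16, 1;   -- Reads
EXEC sp_trace_setevent @TraceID, 12, 17, 1;   -- Writes

-- Keep only statements longer than 100 ms to limit trace volume
EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @minduration;

-- A SQL Agent job would start this at 8 p.m. and stop it at 11 p.m.:
EXEC sp_trace_setstatus @TraceID, 1;   -- 1 = start
-- EXEC sp_trace_setstatus @TraceID, 0;   -- 0 = stop
-- EXEC sp_trace_setstatus @TraceID, 2;   -- 2 = close and delete definition
```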

As luck would have it, a long delay occurred the second night of SQL tracing. Great! Now at least I have something to analyze. Let's see now: perf counters for 8 servers, all objects, 24 hours of data. Wow, what was I thinking? I collected the data in .csv files, and Excel is complaining it can't load it all. Here I turned to a tool from Microsoft called PAL. It analyzes perfmon data, applies tribal knowledge to thresholds, and produces a report with commentary and graphs. The reports include derived values (usually ratios) that are further scrutinized, along with links to articles giving background information related to the commentary. PAL warrants a separate post, which I will do later, but for now it is a great new addition to my tool kit, and you can read more about it here and here. The reports are in HTML or XML format. You can view a sample here (which I converted to PDF format for this post).

The OS counters indicated that at the time of the delay there was a great deal of I/O on, and relatively poor service times from, the database data disks. The other servers showed no significant spike in utilization. I was also able to glean from the PAL report that a small number of SPIDs within the database was causing the spike in I/O.
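
You can cross-check that kind of finding straight from the trace files, since fn_trace_gettable exposes the trace columns as a rowset. A sketch along these lines totals the I/O per SPID; the path and time window are illustrative.

```sql
-- Sketch: total I/O per SPID straight from the trace files. The path and
-- time window are illustrative. Passing DEFAULT as the second argument
-- tells fn_trace_gettable to read all rollover files in the sequence.
SELECT TOP (10)
       SPID,
       COUNT(*)                AS statements,
       SUM(Reads)              AS total_reads,
       SUM(Writes)             AS total_writes,
       SUM(Duration) / 1000000 AS total_seconds   -- Duration is in microseconds
FROM fn_trace_gettable(N'E:\traces\nightly.trc', DEFAULT)
WHERE StartTime >= '2010-11-20 20:00' AND StartTime < '2010-11-20 23:00'
GROUP BY SPID
ORDER BY total_reads DESC;
```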

Armed with this information, I decided to ignore the other 7 servers and focus on the database server. I turned to the SQL trace data I had collected and found 1.5 GB of it. I've previously blogged about a cool SQL trace analyzer from DBSophic, but I'd only analyzed small traces (less than 100 MB), and it took hours to do that. Coincidentally, a week prior to this I had received an offer from the CTO of DBSophic to evaluate their next-generation SQL trace analyzer, Qure Workload Analyzer (QWA). The tease was "Calling QWA a successor to Trace Analyzer is like calling an F-35 stealth fighter a successor to Wright’s first airplane, well - both can fly…". Given that the old tool would have taken days to process 1.5 GB of trace data, I was hoping this tool's performance was a lot closer to the F-35 than the Wright Flyer.

Normally I'd provide a link here, but I was given a trial version. I'll edit this post when it becomes generally available, or you can check DBSophic's web site yourself to inquire about or download QWA. I launched QWA and read in a few trace files. It's fast, a lot faster than I remember. I didn't do any comprehensive performance tests of the last version, but I do remember the run times were not linear with volume; in plain English, if you doubled the amount of data, it took a lot longer than twice the time to process. I ran a few quick tests on both versions using small amounts of data (50 MB or so) and found the new tool was 4 to 5 times faster. I was able to process 1.5 GB of trace data (that's 3 hours of SQL trace data spread over 319 trace files) in just under 25 minutes. A very nice improvement! I could not have used the old tool to process this much data, so my first observation is that the tool IS, metaphorically speaking, a lot closer to an F-35 in performance. Without it, this project may have crashed before getting off the runway!

The old version of Trace Analyzer was able to group similar queries and aggregate useful statistics like run time, CPU utilization, and I/O rates. QWA takes analysis to a new level. New features include multiple grouping options, smart filters, improved performance, and the ability to compare workloads (with normalization and weighting options, which I haven't tried yet). OK, sorry if I'm starting to sound like a salesman, but I am really impressed with the capability and usability of this tool.

I was able to use QWA to sift through the data, and I quickly found a delete statement that took 5 minutes, plus two queries, all of which were executing around the time the slowdown was reported. Finding these three statements was the key to identifying the problem. I don't want to think about how long this would have taken without QWA. The delete statement was coming from the user doing the save. Both queries looked to be related to ad hoc reports. I ran the first query in parallel with the Save Session function that was running so slowly, and it had no effect. I ran the second query, and the Save Session operation did not complete until after the query completed. Finally! After months of searching, I could reproduce the problem.

The query was a five-way join over a large amount of data and took about five minutes to complete. It was taking out share locks on a number of tables that the Save function was trying to modify, hence the blocking. The odd thing was that SQL Server showed the blocked SPID as suspended, not blocked (and it did not show up on the blocked-transaction report delivered with SQL Server Management Studio). I'm still scratching my head about that, but we had at least found the root cause of the problem, which was a consequence of the default strategy SQL Server uses to ensure read consistency. There is a good description of read consistency and isolation levels here.
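
For what it's worth, if you can catch one of these delays in progress, the standard DMVs will show the reader/writer conflict directly. This is a sketch rather than anything I ran during the engagement, but it's the query I'd reach for:

```sql
-- Sketch: show who is blocked and who is doing the blocking, right now.
-- A lock-blocked request sits with status 'suspended' and a nonzero
-- blocking_session_id, which is why a blocked request can display as suspended.
SELECT r.session_id,
       r.status,                      -- 'suspended' = waiting on something
       r.blocking_session_id,         -- the SPID holding the conflicting lock
       r.wait_type,                   -- e.g. LCK_M_X waiting behind share locks
       r.wait_time / 1000 AS wait_seconds,
       t.text AS current_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```

One plausible explanation for the "suspended, not blocked" display: in the DMVs, suspended just means the request is waiting on something, locks included, and it's the nonzero blocking_session_id that identifies lock blocking. The usual mitigation for this reader-blocks-writer pattern under the default READ COMMITTED isolation level is row versioning (ALTER DATABASE ... SET READ_COMMITTED_SNAPSHOT ON), which lets readers avoid share locks entirely, at the cost of tempdb version-store overhead.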

My sincere thanks to the folks at DBSophic. Great people and great tools!

(P.S. - I may update this post with pics and some examples of QWA's features once the tool becomes generally available.)

3 comments:

  1. Thanks, I shall look at DBSophic’s website.

    However, I have a question about PAL.
    When you are working with PAL for performance analysis for a large time segment of data, do you get PAL to analyze the whole time period or individual segments?

  2. Andrew,

    I usually use PAL to analyze the entire period for which I collect data. At the top of the report there are two summaries of results: the first shows alerts in chronological order, and the second is organized by resource (CPU, Memory, Network, etc.). This gives you two indexes into the results of its analysis.

    Now, if I know that something bad happened at a specific time, I make a copy of the .CSV file and delete all rows except those within 15 minutes before and after the event. This lets me focus on alerts more likely to be related to the problem I am investigating. The sample report I posted here covered 15 minutes on either side of the performance problem I was investigating.

    If I had no reports of a specific problem, I'd use PAL to analyze all the data and then look for alerts that indicate either configuration problems or hardware components with heavy utilization. This eliminates a LOT of manual work in formatting, graphing, etc., but I still need to review the alerts and apply my own analysis to determine what actions, if any, need to be taken.

  3. My two cents on Qure from DBSophic: I totally agree with you. Great tools. It loaded about 2 GB of trace data in 30 minutes and quickly analyzed it. A definite must-have tool for any DBA, and also for any development team: loading the trace data into a centralized SQL Server and letting developers install the analysis tool connected to that database, so they can check how their code is performing, is very useful.

    I also really liked the ability to extract properly "SQL-ized" batches that can be used to tune the query via SQL Profiler.
