Friday, November 26, 2010

Pro tip: When searching for a needle in a haystack, use strong magnets.

I was recently hired to diagnose a performance problem for a client with a server farm in Florida and remote users throughout the continental US and Alaska. Their application, used to plan and manage routes for delivery vehicles, is graphically intense and has a number of compute-intensive background processes for routing and mapping. It also places a moderate load on the database server. One would expect occasional UI lag from a graphically intense app hosted on a Citrix farm a few thousand miles away, but users were also experiencing five- to ten-minute delays when saving the state of their routing sessions. The Citrix servers were on the same LAN as the database server, and other than presentation graphics, not much else was being sent across the WAN. One of the biggest challenges was that the system would perform normally for days at a time until, out of the blue, usually late in the evening, users in California would report five- to ten-minute delays when saving large amounts of data.

After giving up trying to react fast enough to catch the fault in progress, I was asked to do a comprehensive audit of their cluster to find the problem. As part of the audit, I wanted to capture and analyze SQL traces and OS performance counters. Since the problem was very intermittent, I wanted to collect data around the clock and hope that I got lucky. The client, however, was nervous about collecting SQL trace data for long periods of time, so I set up SQL trace collections each night between 8pm and 11pm. For OS counters, I started collecting perfmon stats for all objects at a 5-minute interval, with the intent of letting it run 24x7 until I captured at least one major delay.
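A server-side trace is one way to do this kind of scheduled collection. Below is a minimal sketch, not the exact definition used on the client's servers: the file path, file size, stop time, and the handful of events and columns are illustrative. A SQL Agent job can kick it off at 8pm, and the stop time makes the trace shut itself off at 11pm.

---------------------------------------------------------------------------------------------

-- Minimal server-side trace sketch (illustrative values only)
DECLARE @traceid int, @maxfilesize bigint, @stoptime datetime, @on bit
SET @maxfilesize = 500                              -- MB per rollover file
SET @stoptime    = '2010-11-26 23:00:00'            -- trace stops itself at 11pm
SET @on          = 1

EXEC sp_trace_create @traceid OUTPUT, 2,            -- option 2 = file rollover
     N'C:\traces\nightly_workload', @maxfilesize, @stoptime

-- Capture completed batches (event 12) and RPCs (event 10) with a few useful columns:
-- 1=TextData, 12=SPID, 13=Duration, 14=StartTime, 16=Reads, 17=Writes, 18=CPU
EXEC sp_trace_setevent @traceid, 12, 1,  @on
EXEC sp_trace_setevent @traceid, 12, 12, @on
EXEC sp_trace_setevent @traceid, 12, 13, @on
EXEC sp_trace_setevent @traceid, 12, 14, @on
EXEC sp_trace_setevent @traceid, 12, 16, @on
EXEC sp_trace_setevent @traceid, 12, 17, @on
EXEC sp_trace_setevent @traceid, 12, 18, @on
EXEC sp_trace_setevent @traceid, 10, 1,  @on
EXEC sp_trace_setevent @traceid, 10, 13, @on
EXEC sp_trace_setevent @traceid, 10, 18, @on

EXEC sp_trace_setstatus @traceid, 1                 -- start the trace

---------------------------------------------------------------------------------------------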

As luck would have it, a long delay occurred on the second night of SQL tracing. Great! Now at least I had something to analyze. Let's see now: perf counters for 8 servers, all objects, 24 hours of data. Wow, what was I thinking? I had collected the data in .csv files, and Excel complained that it couldn't load it all. Here I turned to a tool from Microsoft called PAL. It analyzes perfmon data, applies tribal knowledge to thresholds, and produces a report with commentary and graphs. The reports include derived values (usually ratios) that are further scrutinized, along with links to articles giving background information related to the commentary. PAL warrants a separate post, which I will do later, but for now it is a great new addition to my tool kit and you can read more about it here and here. The reports are in HTML or XML format. You can view a sample here (which I converted to PDF format for this post).

The OS counters indicated that at the time of the delay there was a great deal of I/O on, and relatively poor service times from, the database data disks. The other servers showed no significant spike in utilization. I was also able to glean from the PAL report that a small number of SPIDs within the database were causing the spike in I/O.

Armed with this information, I decided to ignore the other 7 servers and focus on the database server. I turned to the SQL trace data I had collected and found 1.5 GB of it. I've previously blogged about a cool SQL trace analyzer from DBSophic, but I'd only analyzed small traces (less than 100 MB), and it took hours to do that. Coincidentally, a week prior to this I had received an offer from the CTO of DBSophic to evaluate the next-generation SQL trace analyzer: Qure Workload Analyzer (QWA). The tease was "Calling QWA a successor to Trace Analyzer is like calling an F-35 stealth fighter a successor to Wright's first airplane, well - both can fly…". Given that the old tool would have taken days to process 1.5 GB of trace data, I was hoping this tool's performance was a lot closer to the F-35 than the Wright Flyer.

Normally I'd provide a link here, but I was given a trial version. I'll edit this post when it becomes generally available, or you can check DBSophic's web site yourself to inquire about or download QWA. I launched QWA and read in a few trace files. It's fast, a lot faster than I remember. I didn't do any comprehensive performance tests of the last version, but I do remember the run times were not linear with volume. In plain English, if you doubled the amount of data, it took a lot longer than twice the time to process. I ran a few quick tests on both versions using small amounts of data (50 MB or so) and found the new tool was 4 to 5 times faster. I was able to process 1.5 GB of trace data (that's 3 hours of SQL trace data spread over 319 trace files) in just under 25 minutes. A very nice improvement! I could not have used the old tool to process this much data, so my first observation is that the tool is, metaphorically speaking, a lot closer to an F-35 in performance. Without it, this project may have crashed before getting off the runway!
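As an aside, if you don't have a trace analyzer handy, raw .trc files can also be loaded straight into SQL Server for ad hoc querying with fn_trace_gettable. A minimal sketch (the path is a placeholder, and the columns available depend on what the trace captured):

---------------------------------------------------------------------------------------------

-- Load a rollover set of trace files into a table; passing the first file and
-- DEFAULT tells fn_trace_gettable to follow the rollover files automatically.
SELECT TextData, SPID, StartTime, Duration, CPU, Reads, Writes
INTO   TraceWorkload
FROM   fn_trace_gettable(N'C:\traces\nightly_workload.trc', DEFAULT)

-- Longest statements first (Duration is reported in microseconds in SQL Server 2005+ traces)
SELECT TOP 20 SPID, StartTime, Duration / 1000000.0 AS duration_sec, TextData
FROM   TraceWorkload
ORDER  BY Duration DESC

---------------------------------------------------------------------------------------------

This works fine for a few hundred megabytes of trace data, but grouping tens of thousands of similar statements is exactly where a dedicated analyzer earns its keep.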

The old version of Trace Analyzer was able to group similar queries and aggregate useful statistics like run time, CPU utilization, and I/O rates. QWA takes analysis to a new level. New features include multiple grouping options, smart filters, improved performance, and the ability to compare workloads (with normalization and weighting options... which I haven't tried yet). Ok, sorry if I am starting to sound like a salesman, but I am really impressed with the ability and usability of this tool.

I was able to use QWA to sift through the data, and I quickly found a delete statement that took 5 minutes, plus two queries, all of which were being executed around the time the slowdown was reported. Finding these three statements was the key to identifying the problem. I don't want to think about how long this would have taken without QWA. The delete statement was coming from the user doing the save. Both queries looked to be related to ad hoc reports. I ran the first query in parallel with the Save Session function that was running so slowly, and it had no effect. I ran the second query, and the Save Session operation did not complete until after the query completed. Finally! After months of searching, I could reproduce the problem.

The query was a five-way join over a large amount of data and took about five minutes to complete. It was taking out share locks on a number of tables that the Save function was trying to modify, hence the blocking. The odd thing was that SQL Server showed the blocked SPID as suspended, not blocked (and it did not show up on the blocked transaction report delivered with SQL Server Management Studio). I'm still scratching my head about that, but we had at least found the root cause of the problem, which was a consequence of the default strategy SQL Server uses to ensure read consistency. There is a good description of read consistency and isolation levels here.
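For what it's worth, one way to catch this kind of blocking in the act is to poll the DMVs. A minimal sketch for SQL Server 2005 and later; note that blocked requests typically show a status of 'suspended' along with a non-zero blocking_session_id, which squares with what I saw:

---------------------------------------------------------------------------------------------

-- Requests that are currently blocked, who is blocking them, and on what
SELECT r.session_id,
       r.blocking_session_id,
       r.status,
       r.wait_type,
       r.wait_time        AS wait_time_ms,
       r.wait_resource,
       t.text             AS blocked_statement
FROM   sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE  r.blocking_session_id <> 0

---------------------------------------------------------------------------------------------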

My sincere thanks to the folks at DBSophic. Great people and great tools!

(P.S. - I may update this post with pics and some examples of QWA's features once the tool becomes generally available.)

Saturday, January 30, 2010

A cool SQL trace analyzer

My primary business is load testing and performance characterization, but occasionally I am asked to analyze production instances that are behaving badly. This is a lot easier if you have load tested the application beforehand and know its characteristics. Recently I was asked to look into a poorly performing application that my client's client had implemented and then augmented with a custom web service that allows other devices to interface to the system. I'm being a little vague here to protect my client's privacy, but the application details are not important to the story. So part of the application I know well, and part is a complete mystery. The web services connect directly to the database that I know well.

Fast forward to my first live session on the production instance, and it was apparent that this system was behaving differently from those I had tested. In my prior experience with the application, a multi-tiered architecture, the database server was lightly used. On this instance, a 3-year-old 8-core DB server was averaging 70-80% CPU busy, an order of magnitude more than I would expect. The other tiers were relatively lightly loaded, so I focused on the database server (MS SQL Server 2005).

SQL Server's reports for top queries by total CPU and top queries by average CPU showed that one stored procedure using cursor processing was responsible for the majority of the CPU utilization. The stored procedure was modified to use more set-based processing and, a few weeks later, was QA'd and put into production. The result was disappointing: CPU utilization barely changed. What's going on here? Performance reports showed that the stored procedure was now consuming a third of the CPU resources it used to, but it was still the top consumer of CPU, and CPU was still averaging about 70% busy.
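Those built-in reports draw on the plan-level statistics in sys.dm_exec_query_stats (or something very close to it); a rough equivalent looks like the sketch below. Note that the numbers are aggregated per cached plan, so ad hoc statements that differ only in their literal values each get their own row.

---------------------------------------------------------------------------------------------

-- Top statements by total CPU since their plans were cached (SQL Server 2005+)
SELECT TOP 10
       qs.total_worker_time / 1000                        AS total_cpu_ms,
       qs.execution_count,
       qs.total_worker_time / qs.execution_count / 1000   AS avg_cpu_ms,
       SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
                 (CASE WHEN qs.statement_end_offset = -1
                       THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset END
                  - qs.statement_start_offset) / 2 + 1)   AS statement_text
FROM   sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER  BY qs.total_worker_time DESC

---------------------------------------------------------------------------------------------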

One possible explanation is that demand so far exceeded capacity that by tuning the stored procedure I increased the throughput of the system, but the primary bottleneck remained DB server CPU. I had no way to measure throughput or overall system response times, so this will remain a theory at best. Still, I had a nagging feeling that I was missing something important. I decided to take a deeper look by generating a trace of activity using SQL Profiler. This is a production system, so I started by taking a short snapshot during a slow period of activity. There were hundreds of SPIDs (database sessions) active, and I was flooded with trace information.

After sorting the data by CPU used or logical I/Os performed, I kept coming back to the stored procedure we had already optimized. On a whim, I started Googling for ideas or tools to help analyze the mountain of trace data before me. I stumbled onto a tool, Trace Analyzer from DBSophic, that was available for free. OK, the price is right, so let's give it a try.

I downloaded the tool and pointed it at my trace files (about 50 MB worth of trace data) and a few minutes later the answer was staring me in the face. A query that never showed up on any of the reports bubbled to the top of the list.



After sorting by CPU, I saw that 95% of all CPU time was due to one query. This cannot be a coincidence, but why didn't I notice it before? Neither the top-query-by-average-CPU nor the top-query-by-total-CPU report ever mentioned this query. The reason is pretty simple. While there were tens of thousands of these queries, they were similar, not identical. The selected columns, the tables queried, and the columns used in the WHERE criteria were identical; the values used in the WHERE criteria were all different. While the queries
SELECT someColumns FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = 'ABC' AND ROUTE_DATE = {TS '2010-01-28 00:00:00'} AND ROUTE_ID = 'DEF' AND INTERNAL_STOP_ID = 1
and
SELECT someColumns FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = 'GHI' AND ROUTE_DATE = {TS '2010-01-28 00:00:00'} AND ROUTE_ID = 'JKL' AND INTERNAL_STOP_ID = 2

are viewed as separate queries by SQL Server's built-in performance reports, the trace analyzer from DBSophic recognized that they were really the same query, and that was the key to understanding what was happening on this database instance. They are both instances of the query

SELECT someColumns FROM TSDBA.DP_ORDER_EXTENDED_VIEW WHERE REGION_ID = value AND ROUTE_DATE = value AND ROUTE_ID = value AND INTERNAL_STOP_ID = value
The database server is still overloaded, but with the help of my client I have identified the source of these queries, and I now understand why this particular instance of the application behaves so differently from what we tested in the lab. The next step is to determine whether there is a way to reduce the demand, either by issuing the query less often (perhaps with some caching at the web service) or by making it more efficient through tuning. If neither of these is possible, then the problem will need to be addressed by adding CPU capacity at the database server.
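As a side note, SQL Server 2008 added a query_hash column to sys.dm_exec_query_stats that groups statements differing only in their literal values; on a 2005 instance like this one, a trace analyzer is the practical option. A sketch of the newer approach:

---------------------------------------------------------------------------------------------

-- Aggregate CPU across statements that differ only in literal values
-- (query_hash requires SQL Server 2008 or later)
SELECT qs.query_hash,
       SUM(qs.execution_count)            AS executions,
       SUM(qs.total_worker_time) / 1000   AS total_cpu_ms
FROM   sys.dm_exec_query_stats qs
GROUP  BY qs.query_hash
ORDER  BY SUM(qs.total_worker_time) DESC

---------------------------------------------------------------------------------------------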

This tool helped me find patterns in trace data that would have otherwise gone unnoticed. Thanks DBSophic! Great tool.

(P.S. - I have no relationship with DBSophic other than being a satisfied user of their tool)

Monday, January 4, 2010

Response time statistics using SQL

I typically use Excel for statistical analysis of response times. Occasionally, for very large data sets, I use SQL. Here is some handy SQL to calculate min, max, average, std deviation, and 95th percentiles from response time data. The response time data for this example came from a test where I had established steady state periods of activity at 300, 600, 900, and 1200 virtual users.

This method doesn't do any interpolation and therefore requires at least 100 samples to give a reasonable result for percentiles. The base table of timers consists of the columns timername (type varchar), activeusers (the number of active users at the time the timer was recorded, type int), and elapsedtime (the elapsed time of the timed event, as a floating-point number).

---------------------------------------------------------------------------------------------

select
    timername as 'Timer Name',
    activeusers as 'Active Users',
    count(elapsedtime) as 'Count',
    avg(elapsedtime) as 'Average',
    stdev(elapsedtime) as 'StdDev',
    min(elapsedtime) as 'Min',
    max(elapsedtime) as 'Max',
    (select max(elapsedtime) from
        (select top 95 percent elapsedtime
         from timers b
             where b.timername = a.timername and b.activeusers = a.activeusers
             order by b.elapsedtime asc) as elapsedtime) as '95th Percentile'
from timers a
    where a.activeusers in (300,600,900,1200)
    group by a.timername, a.activeusers
    order by a.timername asc, a.activeusers asc

---------------------------------------------------------------------------------------------
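An alternative that avoids the correlated TOP ... PERCENT subquery is to rank each group's samples with ROW_NUMBER and keep the one sitting at the 95% position (a nearest-rank calculation, still without interpolation). A sketch against the same timers table:

---------------------------------------------------------------------------------------------

select timername as 'Timer Name',
       activeusers as 'Active Users',
       elapsedtime as '95th Percentile'
from (
    select timername, activeusers, elapsedtime,
           row_number() over (partition by timername, activeusers
                              order by elapsedtime asc) as rn,
           count(*) over (partition by timername, activeusers) as cnt
    from timers
    where activeusers in (300,600,900,1200)
) x
where rn = ceiling(0.95 * cnt)
order by timername asc, activeusers asc

---------------------------------------------------------------------------------------------

Either version works; I'll stick with the original query for the rest of this post.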

After executing the query, I cut and paste the output to Excel.




Graphing count vs. VUs shows how throughput varies with load. Here is a graph of home page throughput.



And here is a graph of 95th-percentile home page response time vs. VUs, which indicates some sort of bottleneck above 900 VUs.



Wednesday, December 23, 2009

Using OpenSTA in the Amazon EC2 (part 3)

(...continued from Part 2)

In part 2 of this series, we started two instances and configured the OpenSTA name servers so that one instance became the master and the other a slave. To make this example a little more interesting, I'll be illustrating a few concepts I discussed in my first post on performance testing strategies.

For this exercise, the sample workload definition is as follows:

  • Script1 will be executed 30% of the time. It will access the home page for www.iperformax.com and record the elapsed time with a label of "homepage".
  • Script2 will be executed 60% of the time. It will access the services page at http://www.iperformax.com/services.html and record the elapsed time with a label of "services".
  • Script3 will be executed 10% of the time. It will access the testimonials page at http://www.iperformax.com/testimonials.html and record the elapsed time with a label of "testimonials".
  • The average time spent viewing a page will be 4 minutes.
This test will be exploratory in nature. I don't have a specific load or service level objective to meet, but I do want to find the limits of my web site, the point where a small increase in load results in a large degradation in response time. To do so, I will run up to 2,400 virtual users. Since I want to see how response time and throughput change with load, I need to vary the load. I will take a stepwise approach: ramp up 25% of peak load in the first 5 minutes, then allow the test to run in a steady state for 10 minutes, and repeat this pattern 4 times. Afterwards, I will examine the response time and throughput for the four 10-minute periods where the load was constant (i.e., at 600, 1200, 1800, and 2400 virtual users).
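As a rough sanity check on the offered load (and on injector sizing), if response times are small compared to the 4-minute think time, each virtual user requests about one page every 4 minutes, so at peak:

\[ \text{peak throughput} \approx \frac{2400\ \text{virtual users}}{4\ \text{min/page}} = 600\ \text{pages/min} \approx 10\ \text{pages/s} \]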

This is not meant as an OpenSTA scripting tutorial (read more about mentoring, training, and support available from Performax), but I will post the master script that I developed for this example. Let's review where we are with respect to injectors: in OpenSTA Commander, menu Tools->Architecture Manager shows the following.





This output indicates that the master node's server name is IP-0AF41E63 and the slave (highlighted in reverse video) is named IP-0AF5C5F1. A good check to do at this point is to verify that both master and slave are on the same subnet. To do so, I will initiate a trace route from the master node to the slave.



Indeed they are. In fact, they are on the same physical server. If they were on different servers, you would see at least two more hops. If they were on different subnets, the IP addresses outside of the 10.X.X.X range would NOT have identical values for the first three numbers (X.Y.Z.*). I have never launched multiple servers into the same region at the same time and had them end up on different subnets. While apparently rare, it can happen, and any two servers located on different subnets will not work together in a master/slave distributed test as described here.

I have tried to add to a set of instances launched earlier in the day (same region) and found the new ones were on different subnets. If you think you will need 10 servers, launch them at the same time or you may have to start over. There is a way (OpenSTA daemon relay) to work around the subnet restriction, but it is beyond the scope of this post.

The astute reader will notice that the two addresses in the trace route output are not on the same subnet. Well, the REAL restriction (AFAIK) is that the nodes must be able to send multicast messages to each other, and the way the network is implemented between VMs allows instances on the same physical machine to send multicast messages to each other even though their addresses may indicate they are on different subnets.

OK, back to Commander. I have created a master script. Since we are running a two-server test, I will need to create two task groups. First I create a test called BLOG_TEST and then drag and drop the master script onto the test grid under task1.



By default, this task group is assigned to localhost (the master server). Let's customize this task group to run half of the users and then clone it to also run on the slave. Click on the VUs column for task 1 and check the 'Introduce virtual users in batches' box to open the batch start options dialog. I will assign 1,200 VUs according to the ramp-up I described earlier.




Next, I want to limit this test to 1 hour. I click on the 'Start' column for task group 1, select 'after fixed time' in the Stop Task Group drop-down list box, and enter a time limit of 1 hour (hh:mm:ss) in the Time Limit box.




The last step is to clone this task group (right mouse anywhere on the first task group and select 'duplicate task group') and change the HOST cell to IP-0AF5C5F1 (the slave server name). I could also specify the IP address of the slave node here instead of typing the server name. The second task group is now a clone of the first, but will run from the slave.



Before we start, here is a look at the master script that will run Blog_Scenario_1 30% of the time, Blog_Scenario_2 60% of the time, and Blog_Scenario_3 10% of the time.




Here is the load profile produced by the test:



There you have it. It took a lot longer to write this than it did to set up the instances, create the scripts, and run the test. I will be making my OpenSTA instance available via Amazon's EC2 in the near future. It comes with a special build of OpenSTA that is customized for .NET applications (large variable support and built-in URL encoding) plus an OpenSTA script processor that does automated viewstate handling and allows creating scripts without hours of manual editing. For more information on using this instance, or questions about projects, training, or support, email me at bernie.velivis@iperformax.com


Bernie Velivis, President Performax Inc

Thursday, December 17, 2009

Using OpenSTA in the Amazon EC2 (part 2)

(...continued from Part 1)

The status of an instance changes to Running as it boots. It usually takes another ten to fifteen minutes before the instance is ready to accept logins. In this example, I will be creating a master node and a single slave. While these terms may not be used in the OpenSTA documentation, I define master node as the server that holds the test repository. The master node is used to control all aspects of a test using OpenSTA Commander as the primary interface.

Once started, both VMs are identical with the exception of their IP addresses and network names. To help me keep track of which server is which, I use the "tag" function to name each instance. Right mouse on a server in the instance pane and choose "add tag". I'll designate the first one as master and the second as slave1 to help me keep track of their roles.



For each of the instances started, EC2 creates both internal (private) and external (public) addresses. For machines to talk to one another within the EC2, they must use either their private IP address or DNS name. To connect to these instances from outside the EC2, use the public IP or DNS name. If you right mouse over an instance, you can view its details, copy the details to the paste buffer, or simply "connect to instance" which starts the remote desktop program (RDP) and points it at the server you selected. You can also "get password" for a newly created AMI if one has not been assigned.

The first step in configuring the new instances is to connect to the master node using RDP. One of the first things you will be greeted with is a message that the OpenSTA name server has exited, and you will be asked if you want to send an error report. Just cancel the dialogs. What's happening here? When OpenSTA was installed on the AMI, it took note of the server name and IP address of the machine it was installed on, and it remembers and uses that information to connect to the repository, which holds all scripts, data, test definitions, etc. The error message occurs because when the instance we just connected to was booted, it was assigned a new IP address. The name server (the background process that handles all the distributed communications) can't reach the repository due to stale information about the IP address of the repository host.

The fix is trivial, but must be done each time a system (master or slave) is started. The order in which the name servers are fixed and restarted also matters. Starting with the master node: log in, dismiss the error message, then right mouse on the name server icon in the systray (it looks like a green wheel) and select CONFIGURE. Enter 'localhost' in the dialog box for "repository host".




Note that I have also moved my repository to a different directory (c:\dropbox\...). When an AMI is started, the C: drive reverts to the state it was in when the AMI was bundled. Any changes we make to the contents of C: will be LOST when this instance is shut down. It would be rather inconvenient to create scripts, run tests, and do all sorts of analysis only to have the files lost after we shut down. I have opted to use free software (Dropbox) which replicates the contents of the dropbox directory (on a local drive) to a network drive (no doubt somewhere in Dropbox's cloud). On my office PC, I run Dropbox to replicate from this network drive to a local drive. This replication is bi-directional. There is rudimentary conflict resolution at the file level, but no distributed locking or sharing mechanism at the record level. Any changes to the repository on the EC2 instance are replicated to my PC and vice versa. This allows me to use my PC as a workbench for scripting and post-test analysis using my favorite tools, and use the cloud for running large tests. More about this in a future post.

OK, back to configuring the name server on the master node. After entering localhost in the Repository Host field, click the OK button. You must now restart the name server. To do so, right mouse on the name server icon and select "shutdown". Once the name server icon disappears, launch the name server again (Start->All Programs->OpenSTA->OpenSTA Name Server). Verify it is configured correctly by right mousing on the name server icon and selecting "registered objects". You should see something like this:



Note the value 10_209_67_50. It is based on the EC2 private IP address of the master instance we just configured. Remember the IP address 10.209.67.50; it is the master node's IP address, and we are going to need it in a few minutes as we repeat this process on each slave instance we started. Remember, the master node holds the repository. The only difference between a master node and a slave node is that a slave node has a remote server IP or network name as the "Repository Host" in the name server dialog box.

Next, create an RDP session to slave1. I prefer to RDP from the master node to each of the slaves. If using a lot of slaves, run Elasticfox on the master node, highlight ALL slaves in the instances pane, right mouse over the selected instances, and select Connect To Instance. This will start as many RDP sessions as you have slaves. All you need to do is manually type in the user name and password.

Upon logging in to each slave, you will be greeted with the same error message about the name server exiting. Repeat the steps outlined above to reconfigure the slave's name server, but this time specify the master node's private IP address (in this example, 10.209.67.50) as the repository host in the name server configuration dialog. Next, shut down the name server and restart it. Give the slave's name server a couple of minutes to complete the process of registering with the master node.

This process needs to be repeated for every slave instance. Once you have logged into the slave, leave the RDP session going since logging out will stop the name server. I prefer to initiate the RDP sessions from the master node, and keep just one RDP session from my PC to the master up to keep clutter at a minimum. Keep in mind that if you disconnect the RDP session the remote login will remain. Just don't log off from the remote slave. There may be a way to run the name server as a service, but that will take more work. Should I find a way, I will blog about it and likely rework this portion of the guide.

When finished with all the slaves, we can use OpenSTA Commander on the master node to verify which servers have joined this cluster of master and slaves. To do so, in Commander, select Tools->Architecture Manager, which shows the following display:



The top server in the list is the master; the second node is our first slave. Clicking on any server in the list will display information about that system, including its computer name (important, we'll need it later), OS info, memory available, number of processors, etc.

This process takes about 5 minutes for a handful of servers. It sounds complicated but can be summed up succinctly: connect to the master node, configure the name server's "repository host" to be localhost, and restart the name server. Next, connect to each slave, configure the name server's "repository host" to be the (private) IP address of the master node, and restart the name server. Verify all nodes are configured correctly with Tools->Architecture Manager in Commander.

At this point, you are ready to run multi-injector tests in the cloud which we will do in my next installment.

(continued in part 3)


Bernie Velivis, President Performax Inc

Monday, December 14, 2009

Using OpenSTA in the Amazon EC2 (part 1)

EC2, what is it? Amazon Elastic Compute Cloud consists of leased computers running virtual machines. It offers virtually unlimited compute and network scalability, on the operating system of your choice, on demand, dirt cheap. Read more about getting started with EC2 here.

OpenSTA, what is it? OpenSTA is a distributed software load testing tool designed around CORBA.  The current toolset has the capability of performing scripted HTTP and HTTPS heavy load tests with performance measurements from Win32 platforms. OpenSTA is open source and totally free (well, free in the sense that puppies are free... training, support, and maintenance are available at additional cost.)

Why OpenSTA on EC2? To be able to run tens of thousands of virtual users, with as much network bandwidth as you want, for about $10 PER DAY for each 1,500 virtual users.

Interested? Good, then let's get to it. The first step is installing the Elasticfox and S3 Organizer plug-ins for Firefox. Elasticfox allows you to start, save, reboot, and terminate AMIs (Amazon Machine Images, i.e. VM images). S3 Organizer allows you to create and manage permanent storage. AMIs are stored in S3 "buckets". Each time you start an AMI from Elasticfox, it reverts to the state it was in when the AMI was bundled (Amazon-speak for "saved") to an S3 bucket. I created a private AMI using the current beta release of OpenSTA running on Windows Server 2003. Here is a screen shot of Elasticfox.




Once the AMI was started, I logged in, installed OpenSTA and some other software I'll describe later and then created an S3 bucket (performax-opensta-v11) to make a permanent copy of my changes.





When you have the AMI in the state you want, use Elasticfox to bundle (save) it to the bucket you just created with the S3 Organizer. To do this, go to the instances pane in Elasticfox, right mouse over the running instance, and select 'bundle into an AMI', specifying the name of the S3 bucket you just created. It's a little confusing the first time you do this, but hang in there; all things seem hard until they become easy. I'm glossing over a few details here, but this is not meant to be a tutorial on Elasticfox and S3.

Once the AMI is created, you can start as many instances as you like. By default, they will all be standalone instances, but they can be configured to work with one another in a master/slave relationship as long as they are started in the same region. OpenSTA states that servers need to be on the same subnet to cooperate as multiple injectors for the same test. My experience is that as long as they are on the same LAN and can multicast messages to one another, they can cooperate. To start one or more instances of the OpenSTA AMI you created, go to the Elasticfox pane for images, filter on "My AMIs" to see only your AMIs, right mouse on the AMI, and select "Launch instance(s) of this AMI", which will bring up a dialog for starting instances.





In this example I have selected instance type m1.small. This creates a single-CPU instance with enough compute capacity and memory to handle all but the most compute-intensive scripts. Larger instances cost about 4 times as much, so use the small ones unless you know you need more.

In this example, I set the maximum number of instances to 2 and specified the Availability Zone us-east-1d to be sure they are all started in the same physical location. I plan to create a two-server configuration capable of running up to 3,000 virtual users and need both instances to be on the same LAN. It takes a good 15 to 20 minutes for Windows AMIs to start. My next post picks up after the AMIs have started. Watch the state column in the Instances tab for progress.



Continued in Part 2


Bernie Velivis, President Performax Inc

Monday, December 7, 2009

A model for understanding throughput

I think a great mental model for understanding throughput and capacity planning is that of a highway and toll system.

Cars represent the demand as they travel between points A and B. The highway lane(s) and toll booth(s) between A and B are the service centers where cars spend their time traveling. The highway speed limit and service time of cars at the toll booths quantify efficiency. The number of lanes and toll booths quantify parallelism.

As the system approaches its capacity limit, the queue for the toll booth(s) will grow as cars wait for service. If you want to measure the throughput of the system, all you need to do is count cars as they leave the slowest resource, in this case the toll booth(s).

Shrini's client wants to increase throughput. His highly dubious colleague suggested adding more cars. Let's run that idea through the mental model. First, two numbers: the speed limit on this single-lane, single-booth highway is 100 kph, and the toll booth takes 20 seconds to service a car. The capacity of this system is limited by the slowest resource, in this case the booth at 3 cars a minute. If you stand at the end of the highway and count cars, you will count at most 3 cars a minute.

But let's not confuse capacity with throughput. Capacity is what CAN flow through the system. Throughput is what IS flowing through the system. If the flow of cars is 2 cars per minute on this highway, then ADDING CARS will indeed increase throughput! So, depending on the initial conditions, adding demand could increase throughput.

If the system is operating at its capacity limit, then adding cars will do nothing for throughput; in fact, it will only serve to increase the total service time for individual cars traveling from point A to B.

Now we have a good mental model to explore what happens as we increase the speed of our service centers (the highway speed limit and the toll booths).

Back to my original statement: it might be clearer now that there are two options to increase throughput;

1) servicing individual requests faster (i.e. greater efficiency)
2) servicing more requests in parallel (i.e. greater concurrency)

You accomplish point 1 by increasing the speed limit or reducing time spent in the toll booths. You accomplish point 2 by adding highway lanes and adding toll booths.

To complete this mental model, we need to introduce some sort of contention for a shared resource. Let's imagine that as we go to multiple toll booths, each toll booth must now record the money collected using an accumulator so the greedy booth manager, Count D'Monet, knows the funds collected at any time. The booth sends a signal to the accumulator and cannot release the car until the accumulator signals that the fare has been recorded. Let's say the accumulator takes 4 seconds to perform its task and signal the booth to release the car.

With the new system, you arrive at a toll booth and have to wait 20 seconds for normal service, 4 seconds for the accumulator, AND some time waiting for the accumulator to service a car from the other lane (it is single-threaded). How much additional time waiting? It depends on arrival patterns, but let's say on average you arrive halfway through the accumulator servicing another car. That gives you 20 seconds of toll booth time, plus 2 seconds waiting for the accumulator to finish with the other car, plus 4 seconds for the accumulator to service your car. That's 26 seconds. This is slowing YOU down!

But the overall rate of cars clearing the toll booths is now 2 × 60/26, or about 4.6 cars/min. That's an improvement over the 3 cars per minute we had before, but not the factor of 2 you might expect from simply doubling the number of toll booths. Contention for shared resources is the counterbalance to parallel processing. Keep adding toll booths and soon you get NO additional capacity for your effort. There is a fairly simple math function (a Taylor series minus one of the terms) that estimates the throughput of an N-resource system given the throughput of 1- and 2-resource systems, but I digress. It's also the reason we don't have massively multi-core computer chips (not to be confused with massively parallel systems).
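Putting the arithmetic in one place (the 2-second figure is the average wait, assuming you arrive halfway through the accumulator's 4-second cycle):

\[
\begin{aligned}
\text{time per car at a booth} &= 20\text{ s} + 2\text{ s} + 4\text{ s} = 26\text{ s} \\
\text{system throughput} &= 2\text{ booths} \times \frac{60\text{ s/min}}{26\text{ s/car}} \approx 4.6\text{ cars/min}
\end{aligned}
\]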

I love this model because it illustrates the fundamental principles of queuing theory. It shows that, contrary to what has been said so far in this thread, your naive colleague could be right. But then again, even a broken clock is right twice a day.




Bernie Velivis, President Performax Inc