The most common flaw in software performance testing

How many times have we all run across a situation where the performance tests on a piece of software pass with flying colors on the test systems only to see the software exhibit poor performance characteristics when the software is deployed in production?

A lot of time is wasted on verifying (and re-verifying) the validity of the test, checking for hardware and network differences, and comparing hundreds of parameters on the test systems against production in the hope of finding a meaningful difference.

By far the biggest reason I have seen for this performance discrepancy is not a faulty test but a stress test executed on a wildly different data set than the one in production. In many cases the production and test data sets differ by orders of magnitude in size and richness.

A quick example to make the point

  • Let's say that there are 1 million users in production, of which 1,000 are using the system at any one time.
  • Let's say that the new system requirements (after some scalability refactorings) are to support 2,000 concurrent users.
  • This is typically simulated by creating 2,000 users and then scripting appropriate actions for these 2,000 users simultaneously. Typically all is great, everyone high-fives each other and the release is scheduled.

Technically the test is correct. It simulates 2,000 concurrent users. However, the data that the actions are performed on is almost 3 orders of magnitude larger in production (1 million users versus 2,000, a factor of 500).

It does not take much for a non-optimized SQL query or a full directory scan on a NAS to cause the slowdown in production. Work that grows with the size of the data set is barely noticeable against 2,000 rows and painful against 1 million.
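A minimal sketch of why such a query slips through a small-data test, using Python's built-in sqlite3 module. The `users` table, the `email` column and the index name are assumptions made up for illustration, not taken from any real system. Without an index the lookup is a full table scan, which is cheap on 2,000 rows and grows linearly with the table; the query plan shows the difference even when the timings on the test system look fine.

```python
import sqlite3


def build_users(n_users):
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        ((f"user{i}@example.com",) for i in range(n_users)),
    )
    return conn


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


LOOKUP = "SELECT id FROM users WHERE email = ?"

conn = build_users(2_000)  # test-sized data set

# No index on email: every lookup reads the whole table.
plan_before = query_plan(conn, LOOKUP, ("user42@example.com",))

# With an index the same lookup becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, LOOKUP, ("user42@example.com",))

print("before:", plan_before)
print("after: ", plan_after)
```

Checking the plan (or simply running the test against a production-sized table) catches this kind of problem before release; the 2,000-row test alone would not.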

Ensure that the data set you run your stress tests on is representative of the data set in your production system. This is a great way to gain confidence in your data storage layer, and it also tests how your software interacts with that layer under realistic conditions.

One Solution:

One simple and easy way to run meaningful performance tests is to take a snapshot of your production data (minus any personal or private information, of course) and to execute the stress test on this data set. I prefer to go one step further and make sure that the data set in the test system is two to three times the size of what is in production.
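A rough sketch of that idea in Python, under stated assumptions: the record layout (`id`, `name`, `email`, `plan`) and the helper names `scrub` and `scale` are hypothetical, and a real snapshot would be processed inside the database or an export pipeline rather than in memory. The sketch replaces personal fields with pseudonyms and then multiplies the snapshot by a chosen factor while keeping ids unique.

```python
import hashlib


def scrub(user):
    """Replace personal fields with a pseudonym derived from the id."""
    token = hashlib.sha256(str(user["id"]).encode()).hexdigest()[:12]
    return {**user, "name": f"user_{token}", "email": f"{token}@example.invalid"}


def scale(users, factor):
    """Yield `factor` scrubbed copies of every record, each with a unique id."""
    # Shift each copy's ids past the largest original id to avoid collisions.
    offset = max(u["id"] for u in users) + 1
    for copy in range(factor):
        for user in users:
            yield scrub({**user, "id": user["id"] + copy * offset})


# Hypothetical two-row "production snapshot".
snapshot = [
    {"id": 1, "name": "Alice Smith", "email": "alice@corp.example", "plan": "pro"},
    {"id": 2, "name": "Bob Jones", "email": "bob@corp.example", "plan": "free"},
]

test_data = list(scale(snapshot, 3))  # 3x the production volume
print(len(test_data), "records")
```

Non-personal fields such as `plan` are left untouched so that the distribution of the data stays close to production. Hashing a small integer id is not strong anonymization on its own; treat it as a placeholder for whatever masking your data protection rules require.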

13 Responses

  1. I would think performance tests are always run on a system which has a snapshot of production data. Not doing that would be plain dumb.

    The main reason we see differences between performance metrics on real production versus test runs is the difference in usage scenarios.

    • You would be surprised how often the rule of running performance tests on snapshots of production data is broken. In almost all cases where it is broken, the process of getting the production snapshot into the test environment is not automated. Typically a production snapshot is taken semi-manually and is then not refreshed frequently enough.

      • In my previous company we used to run our performance tests on production snapshot data, but one problem I found with OLTP systems is that the data becomes obsolete quickly. Lots of screens showed data from yesterday into the future and, this being an OLTP system, most of the data fell within the current week. After the original tests were run and the offshore team came back with all the fixes, the data was obsolete by the time the tests were run again. The point I want to make is that even if you get a production snapshot, you need to tweak it to simulate a real live scenario.

  2. Snapshots of production data can be either forbidden for data protection reasons, or impractical simply due to logistical reasons.

    We have a bit of both problems. Our production data are so large and change so often, we don’t actually have enough tin to copy them into a test system (although we could do a subset, it would immediately be out of date).

    So when trying to test stuff that involves large data, I have had to write a synthetic data driver, which essentially reproduces “production-like” data in a non-production system, in a configurable way with a specified volume. This is also better than production, where volume varies from day to day (making comparative testing tricky). The synthetic data driver can generate data significantly faster than I can copy it from production.

    • Thanks for the comment Mark.
      I understand that sometimes it becomes impractical to copy complete production data sets to the test systems. It's good to see that in your case ‘production-like’ data sets are generated. This is certainly a reasonable next-best approach when internal policies or logistics rule out snapshots of production data.

  3. > Snapshots of production data can be forbidden for data protection reasons

    If you’re on Oracle DB, you can use the Data Masking option for data protection. It takes the production data and scrambles it so that it is still representative but no longer real.

    > the main reason we see differences between performance metrics on real
    > production versus test runs is due to the difference in usage scenarios.

    If you’re on Oracle DB, that’s what Real Application Testing does. You can record real database “traffic” on production and replay it on the test system. A great way to test whenever you change something (DB upgrade/patching, hardware replacement, schema change, etc).

    Hope you will find that knowledge useful,

    DISCLAIMER: yes, I work for Oracle

    • Hi Michal,
      Recording real production traffic and replaying it on a test system, as you describe, is a great way to get as close to a production profile as possible. In the past we also looked at strategies to record front-end traffic and replay it at the front-end layer of the test systems. In theory that should simulate exact production traffic and exercise many more layers than just the database. We made some progress, but not as much as I would have liked.

  4. Another reason for the disconnect between in-the-lab stress tests and real-world deployments is the nature of the traffic that the front-end of your application has to manage.

    In the lab, it’s common to run tests over a fast, local network, often with heavy use of HTTP keepalives and simple client settings to help tune the application to get the best possible performance. You might get 2,000 transactions per second at 100% CPU and be happy that you’ve configured your app to the best of its capability.

    You’ll get a shock when you put the application live and the latency is terrible, TPS never exceeds 10% of what you got in the lab and the CPU never exceeds 20%. What gives – why is the performance so poor?

    On real-world networks, connections are slow, keepalives are held idle for many seconds, and jitter and packet loss further increase the duration of each connection. As connections take longer to complete, the number of concurrent connections grows, and with thread-based or process-based applications, concurrency is a real killer.

    One solution is to deploy a front-end proxy to convert these many slow connections with inefficient use of keepalives and HTTP options into a small number of fast, local connections, putting your application back into the environment it works best. That’s one of the things that products called ‘ADCs’ (‘Application Delivery Controllers’) do.

  5. I was asked to do a benchmark of a document server (get document by ID, basically). I was told to pick random ids and just blast it.

    I begged them not to have me do this. I told them this would prove absolutely nothing and would be worthless data and a waste of time. The top request rate I would get would be maybe 10% of what our barely loaded live servers got, and therefore incredibly obviously incorrect. The real live requests are very very cache-friendly.

    They said “Maybe. Do it anyway.”.

    My report ended up saying that the new hardware was benchmarked to top out at 10% of what the old hardware was doing already. We rightly ignored it and put it in production anyway. Worked like a charm, 5% load if that.

    One of the reasons I quit working there.

    • I wonder if you were able to run the same test on the old hardware, that is, to retrieve random documents. In your case you basically bypassed the cache layer, and it would have been interesting to see the delta between old and new hardware under that condition. Since it is indeed difficult to simulate traffic loads that exactly mimic production traffic, the ‘trend’ of the test results is sometimes useful. For example, from your test one could in theory state that “the new hardware is X% faster or more scalable than the old hardware when random IDs are retrieved”. It may not be an exact production simulation, but it at least gives a relative benchmark.

  6. Another problem is reproducing real user activity. Hitting the same pages over and over will accomplish nothing. Do it with and without sessions, going through geoip recognition or not, sending tracking cookies or not, being logged in or not, submitting forms with errors, and so on.

    Having a production-like DB size is important, but another important factor is being able to reproduce real-world activity.

  7. Yes, testing with production-level dataset, on the real production hardware, and really across the internet is the only way to know that your site will hold when you need it to. This is what we do here at SOASTA: massive-scale load testing, from the cloud!

  8. Thank you for the article. I enjoyed reading it. You have a very good site.
