Posted on July 15, 2010 by Ashish Soni

This Week in Cloud Computing: Interview & Demo with Ringio President & CTO Ashish Soni
I just did a live screencast interview and demo of the Ringio product with Amanda Coolong on the cloud computing channel of the ThisWeekIn web TV network.
One of the takeaways I had was that the idea of Ringio being an intelligent Google Voice for business resonates. People get it.
The broader pattern is that when describing anything new, it is easier to describe it relative to something that is already reasonably well understood.
For techies, this is the same paradigm that is used in Relative Estimation in Agile Development.
But I digress. Thanks to Amanda for being a great host. You can see the screencast here.
Filed under: Cloud Computing, Ringio
Posted on July 8, 2010 by Ashish Soni

High Availability In the Cloud
(Editor’s Note: This has been cross-posted from the Ringio Blog)
The typical factors considered when evaluating the ROI of the cloud compared with traditional data centers are:
- Machine utilization, Elastic Demand and Auto Scaling. Most services do not need all servers all the time. The cloud allows you to scale up and down and reduce cost at low demand times. This is especially well described in Joe Weinman’s cloudonomics blog where he states that “even if cloud services cost, say, twice as much, a pure cloud solution makes sense for those demand curves where the peak-to-average ratio is two-to-one or higher.” In a traditional setting you would have to be provisioned for peak demand all the time.
- Power. In physical data centers the cumulative power cost outweighs the machine cost somewhere in year 2 or 3 onward. In the cloud – the unit machine cost includes the power cost.
- Human Resources to run the data center. Because there is no physical work such as cabling, fewer personnel are expected to be needed to support data center operations in the cloud.
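The peak-to-average argument in the first factor can be checked with a little arithmetic. The sketch below is only an illustration: the function name `total_cost`, the two demand curves, and the unit prices are hypothetical, and power and personnel costs are ignored.

```python
def total_cost(hourly_demand, unit_cost, provision_for_peak):
    """Cost of serving a demand curve given as servers needed per hour."""
    if provision_for_peak:
        # Traditional data center: provisioned for peak demand all the time.
        return max(hourly_demand) * len(hourly_demand) * unit_cost
    # Cloud: pay only for the servers actually in use each hour.
    return sum(hourly_demand) * unit_cost


def peak_to_average(hourly_demand):
    return max(hourly_demand) / (sum(hourly_demand) / len(hourly_demand))


# Hypothetical one-day demand curves (servers needed in each hour).
spiky = [10] * 18 + [40] * 6   # peak-to-average ratio above 2
flat = [30] * 18 + [40] * 6    # peak-to-average ratio below 2

for name, demand in (("spiky", spiky), ("flat", flat)):
    owned = total_cost(demand, unit_cost=1.0, provision_for_peak=True)
    cloud = total_cost(demand, unit_cost=2.0, provision_for_peak=False)  # cloud costs twice as much per unit
    print(name, round(peak_to_average(demand), 2), owned, cloud)
```

Under these assumptions the cloud is cheaper for the spiky curve even at twice the unit price, and more expensive for the flat one, which is the break-even the quote describes.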
However, in addition to the factors above, a more thorough analysis also needs to look at the out-of-the-box high availability tool set that the cloud provides.
The cloud provides the following benefits in the High Availability arena at a fraction of the cost of a traditional data center setting.
- Geographic Redundancy and Diversity
  - The Cloud Way: The cloud out of the box gives you the power to instantiate your services in geographically diverse regions.
  - The Traditional Data Center: Doing the same with your own data centers is very costly, time consuming and resource intensive. In addition, you need to pay for inter-site connections and point-to-point links. This can cost hundreds of thousands of dollars extra. You also end up provisioning 2x your peak demand in total, since any one data center must be able to take over all of your traffic.
- Point In Time Data Snapshots
  - The Cloud Way: The cloud gives you block storage with snapshotting, which allows for data protection and point-in-time recovery.
  - The Traditional Data Center: The same capabilities can cost tens of thousands of dollars or more when implemented at local data centers by way of SANs and NASs.
- Shared Data across Multiple Geographic Regions
  - The Cloud Way: With highly available, network-attached-like storage your data is available in multiple regions. In addition, block data can also be snapshotted and made available across multiple regions.
  - The Traditional Data Center: You need to work with ISPs on bandwidth and Point To Point (P2P) connections and then worry about the availability of these P2P connections. You need to purchase SANs and NASs at each site for the data to be replicated.
- Load Balancing & Basic Monitoring
  - The Cloud Way: The cloud provides this at a basic level and gets you off the ground quickly.
  - The Traditional Data Center: You don’t need to implement your own solution unless you really need to do some advanced load balancing, for example.
One of the key disruptions of cloud computing has been the commoditization of High Availability offerings, some of which are mentioned above. This is a boon for all startups like Ringio and for any existing enterprise that is looking at the cloud and is also looking at high availability.
In my view, when comparing the two approaches, the cloud truly shines not just on the cost to run a service but on the cost to run that service in a highly available manner.
photo credit – calle vieja blog.
Filed under: Cloud Computing, High Availability | 2 Comments »
Posted on April 19, 2010 by Ashish Soni
There is a reason why I have been missing in action over the last 2 months: my partners, Sam Aparicio and Michael Zirngibl, and I have been in overdrive getting ready for the launch of our new venture, Ringio.com.
Well, launch day is finally here today! As the press releases are beginning to hit the wires we are seeing traffic and users increase. The excitement is palpable. How will the business do? How will the system do? How is the cloud behaving? What feedback are we going to get? These questions and many more are racing through our heads.
Much more to come in terms of an insider’s perspective on the launch… Stay Tuned!
Filed under: Ringio
Posted on February 24, 2010 by Ashish Soni

In my earlier post about concurrency control I mentioned that, because response time grows exponentially with increasing concurrency, it is better to reject the extra requests than to let them negatively affect your system. While rejecting these requests is an option, it is not the most desirable one. Request queueing offers a much more powerful solution, as explained below.
Request Queueing Example
For simplicity’s sake, let’s assume the following base response time characteristics for a system with no concurrency controls or request queueing:
- at a concurrency of 500 or below the average response times are 1 second.
- at a concurrency of 600 the average response times are 2 seconds for all requests.
- at a concurrency of 700 the response times are 4 seconds for all requests.
The exponential degradation starts after a concurrency of 500. With the above numbers in mind, let’s assume that you have set up your concurrency controls optimally such that 500 concurrent requests can get through and the next 1000 requests (requests 501-1500) are set up to wait. That is, the Max Queue Size is set to 1000.
Now let’s say you get a sudden traffic spike of 1501 requests. The following is what will happen:
- The first 500 requests will get through and be served on average in 1 second. The next 1000 requests (requests 501-1500) are queued. The 1501st request is rejected.
- As requests 1-500 finish the requests 501-1000 are sent forward. So requests 501-1000 will be served in 2 seconds total. (1 second execution time and 1 second queue time)
- As requests 501-1000 finish, requests 1001-1500 are forwarded on. So requests 1001-1500 will be served in 3 seconds total. (1 second execution time and 2 second queue time)
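The walk-through above can be reproduced with a small model. This is a sketch under simplifying assumptions: all requests arrive at once, every admitted request takes exactly the 1 second average, and the function names are made up for illustration. The no-queueing curve simply doubles the response time for every 100 requests of concurrency above 500, which matches the three data points given earlier but is an extrapolation beyond them.

```python
def queued_response_times(num_requests, max_concurrency=500,
                          max_queue=1000, service_time=1.0):
    """Return (response_times, rejected_count) for a burst arriving at once."""
    accepted = min(num_requests, max_concurrency + max_queue)
    rejected = num_requests - accepted
    times = []
    for i in range(accepted):
        # Batch 0 is requests 1-500, batch 1 is requests 501-1000, and so on.
        batch = i // max_concurrency
        # Each earlier batch adds one service_time of waiting in the queue.
        times.append((batch + 1) * service_time)
    return times, rejected


def unqueued_response_time(concurrency):
    """Assumed model: 1 second up to 500, doubling for every extra 100."""
    return 2 ** max(0, (concurrency - 500) / 100)


times, rejected = queued_response_times(1501)
print(times[0], times[500], times[1000], rejected)
print(unqueued_response_time(600), unqueued_response_time(700))
```

With queueing, the slowest admitted request in the spike waits 3 seconds in total, while the assumed unqueued curve is already at 4 seconds for a concurrency of only 700.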
Characteristics of Systems With Request Queueing
- Request queueing allows your system to operate at optimal throughput. In the above example, the optimal throughput was at a concurrency of 500. The concurrency at which optimal throughput is achieved is usually right below where exponential degradation starts to take place. At all times the system was operating at this optimal throughput.
- Your users only experience linear degradation instead of exponential degradation. As shown in the diagram, with no request queueing your users and your system would have experienced exponential degradation after 500 requests. With request queueing the response times become linear: requests 1-500 take 1 second, 501-1000 take 2 seconds, 1001-1500 take 3 seconds, and so on.
- Your system experiences NO degradation – This is worth repeating. The system is always operating at an optimal throughput. The only attribute that is dynamic is the queue size. The system remains in the green zone as highlighted in the diagram.
If even during times of unusually high load your system can show the above 3 characteristics you are in good shape.
HAProxy is one solution that can do both concurrency control and request queueing, and it should be considered for high availability.
Filed under: High Availability, Software As A Service, operations | Tagged: concurrency control, ha, request queue | 7 Comments »
Posted on February 5, 2010 by Ashish Soni
One important high availability principle is concurrency control. The idea is to allow through only as much traffic as your system can handle successfully. For example, if your system is certified to handle a concurrency of 100, then the 101st request should either time out, be asked to try later, or wait until one of the previous 100 requests finishes. The 101st request should not be allowed to negatively impact the experience of the other 100 users. Only the 101st request should be impacted.
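As a minimal sketch of the "ask the extra request to try later" option, the limiter below admits at most a fixed number of requests at a time and turns the rest away without touching the ones in flight. The class name `ConcurrencyLimiter`, the `try_handle` method and the `REJECTED` marker are hypothetical names for illustration. In practice this control usually sits in a proxy or load balancer in front of the system, not in application code.

```python
import threading

REJECTED = "rejected: try again later"


class ConcurrencyLimiter:
    """Admit at most max_concurrency requests at any one time."""

    def __init__(self, max_concurrency):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def try_handle(self, handler):
        # Non-blocking acquire: when every slot is taken, only the extra
        # request is affected; requests already in flight are untouched.
        if not self._slots.acquire(blocking=False):
            return REJECTED
        try:
            return handler()
        finally:
            # Free the slot even if the handler raised.
            self._slots.release()


limiter = ConcurrencyLimiter(max_concurrency=100)
print(limiter.try_handle(lambda: "served"))
```

Passing `blocking=True` with a timeout to `acquire` would give the "wait until a previous request finishes" behaviour instead of an immediate rejection.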
Why Concurrency Control?
Most systems exhibit response time patterns that rise exponentially after some critical concurrency limit (the escape concurrency) for the system has been reached, as shown in the diagram. With concurrency control you can protect the system from entering the Critical Zone where the exponential response time behavior is experienced.
Benefits of Concurrency Control:
- Avoiding cascading failures. Imagine a sudden surge in traffic which goes above your system limits. Let’s say that your database becomes the bottleneck. As the database slows down, threads on the Application Servers will begin to pile up. Now the layer above your Application Servers (your web servers/load balancers) will start to pile up threads as well. Very soon your system reaches a state where every layer of the architecture has been compromised: the system has reached its escape concurrency and you have lost control of it. By controlling the amount of traffic that you take in, you avoid this situation.
- Flexibility. The ability to control the amount of traffic that your system receives is powerful. For example, after a release you realize that in production the scalability profile of your system has changed for the worse. You were initially certified for a certain concurrency but all metrics are pointing to the fact that the system will not handle this load. Now you can reduce the concurrency allowed. While this is not an ideal situation, it is much preferred to a system meltdown.
- Guarantee of Service Quality. One can be more confident in the Quality of Service being provided to the requests that are allowed through.
One Solution
HAProxy is a good place to start investigating. It offers a myriad of features such as load balancing and content switching in addition to concurrency control. Concurrency limits can be set at a global level and/or at a server level. Importantly, it also offers request queuing. This is extremely powerful when used in concert with concurrency control. More on this in a later post. I recommend that you look at HAProxy when designing high availability systems.
Concurrency controls should be put in place in concert with a capacity planning program. If system concurrency limits are indeed being approached, you had better start to scale the system and then increase the concurrency allowed.
Filed under: High Availability, Software As A Service, operations | Tagged: ha, operations, saas | 3 Comments »
Posted on February 1, 2010 by Ashish Soni

Organization Impact of Poor Software Quality
Most companies look at software quality through one of the following 2 lenses:
- A Technology Executive may look at software quality to determine whether the software is in a fit state to be released.
- A COO may look at software quality from the angle of customer churn. What percentage of customer churn was caused by poor software quality?
Both of the above views are indeed critical. However one angle that typically does not get attention is the organizational impact of software quality. As shown in the diagram poor quality can have a reverberating impact across the whole organization.
The diagram highlights:
- Poor quality in production causes the whole company to be interrupt driven. As and when the customers experience poor quality, the whole organization experiences the pain. You lose control of your schedule. You are on the customers’ bug discovery schedule. This is the main reason for schedule slips of currently ongoing projects. People lose focus on the task at hand and are thrust into production patches or production crisis resolution.
- Poor quality affects each and every department. The diagram shows how Customer Support, Sales, Quality Engineering, Engineering, Operations and Product Management are all affected. Each one of these departments is interrupted from its current focus.
- Poor quality leads to an order of magnitude more work for the organization. If an issue could have been found and resolved within the Quality/Engineering teams, at most this would have been a 2 or 3 step process encompassing 1-2 groups. However, once the issue is exposed to the customer this becomes a 13 step process (at best, assuming that every one of the 13 steps goes without a hitch) encompassing the whole organization.
My Recommendation (2 Critical Metrics):
- Number of patches required after release. Each time a patch is released the whole organization jumps through hoops, as highlighted in the diagram. If this number is not brought down to zero (or close to it) you will be falling behind on your next release. You will be in a vicious cycle. Start to measure this and bring it down. This is one of my most important metrics when measuring the job that quality does.
- Opportunity cost per Patch. What is the man week cost of a patch on average for the organization? This will give an idea of the scale of the problem. Bring this down with automation.
By measuring the above metrics you can plan better by taking into account your historical patch cycle times and patch frequencies. The metrics can also help highlight the importance of quality across the organization to non-technology executives.
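Both metrics are simple to compute once each patch is logged with the effort it consumed across the organization. The sketch below assumes a hypothetical record format (release name mapped to a list of man-week costs, one entry per patch), and the numbers are made up purely for illustration.

```python
from statistics import mean


def patch_metrics(patch_costs_by_release):
    """Return patches per release and the average man-week cost per patch.

    patch_costs_by_release maps a release name to a list with one
    man-week cost entry per patch shipped after that release.
    """
    patches_per_release = {
        release: len(costs) for release, costs in patch_costs_by_release.items()
    }
    all_costs = [c for costs in patch_costs_by_release.values() for c in costs]
    # With no patches at all the opportunity cost is zero.
    avg_cost = mean(all_costs) if all_costs else 0.0
    return patches_per_release, avg_cost


# Hypothetical history: three patches after release 1.0, one after 1.1.
history = {"1.0": [3.0, 2.0, 4.0], "1.1": [2.5]}
counts, avg_cost = patch_metrics(history)
print(counts, avg_cost)
```

Tracking the two numbers release over release shows whether the patch count is actually trending toward zero and how much capacity each patch takes away from the next release.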
Filed under: Management, Quality, Software As A Service | Tagged: leadership, mgmt, Quality, saas | 3 Comments »
Posted on January 3, 2010 by Ashish Soni
When it comes to managing and leading teams, there are some principles that can be borrowed from managing highly available systems. As an Operations Manager, your job is to keep the systems up and running. As a manager of your team, your job is to consistently keep your team executing at a high level. A few things to think about:
- Can you handle a person or two leaving from your team with minimal effect?
- Can your other resources handle the increased workload/pressure that may come due to the above departure?
- What is your MTTR (Mean Time to Recovery) to get a replacement resource?
- Is there a cluster of critical skills that only live with 1 or 2 people? What would you do if they left or threatened to leave?
- How do you re-energize your team regularly? (akin to server restarts)
A few ways to make your Team Fault Tolerant :
- Make sure that you spend time adding processes and sharing programs so that you can survive an employee departure or two.
- Ensure that you have a great training and on-boarding program for new employees as this program will be used regularly.
- Have a skill matrix of your team (more on this in a later post) to help find skill risk hotspots for your team.
The median length of employment is 4.1 years, as mentioned here. This means that if you have a team of 10 people, you should expect a churn of 2-3 people per year on average. The statistics tell us that churn is expected, but when an employee departure actually occurs a manager is typically caught off guard. Readiness to handle the eventuality of employee churn with minimal impact to the team is expected of a good manager.
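The 2-3 people per year figure follows from a back-of-the-envelope calculation. The sketch below assumes a steady state in which the 4.1 year median can stand in for the typical tenure, so each person has roughly a 1 in 4.1 chance of leaving in any given year. The function name is hypothetical.

```python
def expected_departures_per_year(team_size, typical_tenure_years):
    """Rough steady-state estimate of yearly departures from a team."""
    return team_size / typical_tenure_years


churn = expected_departures_per_year(team_size=10, typical_tenure_years=4.1)
print(round(churn, 1))
```

For a team of 10 this lands between 2 and 3 departures per year, which is why a replacement and on-boarding plan should be treated as a regular process and not an exception.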
Filed under: Management | Tagged: leadership, mgmt, team | 1 Comment »
Posted on December 16, 2009 by Ashish Soni
How many times have we all run across a situation where the performance tests on a piece of software pass with flying colors on the test systems only to see the software exhibit poor performance characteristics when the software is deployed in production?
A lot of time is wasted on verifying (and re-verifying) the validity of the test, checking for hardware/network differences, and checking hundreds of parameters on the test systems versus production in the hope of finding a meaningful difference.
By far the biggest reason I have seen for the performance discrepancy above is not a faulty test but the stress test being executed on wildly different data sets than what is in production. The data sets in production and test systems are in many cases orders of magnitude different in size and richness.
A quick example to make the point:
- Let’s say that there are 1 million users in production, of which 1,000 are using the system at any one time.
- Let’s say that the new system requirements (after some scalability refactorings) are to support 2,000 concurrent users.
- This is typically simulated by creating 2,000 users and then scripting appropriate actions for these 2,000 users simultaneously. Typically all is great, everyone high fives each other and the release is scheduled.
Technically the test is correct. It simulates 2,000 concurrent users. However, the data that the action is performed on is almost 3 orders of magnitude greater in production (1 million users versus two thousand).
It does not take much for a non-optimized SQL query or a full directory scan on a NAS to cause the slowdown in production.
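To put a number on the gap, the sketch below uses a deliberately crude cost model: every concurrent user triggers one non-optimized query that has to examine every row of the users table. The function name and the model are assumptions for illustration; a real query planner, cache or index changes the picture.

```python
import math


def rows_examined(total_rows, concurrent_users):
    """Rows touched when each concurrent user triggers one full table scan."""
    return total_rows * concurrent_users


test_work = rows_examined(total_rows=2_000, concurrent_users=2_000)
prod_work = rows_examined(total_rows=1_000_000, concurrent_users=2_000)

ratio = prod_work // test_work
print(ratio, round(math.log10(ratio), 1))
```

The same 2,000 concurrent users cause 500 times more work on the production data set, or about 2.7 orders of magnitude, and none of that shows up in a test that only has 2,000 rows to scan.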
Ensure that the data set you run your stress tests on is representative of the data set in your production system. This is a great way to gain confidence in your Data Storage Layer and this also appropriately tests how your software interacts with this layer.
One Solution:
One simple and easy way to run meaningful performance tests is to take a snapshot of your production data (minus any personal/private information of course) and to execute the stress test on this data set. I prefer to go one step further still and make sure that the data set in the quality system is 2x or 3x of what is in production.
Filed under: Quality, Scalability | Tagged: Quality, testing | 13 Comments »
Posted on December 11, 2009 by Ashish Soni
I came across this post on the perfect interview test, and I have to say that the thought of interviewers secretly recording candidates’ road-crossing habits is quite hilarious.
The idea that a set of random people crossing the street (literally in this case) would be better hires than the selected people after a little more ‘conventional’ interview process means that the hiring manager is actually a liability during the interview process. This means that the average and above average people are weeded out by the conventional interview process of the hiring manager.
In addition, the post attributes the above strategy to the people who wrote the book titled I Hate People. Obvious question: Why are you hiring in the first place then? You hate people.
If you have a hiring manager who is thinking about following the above strategy, I have the following 2 recommendations:
- Hire the people that this hiring manager usually rejects. As shown above, the average and above average people are weeded out by the hiring manager, so hire the weeded out people in this case. OR
- Replace the hiring manager with a random hiring manager crossing the street. Chances are that this would be an upgrade.
Filed under: Uncategorized | Tagged: funny, hiring, interview | 1 Comment »
Posted on December 1, 2009 by Ashish Soni
In many instances I have heard people discussing and then trying to measure the mean time between failures (MTBF) of components in their architectures. While this may be an interesting exercise, it is typically misguided.
The focus needs to be toward measuring the mean time to recovery (MTTR) after failure. Assume the failure. Going through this exercise very quickly puts the critical issues at the forefront.
For example, one may have the following analysis:
- Java OOM (out of memory): MTBF = 1 week. MTTR = ZERO (the load balancer redirects traffic to another node and the OOM node is restarted automatically).
- One critical data pipe with a 99.9% SLA provided. Let’s say that MTBF = 5 years. This means that if and when the downtime occurs, say once every 5 years, you could theoretically be down for roughly 40 hours and still be at the 99.9% SLA.
Would you be OK with an MTTR of 40 hours? Probably not. Imagine that this 0.1% failure case happened tomorrow and then for the next 5 years all is good.
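The downtime budget hidden in an SLA is easy to compute. The sketch below assumes 365-day years and a hypothetical helper name. Straight arithmetic gives a figure slightly above the roughly 40 hours used in the example.

```python
HOURS_PER_YEAR = 365 * 24  # assumes non-leap years


def allowed_downtime_hours(sla_percent, years):
    """Total downtime that still satisfies the SLA over the given period."""
    return (1 - sla_percent / 100) * years * HOURS_PER_YEAR


budget = allowed_downtime_hours(sla_percent=99.9, years=5)
print(round(budget, 1))
```

If that whole budget is spent in a single outage, the provider has still met a 99.9% SLA measured over the 5 years, which is exactly the high MTTR case to plan for.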
Typically what happens is that we deal with what we see and feel. In this case the weekly out of memory issue would be getting all the attention. However, be careful of the silent but business-crushing high MTTR.
Go through an MTTR exercise for ALL aspects of your architecture and business. The path to high uptime will become clear.
Filed under: High Availability, Software As A Service | Tagged: ha, operations, saas | 5 Comments »