Bottleneck identification is an art in performance testing, and you cannot be an effective performance tester without it. Certain result-analysis techniques give you a structured way to find a bottleneck and its cause. Before you start investigating a bottleneck, apply the WHIWH rule, which steers the identification process in the right direction.
Usually, the performance testing window in the project plan is shorter than the functional testing window and sits close to the Go-Live date. As a performance tester, you must not only complete the testing in that short period but also identify the bottlenecks. The WHIWH rule is very helpful in such a critical situation.
What does WHIWH stand for?
- What are the errors?
- How many errors?
- In which condition do errors come?
- When do errors come?
- How long do errors persist?
Explanation of WHIWH Rule:
1. What are the errors?
The first thing you need to check is the type of error; without knowing the type, you cannot investigate. During a performance test, you may get different types of errors, such as client-side errors and server-side errors. Identify the nature of the errors and categorise them to make the investigation easier. Some of the error categories are below:
- Scripting errors:
  - Old URLs not replaced in some places: This causes a particular transaction to fail.
    How to resolve: Parameterize all the URLs.
  - Lack of test data: Sometimes the tester forgets to set “recycle the test data list” or “continue with last value” in the parameter settings, which leads to either transaction failure or user exit.
    How to resolve: Keep more test data than the total number of iterations in the test. Also, use appropriate parameter settings, such as continuing with the last value or recycling the test data when it runs out.
- Testing Tool side errors:
  - LG failed during the test: This issue occurs when the load generators (LGs) are unable to handle the user load during the test.
    How to resolve: Before starting the test, calculate the required LG count using a calculator.
  - LG memory issue: This issue occurs when the LG does not have enough memory. The LG may get disconnected in the middle of the test.
    How to resolve: Check the available memory of the LG before starting the test.
- Client-side errors:
  - HTTP 4XX: Refer to the response code graph. An HTTP 4XX error in the test result indicates that something wrong has been sent to the server.
    How to resolve: Validate the test data, URL and correlation.
  - HTTP 4XX followed by 5XX: This is again a symptom of a client-side error, but not always. It is a tricky scenario in which you have to carefully analyse the earlier requests that returned a 4XX response code and the following request that returned a 5XX response code.
    How to resolve: Validate the test data, URL and correlation in the earlier request that returned 4XX.
- Server-side errors:
  - HTTP 5XX: Such status codes indicate a server-side issue. This could be due to server failure, application failure, server instance failure, gateway timeout etc.
    How to resolve: Carefully analyse the error, read the error description and check the server logs to find the exact cause of the issue.
  - Slow response from the server: If the server is unable to handle a large number of requests and its performance degrades, transaction response time suffers, which may lead to a ‘Timeout’. In Micro Focus LoadRunner, this error is highlighted as “Step download timeout”. The default value is 120 seconds, which you can increase as per your requirement.
    How to resolve: You would need a developer’s help to tune the application/system for the particular request/page/DB query.
- Network error:
  - Increase in latency: Refer to the network latency and throughput graphs. If network latency increases while the throughput graph stays flat, it could be a network bandwidth issue.
    How to resolve: Increase the network bandwidth.
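The categorisation above can be partly automated. The following is a minimal sketch, assuming you have exported the HTTP response codes of a test run in request order; the function names and the sample list are hypothetical and not taken from any testing tool. It counts client-side and server-side responses and flags the tricky "4XX followed by 5XX" case.

```python
from collections import Counter


def categorise(status_code):
    """Map an HTTP status code to a coarse error category."""
    if 400 <= status_code <= 499:
        return "client-side"
    if 500 <= status_code <= 599:
        return "server-side"
    return "no error"


def summarise(status_codes):
    """Count responses per category."""
    return Counter(categorise(code) for code in status_codes)


def find_4xx_then_5xx(status_codes):
    """Return the positions where a 5XX response directly follows a 4XX response."""
    return [
        i
        for i in range(1, len(status_codes))
        if categorise(status_codes[i - 1]) == "client-side"
        and categorise(status_codes[i]) == "server-side"
    ]


# Hypothetical response codes in request order (illustrative data only).
sample = [200, 200, 404, 500, 503, 200, 401]
print(summarise(sample))
print(find_4xx_then_5xx(sample))
```

Scripting, tool-side and network errors do not show up as status codes, so this only covers the client-side and server-side categories; the others still need the load tool's own logs and graphs.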
2. How many errors?
The more errors in the test, the more unstable the application.
The percentage or count of errors matters in the result analysis. Some applications tolerate an error rate of 3-5%, whereas some banking systems have 0% error tolerance. In that case, the performance test result needs to be analysed thoroughly, and all the NFRs and defined SLAs should be met in the test. It is always advisable to report the actual test result rather than round the figures or ignore the error count, especially when the errors are few.
Example: As per the NFR, CPU utilisation should be less than 70%. If during the test you observe an average CPU utilisation of 69.8% for a long duration along with a few errors, highlight it to the BA or developer (because you have 0% error tolerance). The CPU utilisation may cross 70% on the production server.
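The error percentage is simple arithmetic, but it is worth computing from the raw counts instead of estimating it. The sketch below assumes you know the number of failed and total transactions; the function names and the sample figures are hypothetical.

```python
def error_rate_percent(failed, total):
    """Return the error rate as a percentage of all transactions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


def within_tolerance(failed, total, tolerance_percent):
    """True when the observed error rate does not exceed the agreed tolerance."""
    return error_rate_percent(failed, total) <= tolerance_percent


# Hypothetical run: 12 failed transactions out of 4000.
rate = error_rate_percent(12, 4000)
print(f"Error rate: {rate:.2f}%")
print("Within 5% tolerance:", within_tolerance(12, 4000, 5.0))
print("Within 0% tolerance:", within_tolerance(12, 4000, 0.0))
```

The same run passes a 5% tolerance and fails a 0% tolerance, which is why a small error count cannot be ignored for a zero-tolerance application.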
3. In which condition do errors come?
Analyse the condition in which the errors occur: during ramp-up, at steady state or in a prolonged test.
Generally,
- Errors during the ramp-up period are due to the server's inability to handle a gradually increasing load.
- Errors during steady state are due to queue pile-up, server unavailability or network bandwidth issues.
- When errors occur in a prolonged test (Endurance Test), a memory leak could be one of the reasons.
The above errors and their causes are very generic and could be different in your case, so analyse the result thoroughly before confirming a cause.
How to resolve: Monitor the servers using a server monitoring tool or an APM tool. Heap dumps, thread dumps and GC logs give you more clarity on the root cause. Also, analyse the server logs with the help of a developer to identify the exact bottleneck.
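To see in which phase the errors cluster, you can tag each error with the test phase it fell into. This is a minimal sketch, assuming error times are available as seconds elapsed since the test started and that you know when ramp-up and steady state ended. The function names, the sample times and the "ramp-down" label are assumptions added for illustration.

```python
from collections import Counter


def phase_of(elapsed_s, ramp_up_end_s, steady_end_s):
    """Return the test phase for an error at the given elapsed time (seconds)."""
    if elapsed_s < ramp_up_end_s:
        return "ramp-up"
    if elapsed_s < steady_end_s:
        return "steady-state"
    return "ramp-down"  # assumption: anything after steady state


def errors_by_phase(error_times_s, ramp_up_end_s, steady_end_s):
    """Count errors per test phase."""
    return Counter(phase_of(t, ramp_up_end_s, steady_end_s) for t in error_times_s)


# Hypothetical test: 10 min ramp-up, steady state until minute 70.
error_times = [45, 130, 580, 900, 2400, 4300]
print(errors_by_phase(error_times, ramp_up_end_s=600, steady_end_s=4200))
```

A count that is heavily skewed towards one phase only tells you where to look first; the cause still needs to be confirmed from the server monitoring data and logs.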
4. When do errors come?
The exact time helps you to identify the cause of the bottleneck while searching through the server logs. You can merge the error graph with other graphs to see the impact of a particular error on the test at a particular time. Developers expect a summarised report in which the exact error timings are mentioned. This saves their time and helps them resolve the issue as quickly as possible.
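A per-minute error summary is one simple form such a report can take. The sketch below assumes the error timestamps are available as Python `datetime` objects; the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(timestamps):
    """Count errors per clock minute; returns (minute, count) pairs in time order."""
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return sorted(buckets.items())


# Hypothetical error timestamps taken from a test result.
errors = [
    datetime(2024, 1, 15, 10, 5, 12),
    datetime(2024, 1, 15, 10, 5, 48),
    datetime(2024, 1, 15, 10, 6, 3),
    datetime(2024, 1, 15, 10, 31, 27),
]
for minute, count in errors_per_minute(errors):
    print(minute.strftime("%H:%M"), count)
```

The minute values can be handed to the developer as the time windows to search in the server logs.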
5. How long do errors persist?
Check whether the errors last for a short duration or a longer duration. Short-duration (single-peak) errors may be caused by user exit, wrong test data in some entries, a server failure followed by a quick recovery (you can see a spike in the response time), or back-end server activity in a shared test environment. Longer-duration errors may be caused by server failure, DB failure, queue pile-up, response timeout, LG failure or network issues in communicating with the LG.
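To measure how long errors persist, you can group error times into bursts and look at each burst's duration. The following is a rough sketch, assuming error times are given as seconds elapsed since the test started. The 10-second gap and the 60-second "sustained" threshold are arbitrary assumptions to tune for your own test, and the function names are hypothetical.

```python
def error_bursts(error_times_s, max_gap_s=10):
    """Group error times into bursts; errors at most max_gap_s apart share a burst."""
    bursts = []
    for t in sorted(error_times_s):
        if bursts and t - bursts[-1][1] <= max_gap_s:
            bursts[-1][1] = t  # extend the current burst
        else:
            bursts.append([t, t])  # start a new burst
    return [(start, end) for start, end in bursts]


def classify_burst(start, end, sustained_after_s=60):
    """Label a burst as a short 'spike' or a 'sustained' error period."""
    return "sustained" if end - start >= sustained_after_s else "spike"


# Hypothetical error times (seconds since test start).
times = [120, 122, 125, 600, 605, 612, 620, 628, 636, 645, 655, 664]
for start, end in error_bursts(times):
    print(start, end, classify_burst(start, end))
```

A spike points towards the short-duration causes listed above, while a sustained burst points towards the longer-duration ones.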
The WHIWH rule is a simple technique for root cause analysis during the performance testing phase. Applied properly, it helps you reach accurate results quickly.