Debugging Performance Bottlenecks

Once performance testing has surfaced issues, you need an approach for debugging the bottlenecks behind them. Debugging a web application that crashes under high load calls for a methodical approach to finding and fixing the underlying issues. The following steps help you debug performance issues effectively.

Performance Bottlenecks Debugging Approach

Analyse the result: Result analysis is the first step in identifying a performance bottleneck. The question is which metrics you need to concentrate on.
- Response Time:
  - Compare the actual response time observed in the test with the response time NFR
  - Identify the requests or transactions that breached the response time NFRs
  - Open the response time graph
  - Filter the graph with the failed request or transaction name
  - Analyse the graph to see whether the response time NFR was breached throughout the test or only at a specific point.
    - Case A: Response Time NFR breached throughout the test.
      - Possible Reasons:
        - The web page calls an untuned SQL that delays the response
        - Unoptimized JavaScript
        - Heavy uncached Image
        - Length of Data/Content e.g. a thousand lines of list
        - Downloading Scenario e.g. PDF download
    - Case B: Response Time NFR breached at a specific time.
      - Possible Reasons:
        - Garbage Collection
        - Heap Memory Issue
        - Network Bandwidth Issue
        - Reduction in request/transaction throughput
- Request Throughput (Transaction Count):
  - Compare the actual transaction count observed in the test with the defined number of transactions. For example, if the target is 50 orders per hour, the ‘Submit Order’ transaction in the script should be executed 50 times during the 1-hour run.
  - If the throughput does not match, the possible reasons are:
    - High Response Time
    - Impact of constant think time or pacing on TPS rate
    - Little’s Law miscalculation
- User Load:
  - Refer to the user load graph and check whether the active user (thread) count is constant throughout the test or not.
  - A sudden drop in user load indicates that the virtual users stopped or completed their iterations before the defined duration.
    - Possible Reasons:
      - Application Crashed
      - No think time or pacing
      - Issue with the Load Generator
      - Network connectivity issue between the Load Generator and the application under test (AUT), which caused the test to stop.
- Error Rate:
  - Refer to the error graphs which provide details about the type of errors and the time when they occurred.
  - Error graphs help categorize errors by response code. An HTTP 4XX response code points to a client-side problem such as a wrong URL, header, or input value. An HTTP 5XX response code directs the investigation towards the server side.
  - Possible Reasons:
    - Functionality issue (for 100% failure rate)
    - Scripting Issue (for 100% failure rate)
    - Application Failure
    - Stub Failure
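The throughput check above lists a Little’s Law miscalculation as a possible reason for a mismatch. Little’s Law relates the number of virtual users N to the throughput X and the time one iteration takes: N = X × (R + Z), where R is the response time and Z is the think time plus pacing. The sketch below applies it to the 50-orders-per-hour example. The class name, method names, and the 12-second and 60-second timings are assumptions chosen for illustration, not values from a real test.

```java
/** Little's Law sketch: N = X * (R + Z). All names and timings are illustrative. */
final class LittlesLawCheck {

    /** Virtual users needed to reach a target transaction count per hour. */
    static long requiredUsers(long targetPerHour, double responseSeconds, double thinkSeconds) {
        // One iteration takes response time plus think time/pacing.
        double cycleSeconds = responseSeconds + thinkSeconds;
        // X = targetPerHour / 3600 transactions per second, so N = X * cycleSeconds.
        return (long) Math.ceil(targetPerHour * cycleSeconds / 3600.0);
    }

    /** Transactions per hour that a fixed number of users can deliver. */
    static double achievablePerHour(long users, double responseSeconds, double thinkSeconds) {
        return users * 3600.0 / (responseSeconds + thinkSeconds);
    }

    public static void main(String[] args) {
        // Target: 50 orders per hour, assuming 12 s response time and 60 s think time plus pacing.
        System.out.println(requiredUsers(50, 12.0, 60.0));      // prints 1
        // If the response time degrades to 30 s, the same single user delivers fewer orders.
        System.out.println(achievablePerHour(1, 30.0, 60.0));   // prints 40.0
    }
}
```

The second call shows why a high response time and a constant think time together pull the transaction count below target: the iteration takes 90 seconds instead of 72, so one user completes only 40 orders in the hour.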
Reproduce the Issue: This step is optional. It is needed mainly when you do not have server logs, profiler output, or monitoring stats. In that case, the development team has to replicate the issue in a scaled-down environment. They can use a tool such as Apache JMeter to simulate heavy traffic and identify the specific conditions that trigger the crash or bottleneck, which narrows the scope of the investigation. If sufficient server logs, dumps, profiler output, and monitoring graphs from the test period are available to the development team, this step is not required.
Investigation of Root Cause: Root cause analysis depends on the nature of the error, and also on your skill, experience, and functional knowledge of the application. You can investigate some common issues with the following generic approaches.
- High Spike in CPU: Possible root causes are:
  - Repeated Full GC: Use a heap dump or thread dump to identify the root cause of the problem and fix it.
  - Non-Terminating Loops: When the CPU maxes out and utilization does not come down, take two thread dumps about 10 seconds apart while the problem is happening. Note every thread in the “runnable” state in the first thread dump, then compare the state of the same threads in the second thread dump. If those threads are still “runnable” within the same method in the second dump, that indicates which part of the code is looping indefinitely. Once you know which code is looping, the fix is usually straightforward.
  - Non-synchronized access to java.util.HashMap: If the thread dumps show threads stuck in HashMap’s get() or put() methods, it indicates that concurrent, unsynchronized access to the HashMap is causing the CPU spike. In that case, replace the HashMap with a ConcurrentHashMap.
  - Long Running Process: Often, just one or two processes are responsible for the increased CPU load. For example, a process could be in an uninterruptible state and increase the load by keeping other processes waiting. When the CPU becomes overloaded, identify such processes and terminate or restart them. Also, keep an eye on them in future tests.
- A sudden drop in CPU: Possible reasons:
  - Application Crashed
  - Network Bandwidth Issue
  - Load Balancer Issue: Check whether the CPU utilization of some servers is very high or very low compared to the others.
- Gradual Drop in CPU or Low CPU Utilization: Possible reasons:
  - Insufficient RAM or slow Hard Disk Drive: If the system does not have enough RAM or the hard drive is slow, the CPU spends its time waiting on paging or disk I/O instead of doing work. This shows up as lower CPU usage.
  - Overclocking: If the CPU has been overclocked beyond its limits, it can cause instability, which may show up as lower CPU usage.
- Memory Leak: Possible reasons and signs:
  - Unintentionally retained objects: Objects that are no longer needed but are still referenced cannot be garbage collected. The java.lang.ref package lets you work with the garbage collector in your program: instead of referencing objects directly, you can use special reference objects (such as weak or soft references) that the garbage collector is able to clear.
  - java.lang.OutOfMemoryError: A common sign of a leak is java.lang.OutOfMemoryError. Its detail message (for example, “Java heap space”) helps you determine whether a memory leak is involved.
- Issue in Data Throughput:
  - Flat throughput even though user load is increased: Possible reasons:
    - Thread Pool reaches its maximum limit
    - Connection Pool reaches its maximum limit
    - Network Bandwidth Issue
  - Throughput Dropped:
    - Application Crashed
    - High Response Time
- Issue in Database: Analyze the performance of the application’s database and any external services it interacts with. Check for slow queries, improper indexing, or other constraints that may contribute to the performance bottleneck. Consider optimizing database queries, caching data, or scaling up the infrastructure supporting external services.
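The two-thread-dump comparison described under Non-Terminating Loops can also be sketched in code. The example below is a minimal in-process version that uses the standard `java.lang.management.ThreadMXBean` API instead of external dump files. The class and method names are hypothetical, and it assumes the code runs inside the JVM being investigated.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: find threads that stay RUNNABLE in the same method across two snapshots. */
final class RunnableThreadDiff {

    /** Maps thread id to "Class.method" of the top stack frame, for RUNNABLE threads only. */
    static Map<Long, String> snapshotRunnable() {
        Map<Long, String> topFrames = new HashMap<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null || info.getThreadState() != Thread.State.RUNNABLE) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                topFrames.put(info.getThreadId(),
                        stack[0].getClassName() + "." + stack[0].getMethodName());
            }
        }
        return topFrames;
    }

    /** Returns the threads that are RUNNABLE in the same method in both snapshots. */
    static Map<Long, String> stuckInSameMethod(Map<Long, String> first, Map<Long, String> second) {
        Map<Long, String> suspects = new TreeMap<>();
        for (Map.Entry<Long, String> entry : first.entrySet()) {
            if (entry.getValue().equals(second.get(entry.getKey()))) {
                suspects.put(entry.getKey(), entry.getValue());
            }
        }
        return suspects;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Long, String> first = snapshotRunnable();
        Thread.sleep(10_000); // 10-second gap between the two snapshots
        Map<Long, String> second = snapshotRunnable();
        stuckInSameMethod(first, second)
                .forEach((id, method) -> System.out.println("thread " + id + " still in " + method));
    }
}
```

A match is only a suspect, not proof of an infinite loop: threads waiting in native I/O are also reported as RUNNABLE by the JVM, and the thread taking the snapshots will list itself. Comparing full stack traces rather than only the top frame would narrow the results further.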
Issue Resolution: The points above list most of the common bottlenecks and their possible reasons. Resolving performance bottlenecks is the responsibility of the software development team, so share your findings and observations with them to help prepare a quick and permanent fix. Even a small effort in performance root cause analysis speeds up the overall cycle of issue resolution.
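As an illustration of the kind of fix a development team might apply, take the unsynchronized HashMap case mentioned earlier. The sketch below shows a counter shared by many threads that is backed by a ConcurrentHashMap. `HitCounter`, its methods, and the `/checkout` key are hypothetical names used only for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical hit counter shared by many request threads. */
final class HitCounter {
    // ConcurrentHashMap supports concurrent readers and writers; a plain HashMap does not.
    private final Map<String, Integer> hits = new ConcurrentHashMap<>();

    void record(String page) {
        // merge() performs the read-modify-write for this key atomically.
        hits.merge(page, 1, Integer::sum);
    }

    int count(String page) {
        return hits.getOrDefault(page, 0);
    }

    /** Runs `threads` workers that each record `perThread` hits, then returns the total. */
    static int simulate(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.record("/checkout");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.count("/checkout");
    }

    public static void main(String[] args) {
        System.out.println(simulate(8, 10_000)); // prints 80000
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, the same workload could lose updates or corrupt the map's internal structure, which is the situation that thread dumps stuck in get() or put() point to.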

Some Important Debugging Tips

Always Monitor System Resources during the Performance Test: Throughout the load test, monitor the system’s resources, including CPU, memory, disk I/O, and network traffic. To collect real-time performance statistics, use tools such as Datadog, Dynatrace, or New Relic. Investigate any anomalies or resource bottlenecks that might be the source of the crash or bottleneck.
Collect and Analyze Logs and Error Messages: Analyse the dumps, error messages, and application logs generated during the load test. Keep an eye out for any particular error codes, exceptions, or warnings that can shed light on the reason behind the performance bottleneck. Frequently, log files can provide important details about the application’s status as well as any exceptions or errors that may have occurred.
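A quick way to start the log analysis described above is to count which exception types appear most often during the test window. The sketch below assumes Java-style logs in which exception class names end in `Exception` or `Error`. The class name, the regular expression, and the sample log lines are illustrative assumptions, not the format of any particular logging framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count exception and error type names found in application log lines. */
final class LogErrorSummary {

    // Matches simple or fully qualified type names ending in "Exception" or "Error".
    private static final Pattern THROWABLE = Pattern.compile(
            "\\b((?:[A-Za-z_][A-Za-z0-9_]*\\.)*[A-Z][A-Za-z0-9_]*(?:Exception|Error))\\b");

    /** Returns each matched type name with the number of times it occurs. */
    static Map<String, Integer> countThrowables(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher matcher = THROWABLE.matcher(line);
            while (matcher.find()) {
                counts.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; in practice these would be read from the log file.
        List<String> sample = Arrays.asList(
                "10:00:01 ERROR java.lang.OutOfMemoryError: Java heap space",
                "10:00:02 ERROR request failed: java.sql.SQLTimeoutException",
                "10:00:03 ERROR java.lang.OutOfMemoryError: Java heap space",
                "10:00:04 INFO  request served");
        countThrowables(sample).forEach((type, n) -> System.out.println(n + " x " + type));
    }
}
```

Sorting the counts and lining them up against the time of the response time or CPU anomaly usually shows which exception deserves a closer look first.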
Review Code and Application Architecture (For Performance Engineer): Dig into the application code in the areas relevant to the issue. Check for possible causes of performance bottlenecks under high load, such as memory leaks, inefficient algorithms, or coding flaws. Additionally, assess the application’s architecture to find any scalability problems or design defects that can be linked to the issue.
Use Debugging Tools: Use profilers and debugging tools to examine how the application behaves during the load test. Certain points of failure can be found by walking through the code, inspecting variables, and using tools like GDB, Xdebug, or Visual Studio Debugger. Code performance and resource usage can be better understood with the help of profilers such as Blackfire or YourKit.
Load Balancer and Scaling Configuration: If the application is distributed among several servers or employs a load balancer, examine the setup for load balancing and scalability. Make sure that the infrastructure is equipped to handle the anticipated load and that the load is distributed evenly across all the servers. Where necessary, adjust the configuration to maximise performance and reduce the chances of application crashes.
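One simple way to check whether the load is distributed evenly is to compare each server’s CPU utilization with the average across all servers. The sketch below flags servers that deviate by more than a chosen tolerance. The server names, the utilization figures, and the 20-point tolerance are assumptions for illustration; real values would come from your monitoring tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: flag servers whose CPU utilization is far from the average of all servers. */
final class LoadBalanceCheck {

    /** Returns servers whose CPU % differs from the mean by more than tolerancePoints. */
    static List<String> outliers(Map<String, Double> cpuPercentByServer, double tolerancePoints) {
        List<String> result = new ArrayList<>();
        if (cpuPercentByServer.isEmpty()) {
            return result;
        }
        double sum = 0.0;
        for (double value : cpuPercentByServer.values()) {
            sum += value;
        }
        double mean = sum / cpuPercentByServer.size();
        for (Map.Entry<String, Double> entry : cpuPercentByServer.entrySet()) {
            if (Math.abs(entry.getValue() - mean) > tolerancePoints) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical average CPU % per server during the test.
        Map<String, Double> cpu = new LinkedHashMap<>();
        cpu.put("app-1", 42.0);
        cpu.put("app-2", 45.0);
        cpu.put("app-3", 93.0);
        System.out.println(outliers(cpu, 20.0)); // prints [app-3]
    }
}
```

A server flagged this way is a starting point for checking the load balancer’s routing rules, session stickiness, or health checks, rather than a diagnosis in itself.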

Conclusion

The role of a performance tester is to measure the performance of an application or system under load, whereas the role of a performance engineer is to identify the root cause of performance issues. In principle, neither of them resolves the issue, but both can pinpoint it according to their skill and experience. Debugging complex issues under heavy load can be challenging, and it may require a combination of techniques and expertise to resolve them.