My experience of how data analysis helped with performance optimization in a practical scenario, and what we can learn from it.
Imagine you need to make a bank transaction, but the website is painfully slow. The page takes ages to load; all you can see is a spinning blue circle.
Or imagine you are presenting something important to key stakeholders, but your screen is stuck. You wait a few seconds, then a few more, with no luck. It is terribly frustrating.
System performance issues are a reality. We face them from time to time.
So much thought and time goes into designing a software application to enhance the user experience. Every minute detail of the layout, the look and feel, and the usability features receives close attention. Designing for optimal system performance is equally important.
The system should be fast. Users expect that at the click of a button the page loads or the transaction gets processed. The experience should be seamless, with no wait times. Otherwise, you might lose a client.
Despite careful design, development, testing, and implementation, there will be times when a software application performs badly for various reasons: the business may have changed, the design was not optimal or scalable, there were unknowns no one thought through, and so on.
How do we approach performance optimization? Why and how does data analysis help in performance improvement? Can we have a framework to predict and prevent performance issues?
I have come across several such scenarios and in this article, I will share some views on each of these questions.
How to approach system performance optimization? Start with finding the problem.
When there is a system performance issue, everyone knows there is a problem, but most of the time no one knows what the exact problem is.
There will be several user complaints to the service desk, emails directly from frustrated end users, and possibly an escalation to senior management in severe cases.
At such a stage, the common tendency is to come up with a quick fix to resolve the issue as soon as possible. While a quick fix is the need of the hour and may work sometimes, many times it won't.
It requires finding out what exactly the real problem is, and that is hard. The business might have one view of the problem, IT developers and backend administrators another, and management an altogether different one.
Performance specialists, database administrators, application developers, functional consultants, and business users all need to work collaboratively to get to the root of the issue.
Questions like the ones below will yield some insights.
- What exactly is the performance issue? Which business scenarios does it affect?
- Is there a performance benchmark? What was the expected performance and what is it now? What is the deviation?
- Is the issue observed by all end users, or only by specific business users? If the application is global, is the issue specific to a geography or observed everywhere?
- Has the system performance degraded at a specific time of the day or week? Is there any pattern?
- How are users accessing the application? Over the office internet? Via VPN? Over a home network? There could be several variations.
- Which top scenarios does it affect? Do the affected scenarios matter most to the business? Sometimes there is no correlation.
- Is the issue reproducible?
There could be many such questions. With the right set of questions, you gain the insight needed to find the real problem.
How does Data Analysis Help?
Once we gather the information from users, IT developers, support staff, and management, we understand the problem much better than before. However, in most cases we still don't know the exact problem.
Is it because of poor software design or coding? Are the system resources too low? Has the transaction volume grown significantly? Is the business causing the issue by following scenarios and actions that should not be performed? There are still many such questions to answer.
This happens for multiple reasons. I have seen cases where no one had the complete picture. Each stakeholder looks at the issue from their own vantage point, which may or may not be the right one. But we do get enough sense of the problem to proceed further.
That's where data analysis helps a great deal. Data can't lie. If order processing is slow, the data shows which orders, at what time of day, and when; you can drill down and learn a great deal about when the problem started, what has changed, and so on.
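To make this concrete, here is a minimal sketch of that kind of drill-down, assuming a hypothetical log of order processing times (the column names and the synthetic data are my own illustration, not the actual system's schema):

```python
from datetime import datetime, timedelta
import random

import pandas as pd

# Hypothetical processing log: one row per order with a start time and a duration.
random.seed(42)
start = datetime(2024, 1, 1)
rows = [
    {
        "order_id": i,
        "started_at": start + timedelta(minutes=random.randint(0, 7 * 24 * 60)),
        "duration_s": random.expovariate(1 / 90),  # avg ~90 seconds, illustrative
    }
    for i in range(2000)
]
df = pd.DataFrame(rows)

# Drill down: average and worst-case duration by hour of day,
# to see whether the slowness clusters at specific times.
by_hour = df.groupby(df["started_at"].dt.hour)["duration_s"].agg(["mean", "max", "count"])

print(by_hour.sort_values("mean", ascending=False).head())
```

The same `groupby` pattern works for weekday, geography, or order type, which is exactly the kind of slicing that reveals when the problem started and what changed.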
I will illustrate this with an example.
I worked in an environment where Oracle E-Business Suite (EBS) ERP and Oracle Mobile Field Service (MFS) were implemented for business operations.
The Oracle EBS database held all the required enterprise data. Field staff would have their service request information synchronized to their laptops regularly so they could service their clients. This synchronization process ran between Oracle MFS and Oracle EBS in the background.
The data needed to be near real-time, with a maximum lag of, say, 5 to 10 minutes, not more. The synchronization process needed to perform extremely well; otherwise, field staff wouldn't get the data they needed to service clients.
In this context, the system was performing badly. The field staff had raised several complaints that the synchronization process was slow. They were frustrated.
Users reported this problem globally. Sometimes, they said, they had to wait several minutes, sometimes even hours. It affected not only their work but their personal lives too.
Everyone had their own thoughts on what the problem and the fix were. Short-term fixes were tried (query tuning, increasing infrastructure resources such as CPU and memory), but nothing worked for long.
What we needed was to take a step back and understand what exactly was the problem and fix it forever. That's where the data analysis came into help.
The breakthrough was a simple model that came out of data analysis. Until that point, there was no data view that showed the performance problem. Getting there was a bit tricky in an integration scenario, but a simple view of synchronization times, bucketed against the acceptable limits, helped all the stakeholders.
That view showed that the acceptable limit for the business was under 2 minutes for data synchronization, and no longer than 10 minutes even for a larger data set; in reality, 73% of synchronizations were taking longer than 10 minutes. Massive improvement was needed.
No one was aware of this data until that point. Everyone had their own perception!
Business users thought the problem was much worse than this, whereas IT developers thought the situation was better; the real problem was somewhere in between.
While this may look like a simple outcome, in practice it is hard to come up with such a simple view. Keeping things simple is difficult. Even this simple view, no one had!
As a next step, in discussion with the stakeholders, we clearly laid out the target: 90% of synchronizations in less than 2 minutes, 6% between 2 and 5 minutes, and 4% between 5 and 10 minutes.
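Building that bucketed view is straightforward once the durations are in hand. A minimal sketch, using illustrative numbers rather than the real data set, and the target percentages agreed with the stakeholders:

```python
import pandas as pd

# Illustrative synchronization durations in minutes (not the real data set).
durations_min = pd.Series([1.2, 0.8, 14.5, 25.0, 3.1, 7.4, 45.0, 1.9, 12.3, 0.5])

# Bucket against the business's acceptable limits.
buckets = pd.cut(
    durations_min,
    bins=[0, 2, 5, 10, float("inf")],
    labels=["< 2 min", "2-5 min", "5-10 min", "> 10 min"],
)
actual_pct = buckets.value_counts(normalize=True).sort_index() * 100

# Targets agreed with stakeholders: 90% under 2 min, 6% 2-5 min, 4% 5-10 min.
target_pct = pd.Series([90, 6, 4, 0], index=actual_pct.index)

report = pd.DataFrame({"actual_%": actual_pct.round(1), "target_%": target_pct})
print(report)
```

A table like this, refreshed regularly, gives every stakeholder the same picture of where reality stands against the target.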
It is amazing how much such data analysis helps various stakeholders, even on a purely technical problem. I have seen it prove useful at the IT developer level, in the business user community, and even at program board meetings.
Should we go ahead with the rollout to other global regions? How long might this take to fix? Will we be able to deliver services to customers within the agreed SLAs? Data analysis and trends can answer a lot of such questions.
Once this was established, the as-is state and the targets were shared and communicated in layman's terms so that even a non-technical person could easily understand them.
Once the problem is defined, half the battle is won.
It is extremely important to clearly identify, state, and communicate the problem; this brings all the stakeholders onto the same page and avoids ambiguity. It also generates positive momentum toward the target.
A number of solutions were possible in this context; they are outside the scope of this article, but briefly, architects, developers, business analysts, and users were all involved in taking this forward.
One approach was identifying the top 10 SQL queries consuming the most time (those executing beyond 15 minutes) and tuning them: rewriting them, optimizing execution through indexing, altering execution plans, and overall performance tuning.
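In an Oracle environment the raw timings would typically come from the database's own monitoring views; once exported, ranking the worst offenders is simple. A minimal sketch, assuming a hypothetical export of `(sql_id, elapsed_seconds)` samples (the IDs and numbers are invented for illustration):

```python
from collections import defaultdict

# Hypothetical (sql_id, elapsed_seconds) samples exported from a DB monitor.
samples = [
    ("abc123", 950), ("abc123", 1100), ("def456", 300),
    ("ghi789", 2000), ("def456", 250), ("abc123", 980),
]

total = defaultdict(float)
count = defaultdict(int)
for sql_id, elapsed in samples:
    total[sql_id] += elapsed
    count[sql_id] += 1

# Rank by total elapsed time; flag anything averaging beyond 15 minutes (900 s).
ranked = sorted(total, key=total.get, reverse=True)[:10]
for sql_id in ranked:
    avg = total[sql_id] / count[sql_id]
    flag = "TUNE" if avg > 900 else "ok"
    print(f"{sql_id}: total={total[sql_id]:.0f}s avg={avg:.0f}s [{flag}]")
```

Ranking by total elapsed time rather than a single slow run keeps the focus on the queries that cost the system the most overall.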
Another approach was batching: splitting the overall execution from serial to parallel processing by creating threads, and improving the execution time of each thread. Yet another was to reduce the network delay caused by the data volume.
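The serial-to-parallel split can be sketched in a few lines. This is an illustration only, with a stand-in `sync_batch` function simulating an I/O-bound round trip, not the actual synchronization code:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def sync_batch(batch):
    """Stand-in for synchronizing one batch of records (assumed I/O-bound)."""
    time.sleep(0.01)  # simulate a network/database round trip
    return len(batch)

records = list(range(1000))
batch_size = 100
batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

# Serial: each batch waits for the previous one to finish.
t0 = time.perf_counter()
serial_done = sum(sync_batch(b) for b in batches)
serial_s = time.perf_counter() - t0

# Parallel: batches run concurrently on a small thread pool.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    parallel_done = sum(pool.map(sync_batch, batches))
parallel_s = time.perf_counter() - t0

print(f"serial={serial_s:.3f}s parallel={parallel_s:.3f}s processed={parallel_done}")
```

Threads help here because the work is dominated by waiting on I/O; for CPU-bound work a different parallelism model would be needed.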
A combination of these solutions helped to solve the problem and improve performance in this context. The above data analysis also helped to validate and verify the results to satisfaction after the optimization.
A predictive model for performance analysis
I have come across various performance problems in production environments. Sometimes a single badly performing SQL query affected the entire system; in one case, an overnight gather-statistics batch run had altered the query's execution plan.
In another instance, the order shipment process had a severe lag, affecting employees on the warehouse floor. We face such issues often, in spite of the measures in place.
Performance specialists are quick to identify and apply fixes in such cases: better execution plans, SQL profiles, and so on.
What I have found useful is having dashboards that monitor performance regularly and help prevent potential issues upfront.
This is tricky, but effort in this direction goes a long way in building best-performing applications. The system can have intelligence and predictive capability based on past data and trends to predict the performance issue.
Let's say the system can predict that the inflow data volume will be much higher than on a regular day; we can build in intelligence to alert on potential downstream performance issues and possibly fix them in advance.
Building a data analysis framework and dashboards is important so that we carry out the analysis at regular intervals and alert administrators, architects, and the business as needed. Simulating data volume growth, having a strategy for archival and purging: the options are many.
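Even a very simple statistical rule can serve as a first predictive alert. A minimal sketch, assuming a hypothetical history of daily inflow record counts (all numbers invented for illustration):

```python
import statistics

# Hypothetical daily inflow record counts for the last two weeks.
history = [10_200, 9_800, 10_500, 10_100, 9_900, 10_400, 10_000,
           10_300, 9_700, 10_600, 10_200, 10_100, 9_950, 10_250]
today = 16_000  # today's projected inflow

mean = statistics.mean(history)
stdev = statistics.pstdev(history)

# Simple rule: alert if today's volume is more than 3 standard deviations
# above the historical mean, giving administrators time to react downstream.
z = (today - mean) / stdev
if z > 3:
    print(f"ALERT: inflow {today} is {z:.1f} sigma above normal (~{mean:.0f})")
```

A real framework would use seasonality-aware baselines and feed a dashboard, but even a threshold like this catches the "unusually large inflow" case before it becomes a downstream performance incident.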
The whole exercise of performance improvement doesn’t follow a fixed pattern. An approach successful in addressing one kind of problem may not work for another, although it may look like a similar problem.
As much as a system performance issue is a design and technology problem, the solution needs more human involvement and management of expectations and emotions. We need to work with actual people in a practical environment.
This exercise is partly software engineering; partly it requires broad skills combining art, communication, and humanity along with technology.
Data analysis not only helps identify the root of the problem; it is also essential for bringing all stakeholders onto the same page, managing expectations, and delivering results. So we must give it due time and focus when we approach performance improvement.