How The Audit Trail Helped Me Understand My Users And Saved $9,430 Per Month

Monitoring Is King

A Monday morning call about a very serious production ticket is a better eye-opener than even a cup of Navy coffee. I was on the support line when it happened, handling bug triage and explaining to a user why the current behaviour in production was not a bug but normal system behaviour. Suddenly, I received a report that our system was not working at all. No response and a blank page woke me up immediately. Although the application was a big ball of mud, our monitoring system quickly showed that the issue was related to database performance. I informed the frontline manager of the situation and asked for extra time to investigate.

The bug looked extremely bad. If you are familiar with the Azure SQL Database plans, you know that once you hit any of the resource limits, your database stops working. We had reached one of them: Max concurrent workers. Fortunately, the recommendation to temporarily upgrade the Azure SQL Database plan was accepted, which raised this limit. The application started working again, and we gained extra time to figure out why it demanded such an unexpected number of workers.
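Max concurrent workers is one of the per-tier resource limits Azure SQL Database enforces; once it is exhausted, new requests are refused. If you want to see how close you are before that happens, a minimal sketch along these lines can poll the sys.dm_db_resource_stats DMV (the connection string and the 80% alert threshold are illustrative assumptions, not values from this incident):

// A hedged sketch, not the tooling used in this incident: read the most
// recent worker utilization sample from sys.dm_db_resource_stats, which
// keeps roughly one hour of history at 15-second intervals.
using Microsoft.Data.SqlClient;

const string connectionString = "<your-azure-sql-connection-string>";

using var connection = new SqlConnection(connectionString);
connection.Open();

using var command = new SqlCommand(
    @"SELECT TOP (1) end_time, max_worker_percent
      FROM sys.dm_db_resource_stats
      ORDER BY end_time DESC", connection);

using var reader = command.ExecuteReader();
if (reader.Read())
{
    var endTime = reader.GetDateTime(0);
    var workerPercent = reader.GetDecimal(1);

    // 80% is an arbitrary example threshold for raising an alert.
    if (workerPercent > 80m)
        Console.WriteLine($"[{endTime:u}] Workers at {workerPercent}% of the plan limit!");
}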

Putting Out The Fire

We had a great opportunity to play the blame game and talk again about how horrible the legacy application was and how important a rewrite would be. However, we took a different approach. We wanted to make the Client happy by stabilizing the application and reducing its cost, hoping this would build enough of the Client’s trust to let us rewrite the entire application.

[Image: cost chart]

When we upgraded the plan, the cost was “upgraded” too. We started from a position where the estimated cost was around $8,432 per database. Because we used geo-replication, we had primary and secondary databases, and both must run the same service tier, so the cost doubled. We used the HS_Gen4_24 compute size to keep roughly a 20% margin below the limit. We were ready to dive into the issue and fix whatever had caused this situation. Was it the newest code change, or abnormal network traffic that demanded special treatment?

Find The Needle In That Haystack

Together with my colleagues, I identified several possible causes, but nothing gave us an unambiguous answer as to why the database hit its maximum limit. When we thought all hope was lost, the answer came from looking at our own audit trail mechanism. It turned out the application had to process many time-consuming operations in the background. To make a long story short, the application was consuming many records of 1 MB or more (relational databases don’t like varchar(max) columns).

From our audit logs, we learned the issue was a combination of heavy load and poor implementation. The load was not evident because it was hidden behind a complex custom queue mechanism. The custom audit trail helped us understand how the business flow worked and spot the sequential operations. After we made the necessary changes, our usage looked much better.

[Image: max concurrent workers chart]

After a while, we were able to move to a lower-cost plan. We were very happy to see the following cost analysis chart (for the selected primary database). The situation had stabilized, and we could go ahead and start talking about how to rewrite the application.

[Image: cost analysis chart, primary database]

How To Get Started With Audit Trail

There are many different approaches to collecting audit logs. In my opinion, the easiest way to start is either a special table designed to keep the full history of data changes (a system-versioned temporal table) or a custom mechanism. Your own mechanism can use your ORM’s change-tracking ability and capture changes every time you save data. The following audit log comes from my Audit Trail plugin.

[Image: sample audit log from the Audit Trail plugin]
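As a taste of the first option, here is a hedged sketch of how a system-versioned temporal table can be enabled through EF Core 6+ model configuration (Product is a hypothetical entity used only for illustration; my plugin uses the custom approach described below instead):

// A minimal sketch, assuming EF Core 6+ with the SQL Server provider:
// mapping an entity to a system-versioned temporal table makes SQL Server
// keep the full change history in a companion history table automatically.
using Microsoft.EntityFrameworkCore;

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; } = string.Empty;
    public decimal Price { get; set; }
}

public class ShopContext : DbContext
{
    public DbSet<Product> Products => Set<Product>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // IsTemporal() tells the provider to create and maintain the history table.
        modelBuilder.Entity<Product>()
            .ToTable("Products", tableBuilder => tableBuilder.IsTemporal());
    }
}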

For the custom approach, I used Entity Framework’s change-tracking ability to determine the type of modification and build an audit object that collects the necessary information about the change. You can find the full implementation on GitHub here.

// Walk every property of the tracked entity and capture the values
// relevant to its change type.
foreach (var property in entry.Properties)
{
    string propertyName = property.Metadata.Name;

    // Primary keys are always captured so the audit record can be
    // linked back to the affected row.
    if (property.Metadata.IsPrimaryKey())
        auditBuilder.AddPrimaryKey(propertyName, property.CurrentValue);

    // For inserts only the new values exist; for deletes only the old ones.
    if (entry.State == EntityState.Added)
        auditBuilder.NewValue(propertyName, property.CurrentValue);

    if (entry.State == EntityState.Deleted)
        auditBuilder.OldValue(propertyName, property.OriginalValue);

    // For updates, store both sides, but only for properties that changed.
    if (entry.State == EntityState.Modified && property.IsModified)
    {
        auditBuilder.OldValue(propertyName, property.OriginalValue)
            .NewValue(propertyName, property.CurrentValue);
    }
}

The code sample above comes from my GitHub repository. You can find a reusable .NET API project template with a full set of functionalities there.
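To give a sense of where that loop runs, here is a hedged sketch of wiring it into SaveChanges through the ChangeTracker; AuditBuilder and SaveAuditRecord are hypothetical stand-ins for the plugin’s real types, not its actual API:

// A simplified sketch, not the plugin's exact implementation: collect
// audit records for every added, modified, or deleted entity before the
// changes are persisted.
public override int SaveChanges()
{
    var auditedStates = new[] { EntityState.Added, EntityState.Modified, EntityState.Deleted };

    foreach (var entry in ChangeTracker.Entries()
                 .Where(e => auditedStates.Contains(e.State)))
    {
        // AuditBuilder is a hypothetical stand-in for the plugin's builder type.
        var auditBuilder = new AuditBuilder(entry.Metadata.Name, entry.State);

        // ... run the property loop shown above ...

        SaveAuditRecord(auditBuilder.Build());
    }

    return base.SaveChanges();
}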

FinOps As a New Skill

Considering cost efficiency is the major difference between the public cloud model and the on-premises setup. Knowing how not to waste money and how to get the highest return is becoming crucial to being a good business partner. As a team, you should observe key metrics related to budgeting and identify opportunities for cost reduction. Achieving this without monitoring is impossible (I wrote about why it is important here). Any information you can collect about your users’ behaviour and how your system works is a must. Have you had similar situations in your experience? What were your most interesting production failures? Please share them in the comments.
