
Taking a Performance Bottleneck and Crushing It – IMS Login Domain Server Rewrite

05 May 2022
By Janno Holm, IMS User Account Management Delivery Manager

What Is It?

In lay terms, the Login Domain Server is a user-facing component that serves as a proxy between gaming clients and our online casino management platform, IMS. It handles a number of critical tasks, including site whitelists, launching games, and, most critical of all, player login and logout. Specifically, it acts as a proxy for Login Service, which manages these player logins. Login Service is the most widely integrated and used component in IMS, serving over 39 million active players. It is used to log in or open games in countless ways and environments, from a simple balance check at the player account web portal to logging into a sports betting app in the middle of the Super Bowl.

As you can imagine, login functionality is one of the first things to get implemented in any online product. If it works without issues, you tend to leave it well alone in case your bold upgrades or tweaks break it and cause a major business outage. Therefore, it has seen no significant improvements for the past 10 years. This led to the situation we now found ourselves in – a critical component that’s about 20 years old, using cutting-edge web tech from the early 2000s. Yes, you guessed it – PHP, the most ancient technology at Playtech.


Meanwhile, time has not stood still. On the big online sites using our software, the number of player logins the component is supposed to handle has grown by tens of times. Simple fine-tuning was no longer cutting it. Eking any performance improvements out of this old tech had become very difficult, and the ratio of development effort to results was becoming punishingly bad. Additionally, the component was stressing the database to the point where the database crashed a few times.

For instance, we have a site configuration which allows 5500 worker threads and accordingly takes up 5500 database connections. That’s 50% of all database connections allowed for the whole site. Even though these connections stay mostly idle, a fast ramp-up in connection count during peak times can still put enough load on the database to make it crash. 

These issues were mostly caused by our PHP core’s old design, and were not a shortcoming of PHP in itself. Today, we are almost exclusively a Java house and our PHP expertise is thin as a result. Before giving up on PHP, we tried introducing connection pooling, but failed. As a result, we decided that the time was ripe for a full Java rewrite of this component. The main aim was to crush this performance bottleneck for good. Desirable side benefits included improved security, less maintenance effort, and support for dockerization.

Setting the Goals for the Rewrite

Aim for the stars and you will at least reach the Moon, right? A good set of goals can be very motivating, so we created the following for ourselves. 

  • Have the same API and behavior (luckily it is mostly a proxy to begin with).
    • Integrations cannot be impacted. There are too many of them to fix if they are.
  • Component must be scalable and efficient.
    • Has proper non-blocking IO implementation.
    • Probably will not use less CPU but should behave in expected ways and have clear limits.
    • Load on database removed completely.
  • Support for security tools.
    • Hardening against login attacks by using load-limiting tools. The option to keep load behind a wall and not let it in.
    • Already existing security tools and processes must keep working seamlessly. 

A Novel Concept in Practice – Creating a Mission Team 

At our company, we’ve had a lot of experience with internal improvement projects. Owing to the inherent complexities of our products, they tend to take a long time. It is hard to keep focus with all the business projects, unexpected tasks, ideas, and distractions. We wanted to focus and roll something to production fast, to get the proverbial monkey off our backs before the next world-famous peak activity events like the Super Bowl in the USA, the Grand National horse races in the UK, the football World Cup, and others.

To make sure that everything flowed smoothly, we decided to create a dedicated mission team to focus exclusively on implementation and rollout. For the duration of this project, we took two developers, a QA engineer and a devops engineer off their normal daily tasks and gathered them into the mission team. The mission team had only one aim, one focus, and one goal – the implementation of the core of the new Login Domain Server and the most-used scripts, with everything to be ready for rollout ASAP. This novel approach was received well by our people and turned out to be a success. The team was dedicated for four two-week sprints of delivery.

Laying the Foundation – Technical Improvements 

Due to pressing business projects, we do not have the luxury of creating new components every year. Therefore, this project was a welcome chance to review our tech stack, at least to some degree. With our timeline being very aggressive, we knew that we wouldn’t be able to create everything from scratch. As a compromise, we built our new Login component using the Open Liberty framework on top of our existing core library.

The existing core library was needed so that the component would be integrated with everything we have, out of the box. Open Liberty was chosen because it is lightweight and does not have a ton of dependencies. It gives us game-changing benefits we did not have access to before, such as asynchronous HTTP request handling (meaning worker threads are not tied up waiting when there is an IO bottleneck) and hot deployment support that allows us to change server logic in less than a second while the server is running. As you may imagine, this is extremely useful for fast development and testing, and it also met our side-goal of making our devs happier.
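To make the benefit of asynchronous request handling concrete, here is a minimal sketch using only the JDK's `CompletableFuture`. It is not the actual Login Domain Server code and does not use the Open Liberty API; the class `AsyncLoginSketch`, the simulated backend call, and all parameter names are assumptions made for illustration. The point it shows is that a small, fixed number of worker threads can serve many in-flight requests because no thread sits blocked while the backend is slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and structure are assumptions.
final class AsyncLoginSketch {

    // Simulates a slow backend call without occupying a thread while waiting:
    // the timer completes the future after the delay has passed.
    static CompletableFuture<String> callBackend(ScheduledExecutorService timer,
                                                 String player, long delayMillis) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        timer.schedule(() -> reply.complete("OK:" + player), delayMillis, TimeUnit.MILLISECONDS);
        return reply;
    }

    // Handles `requests` logins with only `workerThreads` workers and
    // returns the replies in request order.
    static List<String> handleAll(int requests, int workerThreads, long backendDelayMillis) {
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                String player = "player" + i;
                pending.add(
                    CompletableFuture
                        // Step 1: cheap pre-processing on a worker thread.
                        .supplyAsync(() -> player.trim(), workers)
                        // Step 2: the backend wait holds no worker thread.
                        .thenCompose(p -> callBackend(timer, p, backendDelayMillis))
                        // Step 3: post-processing, again on a worker thread.
                        .thenApplyAsync(r -> r.toLowerCase(), workers));
            }
            List<String> replies = new ArrayList<>();
            for (CompletableFuture<String> f : pending) {
                replies.add(f.join());
            }
            return replies;
        } finally {
            workers.shutdown();
            timer.shutdown();
        }
    }
}
```

With a blocking design, 200 requests against a 100 ms backend on 2 threads would need about 10 seconds (200 × 0.1 s / 2). In a sketch like this one the waits overlap, so `AsyncLoginSketch.handleAll(200, 2, 100)` should complete in roughly the backend delay plus scheduling overhead.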

Another technological advance we implemented during the project was a JSON converter for our in-house binary Galaxy protocol. Our internal components communicate over the Galaxy protocol, so automatic conversion between JSON and Galaxy messages in both directions allows external parties to use all our existing APIs over JSON with no additional development effort from us. It is now the de facto standard protocol for integrating with any external parties.

One key decision worth mentioning was having the new component work as a proxy for the old component at the early stages of the rollout. This gave us full control over which parts of the component were switched over to the new solution and which parts were still proxied to the old one. It gave us the freedom to roll forward or backward per script, per casino, or even per single player, without needing any changes to the F5 load balancer. This approach also allowed us to address the biggest elephant in the room – the lackluster performance of the old component. The incoming requests were now proxied to the old component via a smaller number of more effective Apache workers, cutting down the number of database connections needed.
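The per-script, per-casino and per-player switching described above can be pictured as a small routing decision made for every incoming request. The following is a hedged sketch, not the real component: the class `MigrationRouter`, its method names, and the precedence of the rules (script first, then pilot player, then casino) are all assumptions chosen for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: names and rule order are assumptions, not the real design.
final class MigrationRouter {

    enum Target { NEW_IMPLEMENTATION, LEGACY_PROXY }

    // Scripts that have been rewritten and switched on in the new component.
    private final Set<String> migratedScripts = ConcurrentHashMap.newKeySet();
    // Casinos temporarily rolled back to the old component.
    private final Set<String> rolledBackCasinos = ConcurrentHashMap.newKeySet();
    // Specific players who test the new solution regardless of casino state.
    private final Set<String> pilotPlayers = ConcurrentHashMap.newKeySet();

    void migrateScript(String script)  { migratedScripts.add(script); }
    void rollBackScript(String script) { migratedScripts.remove(script); }
    void rollBackCasino(String casino) { rolledBackCasinos.add(casino); }
    void restoreCasino(String casino)  { rolledBackCasinos.remove(casino); }
    void addPilotPlayer(String player) { pilotPlayers.add(player); }

    Target route(String script, String casino, String player) {
        // A script that is not switched on can only be served by the old component.
        if (!migratedScripts.contains(script)) {
            return Target.LEGACY_PROXY;
        }
        // Pilot players get the new implementation even in a rolled-back casino.
        if (pilotPlayers.contains(player)) {
            return Target.NEW_IMPLEMENTATION;
        }
        // A rolled-back casino keeps using the old component via the proxy.
        if (rolledBackCasinos.contains(casino)) {
            return Target.LEGACY_PROXY;
        }
        return Target.NEW_IMPLEMENTATION;
    }
}
```

Because the decision lives inside the new component, flipping any of these switches changes where a request is served without touching the load balancer in front of it.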

When a Plan Comes Together – Rollout

Since Login Domain Server is the most integrated component we have, of course the rollout would not go forward without hiccups – and we didn’t expect it to. We had several mitigation efforts in place. The initial rollout plan was to introduce the solution to all our staging sites, wait for a few releases to catch the major issues, and then roll to production. Other precautions included: 

  • making the F5 load balancer switch a separate step
  • adding an option to test the new solution in production with specific players
  • introducing the initial switch for 10% of players

The final step was getting a clear confirmation from the licensee themselves. 
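A switch for a fixed share of players, such as the 10% step listed among the precautions, is commonly implemented by hashing a stable player identifier into a bucket. The sketch below shows that general technique under stated assumptions: the class `PercentageRollout`, the use of CRC32, and the 100-bucket scheme are illustrative choices, not a description of how IMS actually selected players.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of deterministic percentage bucketing.
final class PercentageRollout {

    // Maps a player id to a stable bucket in the range 0..99.
    static int bucketOf(String playerId) {
        CRC32 crc = new CRC32();
        crc.update(playerId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // A player uses the new solution if their bucket falls below the rollout percentage.
    static boolean usesNewSolution(String playerId, int percent) {
        return bucketOf(playerId) < percent;
    }
}
```

Since the bucket depends only on the player id, a given player lands on the same side of the switch on every request, and raising the percentage only ever adds players to the new solution.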

The precautions turned out to be very helpful. The only exception was the 10% player base test – by the time we got to that, all the issues it was supposed to help us expose had already been solved. However, the full process of having a licensee test in production in order to get their confirmation was about as heavy and slow as can be expected.

Our release rollout was also fast enough that we could choose to postpone the official switchover in case any issues turned up. In total, there were 2 very serious and 12 moderately serious incidents directly related to the new Login Domain Server. Below is a sample of the defects we encountered.

Defect description                                                       | Duration | Impact
-------------------------------------------------------------------------|----------|---------------------------------------------
Defect - LoginAndGetTempToken did not work with email                    | 1h 45m   | whole brand
Proxy healthcheck – redirect initiated by java, that F5 could not handle | 5m       | whole brand
Defect – small java/php diff. missing secret issue                       | 1h 4m    | whole brand
Defect – messageid integer instead of string as before                   | 19h 31m  | Very small - Live client native iOS only
Defect – ipCountry code no in response any more                          | 6h 18m   | Small - Live client native Android only
Custom and wrong solution built by licensee                              | 69h 45m  | Very small - native iOS clients only
Insufficient analysis – HTML response expected                           | 16m      | whole brand
Insufficient analysis – HTML response expected                           | 48m      | Cashier not available
Insufficient analysis – HTML response expected                           | 1h 9m    | Native iOS and Android clients
Defect – logout without invalidate remember me failed                    | 3h 56m   | Games could not be launched after first launch

The rollout started at the beginning of September 2021, which was a good start. However, when you have hundreds of businesses relying on your service for daily operations, getting everyone on board takes a lot of time and effort, so the last site was only switched over on the 1st of March 2022. Some of our clients’ sites will remain in fallback mode until summer, though, because we decided not to rewrite some very deprecated and complex functionality and we needed to give them time to adjust.

What Did We Achieve?

Early performance tests indicated a considerable improvement – CPU usage dropped by over 60%, average response time was shortened by over 40% and we saw a marked decrease in the Database and Gateway loads. That was all well and good, but the final test was always going to be actual real-world results. 

The Super Bowl is one of the most demanding peak activity periods of the year for us. One of our big licensees, which specializes in sports betting, put up some solid numbers during this year’s Super Bowl. We saw 70 000 requests per minute, 5 000 logins per minute and 80 000 players online simultaneously. This allowed us to verify performance under real-world conditions. The site that hosts this licensee had been switched over to the new solution and handled the peak load with ease.

While the new component behaved as well as expected, the old component still received a share of the load because not everything had been moved over yet. Proxy mode helped here more than expected. The Apache worker count was kept at minimal levels, and not so much as a bump was observed during peak load. We reduced the load on the database caused by the Login domain by a factor of about 100. With this, we can consider our goals met and the performance bottleneck successfully crushed.

Lessons Learned 

  • The mission team concept proved to be a huge success. Delivering the new component for rollout was efficient, and the approach was also appreciated by the people in the team. It reduced distractions drastically and helped to keep the focus on the delivery.
  • Analysis of production usage could have been done better at the start of the project, especially regarding the usage of non-JSON responses, a feature that we dropped in the new component.
  • The pilot should have been run on more sites before making the new component a part of the official release rollout. The component is too critical to take chances. Every small issue needs to be addressed before switchover.
  • We expected some incidents. The count may have been somewhat higher than we expected, but looking at the big picture, many incidents impacted only a small portion of activity. 

What’s Next? 

We still have some final polishing to do before calling it done. We plan to finish rewriting all the scripts for the new component, so we can fully drop PHP by the end of 2022. This will help us remove any security issues associated with PHP and will also free us from having to maintain it. Before we can achieve this, we have a few exceptional cases to take care of first. Target – end of April.

Secondly, we will review the capacity of the Login Domain Servers. We plan to consolidate nodes from 1 vCPU to 2 vCPU instances because this is more efficient with Java. Most likely this means we will be able to release some capacity, which would be excellent news!
