
· 7 min read
Scott Dutton

At Sykes we use multiple servers to handle web traffic. As a result, we need a way to ensure that browser session data persists between requests. There are multiple ways to do this - we decided that Redis was the most suitable for us.

Redis allows us to evenly distribute load between PHP servers, while keeping the data in the session highly available (a failover Redis cluster, provided by ElastiCache). This means that if a node fails, we have a hot backup to switch to.

This post explores an issue we had after an upgrade, and how we identified, replicated and fixed the issue in the Redis driver.

Session locks

Sessions allow a stateless transport such as HTTP to maintain state between requests. This happens by giving the user a cookie containing a unique string, allowing the server side to store data about that user. This usually includes preferences they have selected, or things which are slow to calculate, giving them a personalised and performant experience. In our case this data is stored in, and read from, Redis.
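As an illustration, pointing PHP's sessions at Redis via phpredis takes just a couple of ini settings. A minimal sketch - the host and port below are illustrative, not our production values:

<?php

// phpredis registers a "redis" session save handler, so sessions can be
// stored in Redis with two ini settings (host/port illustrative).
ini_set('session.save_handler', 'redis');
ini_set('session.save_path', 'tcp://127.0.0.1:6379');

session_start();

$_SESSION['preferred_currency'] = 'GBP'; // written to Redis at request shutdown

echo session_id();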

To maintain a consistent set of data, PHP has session locking built in. This means that if you want to write to the session, you need to lock it to prevent other requests from writing potentially conflicting data.


|================| Page load 1
|==============| Page load 2

The above shows a simple example:

  • Page load 2 starts before page load 1 has finished
  • If page load 1 changes something in the session, page load 2 wouldn't know about it.
  • If page load 1 sets foo to bar and page load 2 sets foo to baz, which one should be used for the next request?

This gets even more complicated in the below example:


|=============================| Page load 1

|==============| Page load 2

Now page load 2 starts after page load 1 begins, and also finishes before page load 1 finishes; this makes it really hard to know which version of the session to use.

Session locking resolves this by allowing only one page load to access the session at once, transforming the above into:


|================| Page load 1
|******==============| Page load 2


|=============================| Page load 1
|*******************==============| Page load 2

The stars indicate time spent waiting for the session to be unlocked. This leaves the session updates unambiguous, but makes the client wait longer for the loads to happen. This is especially true for sites with multiple AJAX requests, or simply a user with two or more tabs loading multiple pages - for example, browsing some of our properties side by side.
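PHP does offer a way to shorten that waiting for requests which only read the session, via the standard read_and_close option of session_start(). A small sketch - note that any writes made after this are discarded:

<?php

// Requests that only *read* the session can release the lock immediately,
// so a second tab or AJAX call is not blocked behind this request.
session_start(['read_and_close' => true]);

// The lock has already been released here; reads are fine, but any change
// made to $_SESSION below will NOT be persisted.
echo $_SESSION['preferred_currency'] ?? 'not set';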

What we noticed

After a release, we had a small number of alerts relating to not being able to get a valid session ID.


This caused a small number of errors - enough to trigger alerting, but certainly not widespread. We have a team of developers who look at these issues as they arise; the first port of call for something like this is checking what is in the release. The errors related to getting an invalid session ID after session_start() was called.

The releases we do at Sykes are very small, so that in situations like this we can quickly see whether the problem was caused by the code in the release or by something else. In this case the changes were small, comprising only HTML and JavaScript, so we could easily rule them out as they have no impact on the server side.

That likely meant something else in the release caused the issue. As we deploy with ECS Fargate, all of our images are available in ECR, so we downloaded the most recent release before the issue and the one which introduced it, and compared them with Google's container-diff. This showed only one difference: a phpredis upgrade.

The issue only happened on some requests, which made understanding the problem much harder; we needed a way to replicate it.

Replicating the problem

When resolving issues, it's always useful to get the smallest reproducible case with as few dependencies as possible; this helps everyone involved be clear about what the problem is.

The first step was getting a super simple page set up. I did this through docker-compose, bringing up a Redis server and a really simple PHP page (just calling session_start()), and then looked at how I could make the failure happen on every single request. I started by looking at the diff of phpredis and saw that support for backoff strategies had been added to the locking, so I had a look there first; the driver still worked with those settings, so the issue had to be elsewhere.
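For reference, phpredis exposes its session locking through ini settings. A hedged sketch of the relevant knobs - the setting names come from the phpredis documentation, but the values here are illustrative, not our production configuration:

<?php

// Session-lock tuning exposed by phpredis (values are examples only).
ini_set('redis.session.locking_enabled', '1');
ini_set('redis.session.lock_expire', '60');       // seconds before a held lock expires
ini_set('redis.session.lock_retries', '100');     // -1 means retry forever
ini_set('redis.session.lock_wait_time', '20000'); // microseconds between retries

session_start(); // acquiring the session lock honours the settings above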

As the issue was around sessions and locking, the next step was to add a large sleep to the request and force a lock to be contended. This reproduced the problem; downgrading phpredis in the same setup made it disappear. So we now had a repeatable case in a few lines of code.

See below for the examples:

Request 1

<?php

session_start();

sleep(10);

echo session_id(); // Returns a valid id

Request 2

<?php

session_start(); // this should block until request 1 has completed

sleep(10);

echo session_id(); // No valid session returned

Fixing the issue

Now that we have a repeatable case, we can dig into the difference between the two versions. Redis has an excellent MONITOR mode, so that's a great place to start looking for the differences.

Dumping out the commands sent by 5.3.7 and 6.0.0 shows that there is a missing GET request after the SETs on the locks. That is very consistent with what we are seeing in the front end, as no session data is obtained. Now, to look into the code to see what's going on.

The session code is quite simple, and looking at the diff for just this file, it looks like debugging was added, but also a return in cases where a session lock couldn't be obtained. This return changes the behaviour to report a failure, which stops the rest of the code from getting the session data. Previously, when a lock couldn't be acquired, a read-only session was returned - a better default position than having no data at all.
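Conceptually, the behaviour change looks like the sketch below. This is PHP pseudocode only - the real handler is C code inside phpredis, and acquire_lock()/redis_get() are hypothetical helpers standing in for its internals:

<?php

// Before: failing to obtain the lock fell back to read-only session data.
function read_session_old(string $id)
{
    if (!acquire_lock($id)) {
        return redis_get($id); // soft failure: data returned, writes discarded
    }
    return redis_get($id);
}

// After: failing to obtain the lock returns a hard failure.
function read_session_new(string $id)
{
    if (!acquire_lock($id)) {
        return false; // hard failure: the session appears not to exist
    }
    return redis_get($id);
}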

Checking out the code and making the change locally confirmed that this resolves the issue. The issue did highlight that this small number of sessions has always been problematic: if something changes in the session on those page views, it won't be saved because the session is read only. In our use case, that is much less problematic than the session appearing not to exist.

Conclusion

The upstream update changed a soft failure into a hard failure - that is, when the session lock cannot be obtained, it went from failing with session data to failing without. The commit which introduced this issue was about trying to make the errors more verbose, so the end user has a better idea of what is happening internally.

The 5.3.7 version shows this when trying to write.

Warning: Unknown: Failed to write session data (redis). Please verify that the current setting of session.save_path is correct (redis:6379) in Unknown on line 0

The "Unknown on line 0" is not too helpful for debugging, but the rest of the error at least points to the correct place. In my opinion it's more correct to fail this way, as the issue is not reading the session but writing it. In general, reads happen far more often than writes, so having the error occur only on write makes more sense. If you fail to read, you don't know who the user is at that point, so you can't handle it gracefully; having the user's information gives you more options (writing to a database, for example).

This was an interesting problem to solve as it appeared to be intermittent, yet after replicating the issue the fix was quite simple. The process of reducing the problem to the smallest reproducible case helped a huge amount in being able to find and fix it.

About Sykes

The techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· 5 min read
Scott Dutton

At Sykes we store data in S3, and we currently have a large amount of data which needs to be managed according to various lifecycles.

I was recently looking at some of the reasons we store data in S3, and investigating how we could store it more cost-effectively. For example, some data needs to be accessed frequently, while other data could be stored in Glacier Deep Archive, as it's important we keep the data but it does not need to be online.

We have objects in the same bucket which require different storage methods. We can lifecycle them based on their tags, which is fine for new objects that we can write with the correct tag, but we have lots of old objects which still need tagging.

The tagging

Tagging objects in bulk is a solved problem; questions such as this one on Stack Overflow show how bulk tagging works.

aws s3api list-objects --bucket your-bucket-name --query 'Contents[].{Key:Key}' --output text | xargs -n 1 aws s3api put-object-tagging  --bucket your-bucket-name --tagging 'TagSet=[{Key=colour,Value=blue}]' --key

Breaking this down a little: we list all of the objects, then pass each key to another s3api command to apply the tag. We want to filter the objects, so we add a grep in between.

aws s3api list-objects --bucket your-bucket-name --query 'Contents[].{Key:Key}' --output text |\
grep "some-term" |\
xargs -n 1 aws s3api put-object-tagging --bucket your-bucket-name --tagging 'TagSet=[{Key=colour,Value=blue}]' --key

This works well on a small scale. I could see from the output that it was working, but it would have taken days to tag all of the objects.

Scaling this method

On a small scale (1,000 objects), tagging takes 12.5 minutes. That comes out at 1.3 objects per second (1,000 objects in 750 seconds), which is actually quite slow.

In effect, the script works the same as below:

aws s3api put-object-tagging object1
aws s3api put-object-tagging object2

The AWS CLI is a wrapper around an HTTPS API, and seeing the commands laid out like that makes it clear that the HTTPS connection can't be reused between invocations. Each request therefore costs a DNS query, TCP negotiation and TLS negotiation before the HTTP request and response even happen.
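The overhead is easy to see from PHP (the language we use elsewhere on this blog). A minimal sketch - the bucket URL is illustrative - showing that reusing one curl handle lets libcurl keep the TCP/TLS connection alive between requests:

<?php

$ch = curl_init('https://your-bucket-name.s3-eu-west-1.amazonaws.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

foreach ([1, 2] as $attempt) {
    curl_exec($ch);
    printf(
        "request %d: dns %.3fs, tcp %.3fs, tls %.3fs\n",
        $attempt,
        curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME),
        curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        curl_getinfo($ch, CURLINFO_APPCONNECT_TIME)
    );
}
// The second request reports (near-)zero handshake times because the
// connection is reused - which is what the nginx keepalive proxy below
// provides for the AWS CLI.

Each fresh CLI process, by contrast, is the equivalent of a brand new curl handle.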

Previous work on reducing HTTPS overhead has shown that using nginx as a reverse proxy, with a keepalive connection to the backend HTTPS server, gives real performance increases.

A sample nginx config which allows this is below:

upstream s3 {
    keepalive 100;

    server your-bucket-name.s3-eu-west-1.amazonaws.com:443;
}

server {
    listen 80;
    server_name s3.localhost;

    location / {
        proxy_set_header Host your-bucket-name.s3-eu-west-1.amazonaws.com;
        # upstream keepalive only engages over HTTP/1.1 with the Connection header cleared
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass https://s3;
    }
}


And a docker command to use this (assuming it's saved to a file called nginx.conf in your current working directory):

docker run -v `pwd`/nginx.conf:/etc/nginx/conf.d/default.conf:ro -p 12345:80 nginx

This brings up a docker container on host port 12345 which keeps a connection alive to S3, removing the per-request DNS lookup, TCP handshake and TLS handshake.

Anything under .localhost resolves to 127.0.0.1, so it's a great zero-config way of getting this to work. We can then override the endpoint the AWS CLI uses by passing --endpoint http://s3.localhost:12345, so requests go through the local proxy instead.

Running the same 1,000 objects through now completes in 7 minutes 1.6 seconds - a 45% reduction in time taken! This increases the rate to 2.37 objects per second.

Going even faster

xargs by default runs one process at a time, so we were making one request at a time. nginx is designed to handle many concurrent requests, so we can start to parallelise the work.

xargs has a very nice flag which allows this: -P {number} will run that many processes in parallel, taking advantage of the fact that almost every machine has more than one thread available.

You can choose any number for this. I have 16 CPUs available on my machine (on Linux, running nproc will confirm); I chose 10 as a test number. This leaves the final command as:

aws s3api list-objects --bucket your-bucket-name --query 'Contents[].{Key:Key}' --output text |\
grep "some-term" |\
xargs -n 1 -P 10 aws --endpoint http://s3.localhost:12345 s3api put-object-tagging --bucket your-bucket-name --tagging 'TagSet=[{Key=colour,Value=blue}]' --key

and that now runs in 45 seconds for the same 1,000 test objects, bringing the rate to 22.2 requests per second!

Conclusion

Connections, especially TLS connections, take time to establish, and for any large number of requests (whether automated, or arising naturally when an HTTPS connection is made as part of an API call) they can be reused for a similar improvement. It's transparent to the end service (in this case S3), but it does require an extra service in place which could go wrong.

We went from 1.3 objects per second to 22.2 objects per second (roughly a 17x speed-up), and wall time from 12 minutes 30 seconds to 45 seconds, a reduction of 94% - all from a minor tweak to the connections.

About Sykes

The techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· 7 min read
Scott Dutton

This article explains how we prepared the website for an unknown amount of traffic.

About Sykes

The techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· One min read

Sykes is a platform that delivers for our customers and owners alike, at scale. So far this year we have delivered over half a billion impressions (individual search results); this is 27% up on 2020 (still pre-pandemic at this point…


Digging into this data a little deeper, we can see that we are serving more than just cottages. Dividing impressions by the number of properties, caravans have on average received the highest proportion of impressions per unit.


However, breaking down our properties by type reminds us that cottages remain our bread and butter, though we have a healthy selection available for whatever our customers fancy.


If you would like to join our fantastic team of colleagues, please get in touch!

· 2 min read

Today I'm looking at how the powerhouse of our business - our people - has changed since I joined Sykes in 2016. We've gone from two hundred people in two offices to nearly one thousand people based across over twenty offices spanning two continents in just five years. That's phenomenal growth.

Every year we've increased headcount, even in 2020. In the first two months of 2022, we've already achieved the headcount increase we made in 2021. Assuming we continue on this trajectory, 2022 looks set to be a record-breaking year for new colleagues joining the Sykes family.   


It's interesting to see which teams are driving this growth. Our Technology and Product teams have each expanded by 200% since 2016. This capacity ensures we can optimise our internal systems whilst continuing to push the envelope. We've created an effective UX team to optimise our online experience and launched ambitious infrastructure projects, such as setting the whole business up for hybrid working.

We've developed whole new functions, such as our Digital Media and Integration teams. These teams reflect our ambition - we're dedicating people to ensuring we do a great job of bringing new brands into the Sykes family - and our commitment to quality. Building a dedicated Digital Media team means we've been able to showcase our fantastic cottages with high-quality images. We're now strengthening that team with more people to develop exciting new channels to communicate with our customers, owners and people. 

But despite our push in the technology and digital space, the personal touch matters more than ever. I was interested to see that - despite over 80% of our business taking place online - we've still grown our Reservations teams by over 200% in five years. Even in this increasingly online world, we still need more great people than ever providing expert advice that helps customers plan their perfect holiday. 

If you would like to join our fantastic team of colleagues, please get in touch!

· 2 min read

Since Sykes migrated to Google Analytics 4 (GA4), we have been able to compare website and app events more easily. This data, from 25th December 2021 to 23rd January 2022, highlights the importance and power of mobile apps as a channel. During this period the website had over 16 million views, while the apps had just 642,000 (around 4%), although the apps' booking share was approximately 7.5%.

While mobile app booking share may only be around 7.5% (and growing), we see some significant differences in the number of key events per user:

                                             App Events per User    Web Events per User
Sessions per month                           10                     2
Searches per month                           17                     6
Property views per month                     33                     6
Properties added to favourites per month     7                      9
Average Engagement Time                      16 minutes             6 minutes


Aligning our analytics across platforms has enabled us to get a broader picture of our customers' journeys. We are continuing to improve and standardise our analytics to gain more insight into multi-channel behaviour, helping us understand and support our customers. User research and behavioural data gathered from Google Analytics and other sources drive our decision making and are integral to our success at Sykes.

About the Sykes App Team

Sykes has an in-house app development team of 4 developers and 1 tester, working on native applications for iOS and Android implemented in Swift and Kotlin. We maintain a crash-free rate of over 99.9% and our apps have a store rating of 4.7 stars. We use Google's Firebase SDK for tracking, which integrates with Sykes' Google Analytics dashboards. The apps support deep linking and push notifications, as well as home-screen widgets on iOS devices. For iOS we use Clean Swift and the VIP pattern, and for Android we use the MVVM pattern.

We strive to give the best possible experience to customers when they plan, book and enjoy their perfect holiday.

You can find our apps here:

Android: https://play.google.com/store/apps/details?id=uk.co.sykes

iOS: https://apps.apple.com/gb/app/sykes-holiday-cottages/id1263445398

About Sykes

The techniques shared in this article were implemented by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· 2 min read

We collect event data in order to optimise our website and ensure a world-class booking experience for our customers. To take us to the next level, last year we invested in a new platform using Kafka and Snowflake. During the busiest period in our history we streamed nearly 5 million events in a single day.


This is a key new capability for Sykes, and we are using it to turbocharge our A/B testing, ensuring we continue to deliver a seamless booking experience for our customers.

During the same window our Product teams have been incredibly busy experimenting and testing new features on the website; as you can see, there are well over 100 experiments running, peaking at 134 earlier in the month.


With this capability in place, we can gain a much greater understanding of our customers’ needs and their behaviour through the booking journey, in real time. The Product team are fully self-sufficient with a scalable platform in place.

About Sykes

The data and techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· 2 min read

Customer satisfaction is a passion for us at Sykes, and NPS (Net Promoter Score) is something we obsess over - in fact, it is in everyone's objectives. This year we have maintained an NPS of 74 since the start of January, and satisfaction levels are climbing.


Finding a true industry benchmark is challenging, but we are incredibly proud of our results. According to a SurveyMonkey article, 74 puts us well inside the top 25% of companies across all industries. Our response volume is high, with over 40% of customers leaving feedback.


A key driver of NPS for us is the experience our customers have when calling in. As you can see, on average we answered over 95% of all calls across all lines, with not a single line dropping below 91%, again measured during our busiest trading period. This goes to show the hard work of our operations team and the effectiveness of our resource planning. We have answered over 40,000 customer calls since 1st January.

About Sykes

The data and techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· One min read

A key feature of our technology platform here at Sykes is how quickly our systems can respond to changes in customer demand. Last week we showed how we're seeing a surge in demand for short-lead breaks. This time we've pulled together a quick viz to show the impact that surge in demand has had on average income for our owners.


This simple viz shows the year-on-year % difference in average owner income per booking for holidays in January. As you can see, there was healthy growth in 2020 versus 2019, but this has been dwarfed by the increases our owners have enjoyed in 2022. Further evidence of the advantages of being on board with Sykes and our award-winning technology platform. Our tools and team work together to ensure we balance booking volume and revenue growth. We have come a very long way from 52-week brochure pricing.


About Sykes

The data and techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.

· One min read

Following on from last week, we have put together a quick viz to look at demand (searches) over the past week and how it compares to the same period in 2020 (pre-Covid, so demand patterns were still normal then).


With the exception of the Lake District, across virtually all regions we are still seeing much more demand for the early months of the year this week.

About Sykes

The data and techniques shared in this article were produced by the development team at Sykes. If you are a talented Data Scientist, Analyst or Developer please check out our current vacancies.