System tracing with AWS X-Ray

Knowing whether your application is running fast or slow is one thing; understanding what makes up those characteristics is another. This is not a new problem: APM (Application Performance Monitoring) tools like New Relic and AppDynamics will happily take your money to make it easier, and have done so for a while. However, the problem gets even more complicated in a serverless environment, where integration points are not always totally under your control. This is a frustration of mine, as AWS Lambda can often be an opaque beast.

AWS X-Ray is a lightweight entrant into this field; for Lambda, the extent of the configuration is a tick box in the ‘advanced settings’ section of the ‘config’ tab. The feature switch will even add the required IAM permissions for you.
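For anyone who would rather not click through the console, here’s a minimal sketch of flipping the same switch programmatically, assuming the AWS SDK for Java v1; the function name is a placeholder. Note that, unlike the console tick box, this won’t add the IAM permissions for you.

```java
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.TracingConfig;
import com.amazonaws.services.lambda.model.TracingMode;
import com.amazonaws.services.lambda.model.UpdateFunctionConfigurationRequest;

public class EnableActiveTracing {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

        // Setting the tracing mode to Active is the programmatic
        // equivalent of ticking the box ("my-function" is hypothetical).
        lambda.updateFunctionConfiguration(new UpdateFunctionConfigurationRequest()
                .withFunctionName("my-function")
                .withTracingConfig(new TracingConfig().withMode(TracingMode.Active)));
    }
}
```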

With this enabled, each request will now produce a trace.

Now for the interesting bits: we can see that this Lambda was cold and took a while to start executing. We can also see where Amazon have called out some of their ‘initialization’ time.

Slow Lambda execution

My Lambda includes a call to a RESTful API, but the default X-Ray setup won’t show this. I especially want this called out in my monitoring because the external dependency has the potential to affect my service. To do that I need to create a sub-segment. Amazon, as ever, have thought of everything and provide a decorated version of HttpClient with their own X-Ray integration; it’s literally a drop-in replacement with no further configuration. With this addition I get a much clearer view of the profile of my service: the remote API call makes up a large percentage of the total time taken.

Lambda execution with HttpClient augmented
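As an illustration of just how small the change is, here’s a minimal sketch assuming the X-Ray SDK for Java’s Apache HttpClient proxy (the aws-xray-recorder-sdk-apache-http module); the endpoint is hypothetical. The only difference from a vanilla Apache client is the import.

```java
import java.io.IOException;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

// Previously: import org.apache.http.impl.client.HttpClientBuilder;
import com.amazonaws.xray.proxies.apache.http.HttpClientBuilder;

public class TracedApiCall {
    public String fetchQuote() throws IOException {
        // Built exactly like the vanilla Apache client; each outbound
        // request is now recorded as a sub-segment on the current trace.
        CloseableHttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet("https://api.example.com/quotes"); // hypothetical endpoint
        return client.execute(request, response ->
                EntityUtils.toString(response.getEntity()));
    }
}
```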

Amazon does support creating your own custom sub-segments, which is something I want to try out soon and maybe write about later.
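From the documentation, the X-Ray SDK for Java exposes this through AWSXRay.beginSubsegment and endSubsegment. Here’s a rough sketch of how I’d expect to wrap a chunk of work; the sub-segment name and the work itself are made up.

```java
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class CustomSubsegmentExample {
    public void settleBets() {
        // Opens a new sub-segment on the current trace
        // ("settle-bets" is a hypothetical name).
        Subsegment subsegment = AWSXRay.beginSubsegment("settle-bets");
        try {
            // ...the work you want to appear as its own bar in the trace...
        } catch (RuntimeException e) {
            subsegment.addException(e); // mark the sub-segment as faulted
            throw e;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}
```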

The other interesting feature I’ve not yet tried, but which is also on the backlog for investigation, is attaching metadata to the segments and sub-segments. I can imagine this being really useful: adding pertinent data to traces could then help when investigating themes of slowness, for example.
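As I understand it from the docs, the Java SDK actually splits this into annotations, which are indexed so traces can be filtered on them, and metadata, which is recorded but not indexed. A sketch, with made-up keys and values:

```java
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class AnnotatedSubsegmentExample {
    public void lookUpPrices(String customerTier, int basketSize) {
        Subsegment subsegment = AWSXRay.beginSubsegment("price-lookup"); // hypothetical name
        try {
            // Annotations are indexed, so you could filter traces to hunt
            // for themes of slowness among, say, one tier of customer.
            subsegment.putAnnotation("customerTier", customerTier);

            // Metadata travels with the trace but is not searchable.
            subsegment.putMetadata("debug", "basketSize", basketSize);

            // ...the actual work...
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}
```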

Finally, another freebie from Amazon: out of the box, a service map is created linking together any associated resources it can find from your traces. The green around each service is a ring chart of % success; these turn yellow if there are problems.

I could imagine a slightly more sophisticated version of this being a fantastic DevOps dashboard.

Linked services

AWS CodeStar

I try to read about new AWS services as I hear about them through Amazon’s newsletters. The first two pages of the CodeStar tutorial genuinely delighted me!

This is what I want as a developer of small projects. I’m very excited to try it out!

I say small projects because I would probably want more flexibility over where my source code is stored if I were thinking about enterprise-size projects.

Reporting Squad Confidence

As a software team lead, two of the most common things I’m asked are “How is everything going?” and “Will project x be ready by date y?”, and answering them is something I’ve always struggled with. Everyone who has worked in software delivery will understand that this type of work is difficult to estimate. It is especially true of teams working in a new area, where there is no prior experience of the task. This is often where my current team sits.

Over my career, this question has manifested in a number of different ways. In the time before Agile was widely adopted, it was the project manager who thought that asking more often would increase the chance of a different answer. Currently, at Sky Betting & Gaming, the question is turned around: instead of being pestered for an answer, I report a status of Red, Amber or Green for the work the team is doing and include a short summary of completed work and next steps. This is absolutely an improvement, but it still felt very wrong to me. At the start of a piece of work it’s easy to go along with the new energy and report Green, when it might be better to report Amber or Red since there are potentially a lot of unknown risks.

Recently, with the help of Sean from Optimise Agility, we’ve been working through some of the pain points at SB&G, and I spoke to him about the interface between the leadership team, with their need for assurance that the work is in hand, and the delivery team, with their need to actually do the work.

The traffic light strategy didn’t contain the information we wanted to give. I wanted to show how our iterative delivery approach didn’t match up with the Gantt-chart-style product roadmaps.

We decided to try an experiment with the delivery team, asking them a simple question: “Based on a release at the end of May, what confidence do you have that the following will be in production, ready to be used?” The answers were given on a scale ranging from “0 – No Chance” to “10 – Dead Cert!”

Bar graph of confidences
Blurred because the specifics might not make much sense.

This gave us a quick, data-driven and cohesive expression of the remaining work. It felt right, but having the data wasn’t enough; we needed to show other people that we were thinking about this and tell them. That became another quick experiment, with the data being shown on some spare wall at the edge of our team space.

So far the feedback has all been very positive; the experiment will continue, and we’ll fine-tune this new approach over time.