Resources
Session slide deck
PDF · download
Web Vitals documentation
Docs · embrace.io
Custom dashboards documentation
Docs · embrace.io
Embrace Web RUM — product overview
Product · embrace.io
Multi-query & custom formulas
Blog · embrace.io
0:00
Cliff Crocker
Before we get started, I want to go ahead and introduce my friend and colleague, Andy Davies. Andy is someone I've been working with for most of my career in web performance. I've probably learned more about the web from Andy than anyone else I can really attribute it to. He's been around a long time, for those veterans who remember his talks at Velocity Conference, and he's recently been speaking at performance.now() and multiple events in the EU. He's authored some books as well. I love to tell people that Andy literally wrote the book on WebPageTest, along with Rick and Marcel, I think it was, back in the day, and it's still a heavily referenced book.
0:43
Cliff Crocker
And the Pocket Guide to Web Performance, I think, was another one he put out, and I'm probably missing others, but… I've had a great time working with Andy. I've been chasing him to work with me for quite a while, actually. We've been working together for the last 5 years now, but prior to that, I tried to recruit him many times. It took me going to SpeedCurve to actually bring him on.
1:06
Cliff Crocker
And now we're together at Embrace, where Andy's gone from being a web performance consultant into a product role, leading product for Embrace when it comes to all things web. So, really excited to be chatting with Andy today. I call him Professor Andy, because I always learn something from him.
1:25
Andy Davies
Thanks, Cliff. I don't know quite how to follow that. I've known Cliff for 14 years, which is quite a long time in this industry. I remember the first big web performance conference I went to in the US, in Santa Clara, at Velocity in 2012, and…
1:46
Andy Davies
The person who was running the workshop on the first day was Cliff. And, yeah, as Cliff says, he's tried to hire me a few times; they all fell apart, and none of them were for Cliff's reasons, none of them were Cliff's fault. But one of the things I find quite inspiring about Cliff and our various journeys together is, before the call, we were counting up how many RUM products we've been involved with building between us. I'm on 4, and Cliff's on 5, so he's ahead of me on that front, and he's got a bit more experience there. But, you know, it's a pleasure to work with him, and it's great to be able to work on the product at Embrace.
2:38
Cliff Crocker
Thanks, Andy. And we really wanted to call this out. A lot of people refer to this as the brag slide. I actually look at it quite differently: this is the thank-you slide, because if it weren't for all these wonderful partners that we have, great brands and businesses that we work with…
2:54
Cliff Crocker
None of this would be possible. The work that Andy and I have been doing for all these years really would be to no end. So, I really thank all these customers that we get the opportunity to work with, as we help them strive to make all of their user experiences better.
3:11
Cliff Crocker
And create not only a better web, but a better mobile experience as well when it comes to native applications. And this is us: we're Embrace now. So this is pretty awesome. When Andy and I both started here, a lot of the focus initially with Embrace was mobile, really around the OpenTelemetry observability space, focusing on native RUM. Now our purview has expanded to think about all things user-focused, especially when we think about people on screens. So whether it's a mobile device, or a web property, or you're looking at synthetic monitoring, it's all under the Embrace umbrella, one big happy family here. We're really focused on changing the game in terms of the focus of observability, which traditionally has been a little more back-end focused. Leading with user-focused observability, we believe, is the right path forward, and it goes along with what Andy and I have always felt when it comes to trying to understand and improve the user experience.
4:28
Cliff Crocker
We've got some stuff to talk about today. We're probably going to try to get to the demo fairly quickly, because Professor Andy (I'll stop saying that) has a lot to show us in terms of what the product capabilities are within Embrace, and what we've been up to over the last 3 to 6 months here.
4:47
Cliff Crocker
So today we're going to be talking about Web RUM, and we'll talk about why we're specifying it as Web RUM when Andy gets to that. We're going to touch a little bit on OpenTelemetry in terms of how it relates to our focus within web. Then Andy's going to get into different use cases and actually drive the discussion from the product, to talk about how Web RUM is used. Then we'll hit a little bit on what's coming next, what's on the immediate backlog as we continue to build out the web experience at Embrace. And finally, we'll have some time for Q&A. There was a really active discussion yesterday; loved seeing the activity in the chat for the synthetic discussion, and I'd encourage everybody to continue with that today as well. We love questions, and we'll follow up if we run out of time and aren't able to address all of them.
5:36
Cliff Crocker
So let's get started!
5:39
Andy Davies
So, to begin with, we're going to talk about Web RUM. At Embrace, we refer to it as Web RUM to differentiate it from RUM for native mobile apps. We have this slight challenge: as a company, as a team, we've been used to talking about mobile, and when you talk to web people about mobile, they mean, yes, we have a mobile website, we do mobile. When you talk to native app people about mobile, yes, they do mobile because they have a native app. So what we've done is differentiate between, essentially, RUM for apps and RUM for the web, just to make that clear internally within ourselves and to our customers. And to begin with, I'd like to go back to what RUM gives us. I've always loved RUM, and I've always been a great believer in its importance. Because what RUM does is give us the ability to understand the experience we're delivering to our users, to our visitors, to our shoppers, whoever is using our site. We get detail on that experience across all the different types of devices they use, across the different network connection types they've got. And so we build up this rich picture of how our users experience our site and our web apps, but also how that experience drives the way they behave. And just to recap, we're going to compare it back to the synthetic side that Cliff talked about yesterday.
7:31
Cliff Crocker
We kind of set the stage with this a little bit yesterday, if you attended that or are listening to it pre-recorded. The terms synthetic and RUM are often interchanged with lab data and field data. So, just to refresh: synthetic data, and the use case there, is this clean-room environment where we're controlling all the variables and able to test something under different conditions, and that's why we refer to it more as a lab.
7:58
Cliff Crocker
And with WebRum, we refer to that a little bit more as field data.
8:04
Andy Davies
And we refer to it as field data because it's not from the lab, it's from the real world. But I have a little bit of a challenge with referring to it as field data, because when I think about fields, I think of green pasture, or wheat, or corn, and it paints this picture of simplicity, a field of all the same crop. Whereas I think about RUM as being more of a forest. We can look at a forest from above and see its size and a bit of the variety in it; we can begin to get an overall picture. But it's not until we start to drill down that we begin to discover its variety, whether that's a population of people who are using an old browser, or a population of people who are coming from a certain part of the world. So we can drill into RUM data in the same way as we can explore a forest, and begin to find beautiful things and rich variety.
9:16
Andy Davies
And then… We can actually go down to individual plants, in this case, or individual user sessions, to begin to understand what made that session different, what makes that species of plant different.
9:35
Andy Davies
to everything else. So we can start at this high level and drill right down to this low level. It's great having all this contextual data, but what we really need to focus on is what matters. Where are our visitors experiencing pain? Where are they experiencing slowness? Where are they having crashes or exceptions? Which network requests are failing? So, the aim of RUM, in our eyes, is to help people solve problems.
10:13
Cliff Crocker
And this is done in a lot of ways. Historically, a lot of us that have been in this space, whether it's our first, second, third, fourth, or fifth RUM product we've been building, have really been dependent on the browser APIs, the great work that the W3C Web Performance Working Group has done to advance the APIs we have in the browser to get those user-focused metrics and the data that we care about. One thing that's different at Embrace, in the spirit of always learning, is that we've still got everything we're building on in terms of the browser, but we've also opened up a world where OpenTelemetry on the client side is now possible, is now a thing.
10:56
Cliff Crocker
So, when we think about it, this does a lot of things for us. OpenTelemetry, if you're not familiar, is essentially an open-source standard, a way of collecting telemetry data across distributed systems that allows you to tie all these traces and end-to-end things together. So, really connecting from the front end, the user experience, directly to the back-end root cause through a distributed trace. By design, OpenTelemetry is also vendor-neutral. Another thing we like to focus on around open standards and open source is the ability to instrument once and continue to use that, not having vendor lock-in or such things which could hinder a group or cause an interruption to your metrics or the data you're capturing. The idea of instrumenting once then allows us to trace that through to the back end, whether we're working with our own observability stack that we've built out ourselves, or with multiple partners, like Elastic, or Honeycomb, or Chronosphere, or other things on the back end.
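As a sketch of the stitching Cliff describes: OpenTelemetry propagates a W3C Trace Context `traceparent` header from the browser to the back end, so front-end and back-end spans share one trace ID. This is illustrative plain JavaScript, not Embrace's SDK; a real setup would use the OpenTelemetry JS SDK, which generates and propagates this header for you.

```javascript
// Sketch: the W3C Trace Context "traceparent" header that OpenTelemetry
// propagates from browser to back end, so a front-end span and the
// back-end work it triggered share one trace ID.
const randomHex = (bytes) =>
  Array.from({ length: bytes * 2 }, () =>
    Math.floor(Math.random() * 16).toString(16)
  ).join("");

function makeTraceContext() {
  return {
    traceId: randomHex(16), // 32 hex chars, shared by every span in the trace
    spanId: randomHex(8),   // 16 hex chars, identifies the calling span
  };
}

// Format: version "00", trace-id, parent-id (span), flags "01" = sampled
function formatTraceparent({ traceId, spanId }) {
  return `00-${traceId}-${spanId}-01`;
}

const header = formatTraceparent(makeTraceContext());
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

Any back-end service that receives this header can parent its spans to the browser's span, which is what makes the end-to-end view possible.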
12:06
Cliff Crocker
In the case of Embrace, that allows us to capture all that telemetry data, in the form of all the browser APIs and monitoring we're grabbing, as well as spans and traces and logs, and send those across to Embrace, where it can be stitched back and forth between a back-end OpenTelemetry service and Embrace, giving you a true end-to-end view of your data. But let's go ahead and jump into some of the practical use cases. Everybody came here to see the demo; we want to show a lot of what we've been up to as we've been building out the web product, and Andy's going to walk us through a few of these use cases and scenarios.
12:47
Andy Davies
Yeah, the use cases I'm going to walk through today are, you know, how do we connect performance to business outcomes? How do we enable engineering teams to ship faster and with confidence? How do we find and fix issues fast? And the one I'm going to start with is, how do we go about shipping the best
13:06
Andy Davies
experience on the web. Embrace as a product has lots of features, and what I'm going to do today is take you through the highlights of the web-related features and how they help solve these use cases. Where we really start is with Web Vitals. A lot of the industry focuses just on Core Web Vitals: Largest Contentful Paint, when the largest piece of content is painted to the screen; Interaction to Next Paint as a measure of responsiveness; and Cumulative Layout Shift as a measure of how much content moves around as the page loads, or as somebody scrolls. But to us, there's more than that. There are moments that matter before Largest Contentful Paint, because until the browser has some HTML content, it has no work to do. So if the server is slow delivering that HTML content, then there will be a delay for everything else. Then, when does content first get shown to the visitor? It's no good delivering the HTML really quickly if it then takes longer to show it, and we'll explore some of these metrics. The other metric that we came up with: synthetic has something called Total Blocking Time, which is kind of an estimate of when the main thread is free enough
14:36
Andy Davies
to handle user interaction. As part of the Long Animation Frames API, there's a blocking duration, which measures, for each long frame, how much time it was unable to respond to user interaction. So what we did is total those up throughout the page lifecycle, so that we can provide a measure from Chrome that tells you how long, in total, your scripts are preventing people from interacting. So, for those six vitals, what we've done is introduce a tab arrangement with a red, amber, green status to show where you are, or where the site is, against the ratings. And then, in the overview, we show different ways of looking at that data to give engineering teams an overview and help guide them to where
15:39
Andy Davies
they may want to look next. By default, we group the data we have together by page groups. In retail terms, a retailer might have a page group for product pages, another one for their category pages, one for the home page, for listings, etc. A news publisher might have one for their homepage, one for the sections of the…
16:05
Andy Davies
news site, and one for individual articles. Indeed, they may have many different types for articles, because publishers often have a main article with just a headline and a hero; sometimes they have live articles related to an event that's going on, whether it's sport or the politics of the day; and they have video articles. So it just gives people a way of grouping pages together, so that the data for pages that are built in the same way is grouped together, and we can begin to identify the ones we want to focus on. You'll see all of these are colour-coded, and it's the Web Vitals rating at the 75th percentile: green is obviously good, yellow is needs improvement, and red is poor. You'll notice some of them are black. So this one, for example, and…
17:03
Andy Davies
What we do, where there's not enough data to produce a statistically relevant measurement, is highlight them by colouring them black, so that people don't go on wild goose chases, chasing, for example, a really high 3.9-second First Contentful Paint that, in this case, comes from 93 samples. So I'm not going to spend too much time chasing that. As part of the overview, we begin to show people the breakdown between desktop and mobile; the data for the app we're looking at here is primarily desktop, and for more consumer-focused sites I'd expect to see more mobile data. We can show it by browser, and we show where people are coming from in the world, to understand that. In our example we've got North America, and we've got quite a large population from El Salvador, and perhaps we want to focus on how we improve the experience for those people in El Salvador. But, going back up: we can obviously go across the different tabs and look at the data, but we can also say, for example, this LCP value up here, that's 2.8 seconds, perhaps I want to look at that in a bit more detail. By clicking on show details, we switch tabs, and it's already filtered down to that page group, so we're just looking at that page group here, and we can see the LCP is 3 seconds and how it varies over time. In the background of the chart, you'll see columns that show the sample size for each point in time, because what we often see in RUM data is the seasonality of how usage changes throughout the day. It seems pretty common all over the world, regardless of the site, that it gets quieter between the hours of about 2am and 6 or 7am. But we'll also see seasonality throughout the week, where some sites' performance characteristics change at the weekend.
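The colour ratings Andy described a moment ago, the 75th percentile measured against Google's published LCP thresholds, plus a blackout for low sample counts, might be sketched like this. The 50-sample minimum is an illustrative cut-off, not Embrace's actual rule:

```javascript
// Sketch: rating a Web Vital at the 75th percentile, greying out
// ("low-sample", shown black in the UI) groups without enough data.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function rateLcp(samplesMs, minSamples = 50) {
  if (samplesMs.length < minSamples) return "low-sample"; // not statistically relevant
  const p75 = percentile(samplesMs, 75);
  if (p75 <= 2500) return "good";              // green
  if (p75 <= 4000) return "needs-improvement"; // amber
  return "poor";                               // red
}

rateLcp(Array(100).fill(2000)); // → "good"
rateLcp(Array(10).fill(9000));  // → "low-sample": only 10 samples, don't chase it
```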
A few years ago, I did some work for a Dutch supermarket, and what we noticed in their data is that at the weekend, the traffic went from people using mobile to people using desktops, and the experience was actually slower, because people keep desktops and laptops for a long time, so a lot of these people were using older devices at home. So we can look at how LCP is varying over time. We get some other contextual data about the split between good, needs improvement, and poor, and how the distribution varies. This site here doesn't actually have to do that much work to improve its LCP value. And then we can begin to break it down, to help guide people: okay, so my LCP is 3 seconds, where is that time being spent? We can see, in this case, we get the HTML reasonably quickly; then there's the delay after the HTML's been delivered, before the browser discovers it needs to download an image; the time to download the image, which really isn't that big here; and then the time, once it's got the image, before it starts displaying it on the screen. So it begins to guide people as to where they might want to look. But we can also look at this data in different ways. We know that the page group is a group of pages, so we can split down and start to look by path, to see: okay, amongst all these different paths that are grouped together in this group, which is the one I really need to go look at? Which one is causing the problem? In this case, there is a clear culprit. That's not always the case; sometimes we see a spread, where they can all be bad. So we get this ability to go from that high level, drill down, find a group of pages that needs some work, and then drill into individual pages themselves. The other thing we do is expose some example sessions.
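The LCP breakdown Andy just walked through (waiting for the HTML, the delay before the image is discovered, the download itself, and the wait before paint) can be sketched with hypothetical millisecond timestamps relative to navigation start:

```javascript
// Sketch: splitting an LCP value into the four subparts described above.
// The timestamps here are invented for illustration.
function lcpSubparts({ ttfb, imageRequestStart, imageResponseEnd, lcpRenderTime }) {
  return {
    timeToFirstByte: ttfb,                                      // waiting for HTML
    resourceLoadDelay: imageRequestStart - ttfb,                // HTML arrived, image not yet requested
    resourceLoadDuration: imageResponseEnd - imageRequestStart, // downloading the image
    elementRenderDelay: lcpRenderTime - imageResponseEnd,       // downloaded, not yet painted
  };
}

const parts = lcpSubparts({
  ttfb: 600,
  imageRequestStart: 1800,
  imageResponseEnd: 2100,
  lcpRenderTime: 3000,
});
// → { timeToFirstByte: 600, resourceLoadDelay: 1200,
//     resourceLoadDuration: 300, elementRenderDelay: 900 }
```

The four parts sum to the 3-second LCP, which is what lets a dashboard show where the time went.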
So this top one here: somebody on Chrome desktop, LCP is 4.1 seconds, and we can actually drill in to understand more about this visitor's session. This begins to show the richness of the data we capture. Some of it is pure web-API-driven data, some of it comes from OpenTelemetry, like logs and spans, and some is data we stitch together, such as user flows. Once we've got this, we can drill down and begin to look at this person's session, and the requests that were being made before LCP, and whether a request was cached or, in some cases, whether it failed. We can see other things about this session too, so we can begin to see the individual network requests that were made. Unfortunately, I picked the wrong session and don't have one here, but we can also drill down into individual spans to start to show the span experience and how the components fit together, and we'll come back to that. Often I use the timeline to help people diagnose their issues, and sometimes also what's known as the trace and span experience, which I'll try and get to via another page in a minute. The other place that's often a good jumping-off point from here is to go into synthetic, and either run a one-off test or use the tests that are already running: drill down, find the page that has the high LCP, find a test, drill into it, and start to use a synthetic waterfall, zoomed in to LCP, to begin to diagnose the issue. And because Embrace's product, and we're using our own data here, is a single-page app, the time taken to bootstrap the app is significant, and that's what causes our slow LCP. But coming back to here: we've gone in, we've found a place where LCP is higher than we'd like it to be.
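Earlier, Andy described totalling up blocking durations from the Long Animation Frames API across the page lifecycle. A sketch with mock entries; in the browser these would come from a `PerformanceObserver` observing `"long-animation-frame"` entries:

```javascript
// Sketch: totalling blockingDuration across Long Animation Frame entries
// to get a field equivalent of Total Blocking Time, as described earlier.
// The entries here are mocks; in a page you would collect real ones via
// a PerformanceObserver for the "long-animation-frame" entry type.
function totalBlockingDuration(loafEntries) {
  return loafEntries.reduce((sum, e) => sum + e.blockingDuration, 0);
}

// blockingDuration is roughly the frame time beyond the 50 ms budget
const mockEntries = [
  { duration: 120, blockingDuration: 70 },
  { duration: 60, blockingDuration: 10 },
  { duration: 200, blockingDuration: 150 },
];
totalBlockingDuration(mockEntries); // → 230 ms the page couldn't respond
```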
We've used Embrace RUM to drill down, find the page, find some examples, drill into the timeline to see what's going on, and then we've been able to also run a synthetic test to understand what might be contributing to that slowness. That's kind of the optimization case. There's a flip side to this: how do I find issues when my site slows down? The way we enable that is by creating budgets. Budgets is a term Tammy has used for many years to describe thresholds that we want to try and stay within. What I did in this case is create a budget for LCP, and I set it to two and a half seconds, looking over the last 5 minutes. Every time this fires, and you can see it's fired a few times, I get an alert in my inbox, and I can click on that link and come back to the alert, or I can have the alert in Slack, for example. We've got some quality-of-life features we've still got to plug together to get this quickly back to the right place, but we can use the same approach: we can go back to Web Vitals, go to LCP, and pick a custom time, and I happen to know there was one between 1 and 1.30 this afternoon. We can go and find some of these exemplars that exceed those thresholds. We see, even in that small time window, we don't really have any statistically significant data, but we can go and find examples of slow LCPs, and we can drill back into that timeline view and begin to understand why this person might have had a slow experience. We've got other ways of looking at this data too, because sometimes it's not just a high Web Vital or a slow experience that impacts our users; sometimes it's the fact that they have an exception. And one of the challenges with the web is understanding whether an exception was significant or not. If you've ever opened DevTools in a browser
and looked at the console while you load a web page, it's amazing how many exceptions some websites throw. Sometimes they matter, and sometimes they don't. What we've tried to do is create a severity score to hint to engineering teams which exceptions they should really care about, and which ones they should drill into. We've got a weird one here where somebody's trying to connect to a Chrome extension, which is quite interesting. And bearing in mind I've drilled down to an hour's window here, we can drill down, and once we've identified those exceptions as important, we can drill in and take a closer look at them. We've got a whole series of examples of this exception happening; we can open it up, and, for customers who upload their source maps, we're able, using the stack trace, to identify where in the code the exception occurred, so people can get a good idea of where the exception is. We also expose exceptions and network errors, which I'll come onto shortly, via our MCP server, so you can use your AI tool of choice to query our data and begin to understand: what are the common exceptions you're seeing? Where are they occurring? Who are they occurring for? So it helps take some of the toil away and helps you explore that data further. The other area to look at is network errors, which we also track: the network errors created as people use this application. Again, we can drill down to a session, so we can see which sessions were affected by this error. Sometimes it's several, sometimes it's many, depending on the severity of the error. And again, we can drill down the timeline to give people the view of the whole context, and… nope, that's not what I wanted. So we can drill down to begin to look at this whole visitor's session on a page-by-page basis
and how network requests, exceptions, and other factors are affecting their session. One thing…
31:35
Cliff Crocker
One thing I wanted to point out, that we were chatting about earlier, from that exceptions view, if you wouldn't mind going back to that for a second. We sometimes take this for granted a little bit, but like you said, there are so many exceptions out there, and it creates a lot of toil. A lot of people are sort of like, okay, where do I start? Are these exceptions I should care about? The first thing I think is important is that we're focusing on the user impact here, as we've seen throughout the product: how many users, what percentage of users are impacted. But the severity score is worth another sort of double-click, because this, to me, was really differentiated when I first started learning about Embrace Web.
32:17
Cliff Crocker
And that's because not all exceptions are created equal. A lot of them are silent; people don't really deal with them; they're handled. What we've done is come up with an algorithm that combines several different scoring factors to allow us to understand which exceptions we should focus on, meaning which ones are actually probably going to be the most problematic, maybe even resulting in a user abandoning the session. So we're looking at things like: is it an unhandled exception? Is it a third-party or first-party exception? How close to the end of a session was it when the exception occurred, indicating maybe a dropout in the flow? It's an important thing to look at, because I think that sometimes, in our web performance background and world,
33:01
Cliff Crocker
People tend to think about exceptions as an afterthought and as toil. But certainly, we know that some exceptions can have a huge impact on the user experience. So that's what we're trying to accomplish here, and I think it's maybe just worth, you know, stating again.
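A sketch of the kind of scoring Cliff describes, combining the factors he lists. The weights here are invented for illustration; they are not Embrace's actual algorithm, which isn't spelled out in the session:

```javascript
// Sketch: an illustrative exception severity score. Weights are made up
// for this example; the real algorithm combines more signals.
function severityScore({ unhandled, firstParty, secondsBeforeSessionEnd }) {
  let score = 0;
  if (unhandled) score += 40;  // unhandled errors matter more than handled ones
  if (firstParty) score += 30; // your own code: more actionable than third-party
  // Exceptions close to the session's end hint at a user abandoning the flow
  if (secondsBeforeSessionEnd < 5) score += 30;
  else if (secondsBeforeSessionEnd < 30) score += 15;
  return score; // 0..100, higher = more likely to need attention
}

severityScore({ unhandled: true, firstParty: true, secondsBeforeSessionEnd: 2 });    // → 100
severityScore({ unhandled: false, firstParty: false, secondsBeforeSessionEnd: 300 }); // → 0
```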
33:18
Andy Davies
Yeah, and I think the other thing to keep in mind when we're talking about the impact of JavaScript is: we look at exceptions here, but also, when we're in the Vitals dashboard, we can begin using long animation frames to help people understand how much script is actually executing on each of our pages. This is the duration of all scripts for this group of pages, at the 75th percentile. And not only can we look at which pages have the most JavaScript on them, not necessarily by volume but by execution time; we can break it down and begin to drill in, on a script-by-script basis, to which are the ones having the most impact on our visitors' experience. Here, it tends to be the application scripts from Embrace that have the highest duration, but that's kind of to be expected, because this is a JavaScript-based dashboard, whereas some other scripts are less impactful here. But that's not always the case. I've worked with clients in the past where I was actually brought in to help them solve their third-party problems; they were in probably the best shape, third-party-wise, that I've ever seen, but by the time I left, their script problems weren't the third-party ones, it was the ones from their own application that were actually impacting visitors. So being able to see JavaScript execution in the browser, on people's devices, is kind of like a magic telescope that begins to fill in the piece of the picture we're missing. But just going back to the exceptions a bit: one of the things you can see here, as well as the exception severity, is how it differs versus the previous 7 days. And one of the things about helping engineering teams to ship code faster and ship it with confidence is understanding how the changes they're making
are affecting the quality of their app. To help with that, we have a feature called Release Health. Let me just pick a different time window so we can see. Here we can see two versions of our app, and over here we can see when we cut over from one version to the other, and that actually there are fewer errors and fewer exceptions with the newer version. But we have some slower OpenTelemetry spans, and we can dig into… nope, it's not going to go. Oh, that's not good. We can dig into the spans in the timeline experience to begin to understand what's happening during this period of time: not just where it started and where it ended, but what the dependencies were underneath it, what activity was taking place. And then, lastly, I just want to quickly talk about how we connect performance to business outcomes using user flows. You can think of a user flow as a mini-journey: a task that starts it, and a task that ends it. Often we might be measuring: is somebody going through a checkout flow? Is somebody getting from adding something to a basket through to checking out? Is somebody following the steps we intended? We can define a flow, and we can see how many of our visitors complete it. In this case, we can see most people actually abandoned the flow I chose to look at, but if there are error logs associated with it, or network errors associated with it, we can highlight those here, and then somebody can drill in and begin to understand, again, what happened from a timeline perspective. As part of that user flow, we begin to see spans, which are just named periods of time that represent work. And then we can actually drill down to the span experience; here, we don't have any child spans. There was this piece of activity, but no children.
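The user-flow idea Andy described, a start task, an end task, and a completion rate over sessions, might be sketched like this. The event names are hypothetical:

```javascript
// Sketch: a user flow as a start event and an end event, with completion
// and abandonment computed over sessions. Event names are made up.
function flowCompletionRate(sessions, startEvent, endEvent) {
  const started = sessions.filter((s) => s.events.includes(startEvent));
  const completed = started.filter((s) => s.events.includes(endEvent));
  return {
    started: started.length,
    completed: completed.length,
    abandoned: started.length - completed.length,
    completionRate: started.length ? completed.length / started.length : 0,
  };
}

const sessions = [
  { events: ["add_to_basket", "checkout_start", "order_confirmed"] },
  { events: ["add_to_basket", "checkout_start"] }, // abandoned
  { events: ["add_to_basket", "checkout_start"] }, // abandoned
  { events: ["homepage_view"] },                   // never entered the flow
];
flowCompletionRate(sessions, "checkout_start", "order_confirmed");
// → { started: 3, completed: 1, abandoned: 2, completionRate: 0.33… }
```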
But often there's a whole cascade, a waterfall of spans underneath, that helps inform us what's going on. And that's where we are at the moment. In terms of other ways of connecting performance to business outcomes, we're going to introduce more session intelligence and correlation charts, probably better correlation charts than we've had before; as we build on the rest of the Embrace team and the things they bring to us, we'll do a better job of tying performance to outcomes than we have before. But I've shown you what's here now, and we're not stopping here. We have a pretty busy quarter ahead of us. Just so you know Embrace's quarters: the current quarter started on the 1st of May; our first quarter starts on the 1st of Feb, because we don't quite follow the calendar year. So, these are the things we're already working on from a Web RUM point of view. More signals to help understand when our visitors are frustrated, or when they're engaging. With Chrome delivering platform-standardized support for single-page apps, we'll get out-of-the-box support for instrumenting them without having to do the manual instrumentation. More session-level details: more understanding of what a visitor's doing through their session, the experience they have through that session, and helping understand how that affects how they behave. Signals: there are endless signals you can collect from a performance point of view, and we prioritize the ones that are most important, but there are other diagnostics we'd like to add, to help people understand why their metrics are changing and who they're changing for. And then finally, at the moment, our MCP server exposes network errors and exceptions.
One of the things we're going to do in the quarter coming up is start to expose the Web Vitals data, as I've just shown you, so that people can begin to explore the data on their visitors' experience and couple it with things like Chrome's MCP server, to close that loop between what's happening in the wild and debugging locally.
43:23
Cliff Crocker
Do you want me to take this one, Andy, or do you want to take it?
43:25
Andy Davies
Yeah, you're the sampling king.
43:26
Cliff Crocker
I don't know about that. Well, Jason, thanks for the question, and I think you're absolutely right. A lot of people have thought about this across different products: why do you need 100% of the data if you're making decisions on something that might not require the level of granularity you've seen us go through today? Unfortunately, the answer is going to be: it depends.
43:51
Cliff Crocker
We did do quite a bit of analysis on this in our previous lives at Speedcurve, and we can send across some of the blog posts we did there. At a high level, the reason it depends is the use case. For a customer that really needs end-to-end telemetry and wants to focus on 100% of sessions, it's a much more difficult conversation about where you should sample. For a customer that maybe doesn't need that, or is just getting started, or has cost constraints, it's going to depend on what you're looking at and how you're slicing the data. If you want data around experimentation, for example, and you're only experimenting with 1% of your population, then obviously, if you're only sampling 5% of that 1%, you're going to get some pretty noisy data. That being said, for a larger enterprise customer, or a customer getting decent traffic, we've seen that the overall numbers, when we're looking at something like
44:47
Cliff Crocker
Core Web Vitals, which is still going to be limited to a subset of your browsers, can be achieved through, say, a 10 to 20% sample rate. That being said, of course I'm going to tell you the recommendation is 100%. But in short, if you can't get there, or cost is a problem, I think you just have to lower some of your expectations, or think about the cross-section of data you want to look at: number of page views, number of browsers, all that. It becomes more limited the more you sample down, and you get a less representative
45:27
Cliff Crocker
set of data that's not necessarily statistically valid. That's a big reason behind why we show things the way we do in Embrace, where we flag low sample sizes, because we don't want you making business decisions on something that, quite frankly, doesn't have a lot of validity.
45:43
Cliff Crocker
So just something to keep in mind. We will follow up with some of those resources we've got, where we've covered this in greater detail, to give you some more perspective. And yes, it is supported in the platform.
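As a rough illustration of Cliff's point about sampling an experiment, here's a back-of-the-envelope sketch. The traffic numbers and the normal-approximation error margin are illustrative assumptions, not Embrace guidance:

```javascript
// Back-of-the-envelope check on how RUM sampling interacts with an
// experiment. All numbers are made up for illustration.

// Sessions you actually capture for an experiment variant.
function effectiveSessions(totalSessions, experimentFraction, sampleRate) {
  return Math.round(totalSessions * experimentFraction * sampleRate);
}

// Rough 95% confidence-interval half-width for a proportion, worst case
// p = 0.5 — the usual reason small samples look "noisy" (shrinks ~1/sqrt(n)).
function approxErrorMargin(n) {
  return 1.96 * Math.sqrt(0.25 / n);
}

const total = 1_000_000; // hypothetical monthly sessions

// Experiment on 1% of traffic, RUM sampled at 5%:
const nSampled = effectiveSessions(total, 0.01, 0.05); // 500 sessions
// Same experiment with 100% RUM capture:
const nFull = effectiveSessions(total, 0.01, 1.0); // 10,000 sessions

console.log(nSampled, approxErrorMargin(nSampled)); // ~±4.4 percentage points
console.log(nFull, approxErrorMargin(nFull));       // ~±1.0 percentage point
```

Sampling 5% of a 1% experiment leaves only 500 sessions, and the error margin roughly quadruples compared to full capture, which is exactly the "noisy data" problem described above.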
46:13
Andy Davies
I'll take that. So, one of the things to think about in web performance terms is that we have something called a performance measure, and a performance measure measures the duration between two points in time. But that's all it does. It has a name, a start time, and a duration. It allows you to custom-instrument your app, but it's a really simple view. Often, within that duration, you want to know what else happened: what XHR requests were made, what scripts executed. You have this one duration, you want to break it down, and there's no real effective way of doing that with a performance measure at the moment. What spans allow you to do is say: okay, this is the duration it took to update this chart. Then I can have a second span underneath that says this is the duration it took to go and get the data for the chart, and this is the duration it took to process and display the data once I'd got it. So it gives you a way of going from that high level and drilling down, just like looking at the performance panel in DevTools, where you can see how long a parent task takes, and then how long its nested children underneath take.
47:47
Cliff Crocker
And I think the important thing there, too, is that it allows you to trace between different systems. When you have the W3C trace header for the span, that allows you to link it to other things that are happening on the backend, as well as carry other attributes on the span itself that you wouldn't get from a…
48:09
Cliff Crocker
A performance measure, or other duration metrics we look at in WebPerf. The other thing I wanted to call out about what I think is great about OpenTelemetry is that it helps us fill a lot of gaps. As several of us on the call know, the browsers aren't always at parity in the things we can measure and the APIs they support. Without support for Largest Contentful Paint and INP in WebKit until the last year or two, there really wasn't a great way to look at a cross-browser custom metric. Yes, there was User Timing, and we could use and rely on it, but it's harder to get the information you're looking for when those browser standards don't exist. So OpenTelemetry gives us a lot of coverage in that way as well.
49:21
Cliff Crocker
Yeah, well, one, I would point us to the resources at Embrace that have quite a bit of depth, and I believe there's a post going out tomorrow from Mr. Freeze, who's actually supporting the OpenTelemetry Web SDK at Embrace, so please tune in for that to learn a little bit more. There's a vast amount we could talk about with OpenTelemetry, but the thing to take away is that it's an open standard for measuring things and getting telemetry that's universal across a distributed system, and not specific to a given vendor. Andy, you might want to expand a little more on that, but…
50:01
Andy Davies
Yeah, I think the way I always consider it: we collect data, but ultimately it's not our data; we're just guardians of it. OpenTelemetry is a way of exchanging that data using an open format. We can collect data using OpenTelemetry and have it in Embrace. People can use our OpenTelemetry-compatible Web SDK to collect data and ingest it into their own systems, or into third-party vendors that also support OpenTelemetry, and we can also forward data on to third parties using OpenTelemetry. It's a standard for exchanging data about performance, is how I would describe it, and it can be as lightweight or as rich as you want to make it.
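To give a feel for what that "open format" looks like in practice, this is roughly the shape of a single span on the wire in OTLP/JSON (abridged; the service name, ids, and timestamps here are invented for illustration):

```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "storefront-web" } }
      ]
    },
    "scopeSpans": [{
      "spans": [{
        "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "spanId": "00f067aa0ba902b7",
        "name": "update-chart",
        "startTimeUnixNano": "1714550000000000000",
        "endTimeUnixNano": "1714550000420000000"
      }]
    }]
  }]
}
```

Any backend that speaks OTLP can ingest a payload like this, which is what makes the data portable between vendors.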
51:24
Cliff Crocker
Boy, that's a loaded one. We're obviously fans of Vitals. All of us have seen what they've done for getting visibility into performance across the industry, but they are absolutely not the be-all and end-all of measures. We just talked about using spans to measure the things you care about, the things that you as an application owner know you want to instrument and get timing around, which is probably first and foremost the thing to think about. There's a lot of other great stuff out there, too. One example from yesterday: when we were looking at the Nike use case, the LCP for the poster image needed improvement, but what happened after that was probably even more disturbing, and it wouldn't necessarily be uncovered by LCP. It might be uncovered by something like Element Timing, which is out there and supported in Chrome.
52:16
Cliff Crocker
As well as Container Timing, which is coming about and, I believe, is in an origin trial now within Chrome. So there are lots of means for collecting data, especially when you think about how that data is being perceived by an end user, that are available to us today that we haven't had before.
52:34
Cliff Crocker
I would suggest Core Web Vitals is a great starting point. From there, I think you need to listen to your users to understand what you actually need to focus on and measure. The last thing I would say is that everything we've talked about with Vitals and other things tends to be user-experience focused, and that's because of our focus as an industry.
52:53
Cliff Crocker
That being said, there are amazing sets of diagnostic metrics that support those user experience metrics, and we can't forget about them. Time to First Byte, for example: we technically call it a vital, and it's still one of the biggest contributors that we see. Other things, like page weight and page bloat: Tammy does a great post and wrap-up every year on where we're at with page bloat these days. It's about looking not just at durations, but at how big and bloated the things we're creating are when it comes back to page construction. So there are a number of other metrics we could get into. That goes a little bit counter to our desire to focus the conversation as an industry and as a community on Core Web Vitals, but we can't leave those metrics behind, and we can't ignore the fact that there's not always a one-size-fits-all metric, no matter how hard we try.
53:47
Andy Davies
Yeah, I'm particularly excited about Container Timing. Looking at the slide deck, you voluntold me to propose it to the W3C, and Jason Williams of Bloomberg and Jose Famigale did the real work on it. But it gets us to the point where we can time components. The web is built of components, whether it's a product card or a dialog, and the ability to time from when a component starts to be rendered to when it's
54:20
Andy Davies
completed, and all the pieces are in place, I think is going to be really powerful.
54:52
Cliff Crocker
I would say, first off, we do have a light mode, which is certainly helpful when we're looking at things and presenting things. But we haven't really done ourselves many favors as a community when we've talked about Core Web Vitals and used red, yellow, green as the thing we're cued into. I love this line of thinking, and that's something we'd love to get feedback on: what would those alternative colors be? So, Mr. Grisby, to turn it back over to you, we'd love to hear feedback about what we could do, because it's a great thing to think about, and oftentimes we don't think about
55:27
Cliff Crocker
color blindness and the way people perceive the product, because we get so locked into a default. So, great suggestion. We don't have an alternative color scheme for Core Web Vitals today. Let's think of one.
55:54
Andy Davies
My parting advice would be: if you go to dash.embrace.io/signup, you can sign up for Web RUM, instrument your site, and we start collecting data. The big thing I would encourage people to do, which often doesn't get done immediately, is, if I just scoot in here, define your page groups. You can do that via settings down here. Grouping pages of the same type together makes a huge difference. We talk to some customers who haven't done it yet. For example, for a retailer, the product page would be the most visited page type, but because they haven't grouped it and they've got 50,000 SKUs, every individual product page shows something like 0.01% of the traffic, whereas when you group them together, it becomes 60%.
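The grouping Andy describes can be pictured as a small set of URL rules, so thousands of product URLs collapse into one page type. The patterns and group names below are made up for illustration; in the product you'd configure page groups in settings:

```javascript
// Illustrative page grouping: map URL paths to page types so 50,000
// product URLs show up as one group instead of 50,000 rows.
const rules = [
  { group: 'Product Detail', pattern: /^\/product\/[^/]+$/ },
  { group: 'Category',       pattern: /^\/category\/[^/]+$/ },
  { group: 'Checkout',       pattern: /^\/checkout(\/|$)/ },
  { group: 'Home',           pattern: /^\/$/ },
];

function pageGroup(path) {
  const rule = rules.find((r) => r.pattern.test(path));
  return rule ? rule.group : 'Other';
}

console.log(pageGroup('/product/sku-48211'));  // 'Product Detail'
console.log(pageGroup('/product/sku-99017'));  // 'Product Detail' — same group
console.log(pageGroup('/checkout/payment'));   // 'Checkout'
console.log(pageGroup('/about'));              // 'Other'
```

Every SKU lands in the same "Product Detail" bucket, which is what turns a long tail of 0.01% pages into one page type carrying most of the traffic.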
57:03
Andy Davies
Sign up for Web RUM, install the script, start collecting the data, and group your pages together. That's how I would start.
57:57
Andy Davies
Thanks, everyone. Bye.
Key takeaways
01
RUM is your forest, not your field. Synthetic gives you a controlled lab. RUM gives you the variety of real users on real devices on real networks. Start at the canopy with a Web Vitals overview, then drill down to a single user's session timeline. That's what makes RUM actionable instead of overwhelming.
02
Don't skip page groups. Without grouping, a retailer with 50,000 SKUs sees every product page as 0.01% of traffic, and the biggest problems hide in the long tail. Group them, and that becomes 60% of traffic showing up as one clear pattern you can act on.
03
Not all exceptions deserve your attention. Severity score tells you which ones do. Embrace's severity score weighs handled vs. unhandled, first- vs. third-party, and proximity to session drop-off, so engineering teams focus on exceptions actually impacting users instead of chasing console noise.
04
OpenTelemetry on the front end closes the loop from user experience to backend root cause. Spans extend what you can measure beyond browser APIs. Traces stitch front-end experience to backend services through W3C trace headers. Instrument once, keep your data portable.
05
Release health ties code changes to user experience changes. Instead of waiting for an alert or complaint after a deploy, compare versions side by side, see how the release is performing for real users, and drill into spans to find what regressed.
In this session
Cliff Crocker
VP of Product
Andy Davies
Sr. Product Manager