From the course: Excel Data Visualization: Mastering 20 Charts and Graphs

Scatter plots

- [Instructor] All right, let's talk Scatter Plots. Scatter Plots are a great way to explore correlations or relationships between series. So, for instance, if you want to know if Major League Baseball players who hit more home runs are more likely to earn higher salaries, Scatter Plots is the perfect way to address that question. Other examples could be measuring the relationship between ice cream sales and average temperature, looking at hours of television watched by age. Literally any two data sets that you want to test for correlation, you can use a Scatter Plot to do so. Pro Tips here, first and foremost, add a trendline or line of best fit to show correlation between variables. You can also add an R-squared value, like the ones in these screenshots, to really quantify how strong that correlation really is. Second to the last but not least, remember that correlation does not imply causation. What that means is that two variables or two data sets might move in the same direction and might seem very tightly correlated, but that does not indicate that one causes the other. Here's a perfect example. We see a very close correlation between ice cream sales and deaths by drowning. Now, ask yourself, do you think ice cream sales are actually causing drowning deaths? Of course not. They're both associated with higher temperatures and summer seasons so both tend to reflect the same seasonality. With that, I've got a great data set in Excel that we can use to build some of these Scatter Plots. So, let's go ahead and open up our workbook. All right, so here we are on the Scatter Plots and Bubble Charts tab. And the data set we're looking at here is Major League Baseball data by year, by team. I've got a number of metrics here, Home Runs, Runs Scored, Run Differential, which is the difference between Runs Allowed and Runs Scored, at Wins, Losses, and Team Salary, and got data going all the way down from 2000 to 2015. So, this will give us a really nice sample data set to build some Scatter Plots. Now, the first thing that one might hypothesized is that the number of Runs Scored may be highly correlated with the number of Wins that the team scores in a given season. So, let's go ahead and try to plot that with the Scatter Plot. I can select all of Column G, Ctrl click all of Column K. RS is my Runs Scored column, W is my Wins column. If I can go to Insert, my Scatter Plots are right down here. So, insert just a regular Scatter Plot and drag it over right next to our data set. So, it's a little bit messy right now. We don't have axis labels, the title just says W, so we'll need to change some things here. First off for the title, what we're looking at here, all these individual points are individual teams. So, I'm going to title this MLB Teams 2000 through 2015. Next step I want to add some axis labels just to make it clear. So, Primary Vertical, this is my Wins axis. And then I'm going to add a second one for my Primary Horizontal, this is Runs Scored. Now, obviously, it's tough to get an insight of this because it's just kind of a little cloud of data. That's because both of our axis are extending all the way to a minimum of zero. So, what we can do here is format the axis. We know that there is no team that's scored fewer than 450 runs in a given year. So, we can use that as a minimum. That really just helps span the picture and see things a bit more clearly. In similar case with Wins, we can set a minimum to 40, and now we get kind of a better picture of what this cloud looks like, and what the relationship might look like. So, if I right-click the data series, I can add a trendline here. It's kind of tough to read so I'm going to format this. Fill option, I'm going to make it a solid line. This orange is fine. I'll just make the width 3 points so it jumps out a little bit, and also format my data series on all these individual points. And the options look kind of minimal here. If you try to mess with Line options, obviously that's bad news. These are really just points or markers, so you need to go into your Marker Options. And I don't want any borders on these, I want to fill them with just a lighter shade of blue so they're not as distracting compared to the trendline. So, now with the trendline, I'll just go ahead and display that R-squared value. I'll make that bold so it jumps out a little bit and move it into some free space. There you go. What this is telling me is that the correlation between Runs Scored and Wins is not very strong. It's only an R-squared value of .26. So what if we test something else. If you know baseball, you might be thinking, Well, some teams are going to score a lot of runs, but if they also allow a lot of runs, they're not going to win a lot of games. So, Wins are more likely a function of both Runs Scored and Runs Allowed. So, why don't we try building a Scatter on Run Differential instead of just Runs Scored. So, same process here, I'm going to select I and select K with Ctrl. I'm going to insert another Scatter here. You can already see that there is a tighter relationship here between these two variables. It's a little wonky because that Run Differential field obviously has positive and negative values. So, the best way to combat that is to right-click, format the axis. And this area that says Vertical Axis Crosses, right now, it's Automatic, set to cross at zero. I can manually say no. Let's have that cross at negative 400 and it just moves it right on over. So, this looks great. Again, I'm going to change the title here to the same thing. MLB Teams 2000 through 2015. Perfect. Just for the sake of being consistent, why don't we add axis titles again. Vertical is, again, Wins and horizontal this time isn't Runs Scored, it's Run Differential. So, let's right-click the series. Let's add a trendline and let's do just what we did before, and make it a solid orange line, three-point width, and I'll also update my data series. So, I want to change the Marker Options. No border, solid fill. So, there you go. That's basically the same Scatter Plot. The only difference is that I'm showing Runs Scored against Wins on the top one and Run Differential against Wins on the bottom. So, why don't we go ahead and add that R-squared value here. And as you can see, R-squared is .875, which is a much, much tighter fit. So, there you go. That's two examples of building Scatter Plots to show the relationships between the two series of data.

Contents