Data Visualization Makeover

This post is a little embarrassing.

I’m going to share some not-so-pretty (and not-so-effective) graphics that I have made…and presented…multiple times. Then I’ll walk you through the steps I took to create new and improved versions using my seven elements of good data visualization.

First, about the data. These data were collected by the Nebraska On-Farm Research Network. Farmers working with the network were trying to determine the optimum planting rate (population) for soybeans. Each farmer picked several planting rates (treatments), replicated and randomized these treatments, and recorded the yields at the end of the growing season. Their goal was to determine which planting rate maximized yield and, more importantly, profit.

Initially, this is how I presented the data, circa 2015.

Original presentation of soybean population data.
Original graph

Frightening, right?

When presenting the data in live presentations, I broke the first monster graph down into a series of three slides, one for each year.

Original graph of 2006 data
Original graph of 2007 data
Original graph of 2008 data

Can you determine what the “take away” message of these graphs is?

My intended message was that research results had shown very little yield increase as soybean seeding rate increased, within the range of seeding rates we tested.

Were you able to come up with that message?

Overall, I think these graphs are much more difficult to interpret than necessary.

I set out to give these graphs a makeover for a 2017 presentation, using the seven elements of data visualization that I presented in my previous eXtension blog post. The following steps illustrate my thought process in improving this graph.

Step 1: What is the point?

After thinking through my point, I boiled down my message to this: “Soybean yields increased very minimally as seeding rates increased above 90,000 seeds per acre.” A second important point was: “The lowest seeding rate was most economical because the increase in yield realized by increasing seeding rates did not offset the increase in seed cost.”

Stating this message explicitly provided much-needed direction – and helped me determine what was not important. Because I was concerned with the overall pattern in the results, the precise location and year of each study were not important.

Step 2: Choosing the right chart

Scatterplots can be an excellent way to show a relationship between two things (like soybean planting rates and yield). Connecting lines imply the continuity of the data and can allow us to compare multiple series of the data (in this case, research sites). I wanted to show a lack of difference, so I started experimenting with plotting the data this way.

Displaying the data in a scatterplot with connecting lines
Scatterplot with connecting lines

This starts to communicate the data better and gets all the information onto one manageably sized graph, but it is still busy. The legend is extensive and doesn’t provide needed information. The y-axis also goes from 35 to 80 (rather than 0 to 80), which is misleading because it makes the differences appear greater than they are.
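To put a number on that distortion, here is a minimal Python sketch. The function name `axis_share` and the 5-unit yield difference are illustrative assumptions, not values from the study. It computes how much of the axis height a given difference occupies under each axis range.

```python
def axis_share(difference, y_min, y_max):
    """Fraction of the y-axis height taken up by a given difference."""
    return difference / (y_max - y_min)

# Hypothetical 5-unit yield difference, used only for illustration.
diff = 5
truncated = axis_share(diff, 35, 80)  # y-axis from 35 to 80
full = axis_share(diff, 0, 80)        # y-axis from 0 to 80

print(f"35-80 axis: {truncated:.1%} of the axis height")
print(f"0-80 axis:  {full:.1%} of the axis height")
print(f"Exaggeration factor: {truncated / full:.2f}x")
```

Whatever difference you plug in, the ratio is 80/45, so the truncated axis draws every gap roughly 1.8 times taller than the zero-based axis does.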

Step 3: Less is more

Here I have removed the legend (and unnecessary site and year designations) and reset the y-axis to 0 to 80.

Modified scatterplot with y-axis from 0 to 80 and legend removed

At this point, after consultation with a statistician, I decided to include only the sites where the same four planting rates of 90,000, 120,000, 150,000, and 180,000 seeds/acre were tested. This provided a better fit for the data and let us perform a more appropriate statistical test. I also updated the data to include data that had been collected in 2016 from three additional sites. Since these sites tested planting rates of only 90,000 to 180,000, I updated the x-axis to include only this range.
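The site selection described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the site names and yields are made-up placeholders, the helper `sites_with_common_rates` is hypothetical, and "the same four planting rates" is read as "exactly these four rates."

```python
# The four seeding rates (seeds/acre) named in the text.
COMMON_RATES = {90_000, 120_000, 150_000, 180_000}

# Hypothetical site records: site name -> {seeding rate: yield}.
# These names and yields are placeholders, not the network's data.
sites = {
    "Site A": {90_000: 60.0, 120_000: 61.0, 150_000: 61.5, 180_000: 62.0},
    "Site B": {100_000: 55.0, 130_000: 56.0, 160_000: 56.5},
    "Site C": {90_000: 70.0, 120_000: 70.5, 150_000: 71.0, 180_000: 71.0},
}

def sites_with_common_rates(records, rates=COMMON_RATES):
    """Keep only sites that tested exactly the given set of rates."""
    return {name: ylds for name, ylds in records.items() if set(ylds) == rates}

kept = sites_with_common_rates(sites)
print(sorted(kept))  # names of the sites that remain in the chart
```

To keep sites that tested at least these four rates (plus possibly others), the comparison could be changed to `rates <= set(ylds)`.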

Scatterplot with adjusted x-axis.

Now the trend is starting to be more evident.

Step 4: Use color intentionally

Since color is no longer connected to site and year designations in a legend, it can be used to emphasize other information. From presenting the data in the past, I knew that people often asked if the trends shown by the data were true for both non-irrigated and irrigated conditions. For this reason, I chose to use color to designate irrigated sites (blue) and non-irrigated sites (orange). I also made the gridlines and axis numbering a lighter grey and less prominent, since we are concerned with the overall trend rather than exact values. Already, this graph is much less overwhelming to look at.

Graph using color to designate irrigation
Use color to designate irrigation

I needed to designate what the orange and blue colors were indicating. Instead of a separate legend, I included text in colors that coordinated with the colored lines on the graph. This strategy allows viewers to get almost all the information they need immediately when they are looking at the graph rather than having to look back and forth between a legend and the chart.

Chart leveraging consistent color for labeling
Leverage consistent color for labeling
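For readers who build charts in code, the choices so far (zero-based y-axis, no legend, color by irrigation, muted gridlines, color-matched text labels) might look something like the following matplotlib sketch. It is not how the original graphs were made; the data are hypothetical placeholders, and the label positions and output filename are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

RATES = [90_000, 120_000, 150_000, 180_000]  # seeds/acre, from the text

# Hypothetical placeholder data: (irrigated?, yields at each rate).
sites = [
    (True,  [70.0, 70.5, 71.0, 71.0]),
    (True,  [66.0, 67.0, 67.0, 67.5]),
    (False, [55.0, 55.5, 56.0, 56.0]),
    (False, [60.0, 61.0, 61.5, 62.0]),
]

COLORS = {True: "tab:blue", False: "tab:orange"}

fig, ax = plt.subplots(figsize=(8, 5))
for irrigated, yields in sites:
    ax.plot(RATES, yields, marker="o", color=COLORS[irrigated])

ax.set_ylim(0, 80)  # start the y-axis at zero
ax.set_xticks(RATES)
ax.set_xlabel("Seeding rate (seeds/acre)")
ax.set_ylabel("Yield")

# Push gridlines and axis numbering into the background.
ax.grid(True, color="lightgrey", linewidth=0.5)
ax.tick_params(colors="grey")
for spine in ax.spines.values():
    spine.set_color("lightgrey")

# Color-matched text labels take the place of a legend.
ax.text(RATES[0], 76, "Irrigated", color=COLORS[True], fontweight="bold")
ax.text(RATES[0], 48, "Non-irrigated", color=COLORS[False], fontweight="bold")

fig.savefig("soybean_yield.png", dpi=150)
```

Because no `label=` arguments are passed and `ax.legend()` is never called, the chart has no legend box; the two colored text labels carry that information instead.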

Step 5: Create pointed titles and call out key points with text

At this point the trend is fairly obvious and there is room to add in the average statistics. I used a black line and a larger font to make the average more prominent directly on the graph. I noted actual values for the average statistic only. (Think how cluttered it would be to show values on the chart for every data point.) Showing the values only for the average communicates the important information – the overall trend.
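As a rough illustration of the averaging behind that black line, the sketch below takes a simple mean across sites at each seeding rate. The yields are hypothetical placeholders, the helper `average_by_rate` is made up for this example, and a simple mean is an assumption; the averages on the actual chart may have come from a different statistical procedure.

```python
from statistics import mean

# Hypothetical yields per site at each seeding rate (seeds/acre).
# Placeholder numbers for illustration only, not the study's results.
site_yields = {
    "Site A": {90_000: 60.0, 120_000: 61.0, 150_000: 61.5, 180_000: 62.0},
    "Site C": {90_000: 70.0, 120_000: 70.5, 150_000: 71.0, 180_000: 71.0},
}

def average_by_rate(records):
    """Average yield across sites for each seeding rate."""
    rates = sorted({rate for ylds in records.values() for rate in ylds})
    return {
        rate: mean(ylds[rate] for ylds in records.values() if rate in ylds)
        for rate in rates
    }

averages = average_by_rate(site_yields)
for rate, avg in averages.items():
    print(f"{rate:,} seeds/acre: {avg:.1f}")
```

Only these four averages would be labeled on the chart; the individual site values stay unlabeled so the overall trend remains the focus.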

Add average statistic

The last step was adding a title. Rather than a boring, uninformative title like “Yield versus planting rate for soybeans in Nebraska,” I tried to bring my main point home with the title on the graph below.

Add an informative title

The final addition I made was to include source information on the bottom right. This gives credibility, lets people know where to find more info, and, in the case of data you have collected yourself, is a great way to promote Extension or your university.

Step 6: Get feedback and iterate

As I presented these data at winter meetings, common questions were “What were the average final stands for each planting population?” and “How many soybeans do I need to have at the end of the year?” To address these questions, I created an iteration that displayed this information alongside the planting populations. This version seems a little more cluttered to me, but I think it is worth it since that information was being requested.

Iteration based on audience questions

As you may recall, a secondary objective was to communicate that increasing the soybean seeding rate did not pay off economically: the small yield increase did not offset the added seed cost. Rather than create a separate graphic for this objective, I used the yield data in this graphic to demonstrate the very small yield increase and then, when presenting, provided a second slide with profit calculations for each rate. This worked out well as a discussion slide.
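A profit comparison of this kind can be sketched as revenue minus seed cost at each rate. Everything numeric below is a hypothetical placeholder chosen only to show the arithmetic: the grain price, the seed cost, and the average yields are assumptions, not figures from the study or from the slide.

```python
# All numbers below are hypothetical placeholders, not the study's values.
GRAIN_PRICE = 9.00         # assumed price per unit of yield
SEED_COST_PER_1000 = 0.40  # assumed cost per 1,000 seeds

# Assumed average yield at each seeding rate (seeds/acre).
avg_yield = {90_000: 65.0, 120_000: 65.75, 150_000: 66.25, 180_000: 66.5}

def partial_profit(rate, yld, price=GRAIN_PRICE, seed_cost=SEED_COST_PER_1000):
    """Revenue minus seed cost, per acre, for one seeding rate."""
    return yld * price - (rate / 1000) * seed_cost

profits = {rate: partial_profit(rate, yld) for rate, yld in avg_yield.items()}
best_rate = max(profits, key=profits.get)

for rate, profit in profits.items():
    print(f"{rate:,} seeds/acre: {profit:.2f} per acre")
print(f"Most profitable rate under these assumptions: {best_rate:,}")
```

With a nearly flat yield response like this one, each step up in seeding rate adds more in seed cost than it returns in grain, so the lowest rate comes out ahead. Different prices or yields could change that ranking.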

Before and After

Here are the completed “before and after” graphics. What do you think? Does this graphic better communicate the main point? What changes would you make? Let me know in the comments!