Data Visualization

Overview

There are two main reasons for creating visuals using data:

  1. Exploratory analysis is done when you are searching for insights. These visualizations don't need to be polished or aesthetically appealing: you are the consumer of these plots, and they only need to let you answer your own questions.

  2. Explanatory analysis is done when you are presenting your results to others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

  1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

  2. Clean - Here we could use exploratory visuals.

  3. Explore - Here we use exploratory visuals.

  4. Analyze - Here we might use either exploratory or explanatory visuals.

  5. Share - Here is where explanatory visuals live.

Measurement

Qualitative or categorical types (non-numeric types)

  • 1. Nominal data: pure labels without inherent order (no label is intrinsically greater or less than any other)

  • 2. Ordinal data: labels with an intrinsic order or ranking (comparison operations can be made between values, but the magnitude of differences is not well-defined)

Quantitative or numeric types

  • 3. Interval data: numeric values where absolute differences are meaningful (addition and subtraction operations can be made)

  • 4. Ratio data: numeric values where relative differences are meaningful (multiplication and division operations can be made)

All quantitative-type variables also come in one of two varieties: discrete and continuous.

  • Discrete quantitative variables can only take on a specific set of values at some maximum level of precision.

  • Continuous quantitative variables can (hypothetically) take on values to any level of precision.
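
As a rough illustration of which operations each level of measurement supports, here is a minimal Python sketch. The shirt sizes and temperatures are hypothetical values chosen for the example. Celsius stands in for interval data (its zero point is arbitrary) and Kelvin for ratio data.

```python
# Ordinal: labels with an order, stored here as positions in a list.
SHIRT_SIZES = ["small", "medium", "large"]  # hypothetical ordinal scale


def is_larger(a, b):
    """Compare two ordinal labels by rank; the size of the gap is not meaningful."""
    return SHIRT_SIZES.index(a) > SHIRT_SIZES.index(b)


def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature to the ratio-scale Kelvin equivalent."""
    return celsius + 273.15


morning_c, noon_c = 10.0, 20.0  # hypothetical temperatures in Celsius

# Interval data: subtraction is meaningful (noon is 10 degrees warmer).
difference = noon_c - morning_c

# Dividing Celsius values gives 2.0, but noon is not "twice as hot".
celsius_ratio = noon_c / morning_c

# Ratio data: on the Kelvin scale the ratio is meaningful, and far from 2.
kelvin_ratio = celsius_to_kelvin(noon_c) / celsius_to_kelvin(morning_c)

print(is_larger("large", "small"), difference, celsius_ratio, round(kelvin_ratio, 3))
```

The two ratios disagree because only the Kelvin scale has a true zero, which is what makes multiplication and division meaningful for ratio data.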

Visuals

Experts and researchers have determined the types of visual patterns that allow humans to best understand certain information. In general, humans are able to best understand data encoded with positional changes (differences in x- and y-position, as we see with scatterplots) and length changes (differences in box heights, as we see with bar charts and histograms).

In contrast, humans struggle with understanding data encoded with color hue changes (which are unfortunately common as an additional variable encoding in scatter plots; we'll study this in upcoming concepts) and area changes (as we see in pie charts, which often makes them a poor plot choice).

Chart junk

Chart junk refers to all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph or that distract the viewer from this information.

Examples of chart junk you saw in this video include:

  1. Heavy grid lines

  2. Unnecessary text

  3. Pictures surrounding the visual

  4. Shading or 3D components

  5. Ornamented chart axes

Data-ink ratio

The data-ink ratio, credited to Edward Tufte, is directly related to the idea of chart junk. The more of the ink in your visual that is related to conveying the message in the data, the better.

Limiting chart junk increases the data-ink ratio.
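
For reference, the ratio is usually written as a simple fraction. The wording of the terms below follows the common statement of Tufte's definition rather than anything shown in this lesson:

```latex
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
```

Removing chart junk shrinks the denominator while leaving the numerator unchanged, which is why the ratio rises.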

Design Integrity Notes

When you build plots, it is key to maintain the integrity of the underlying data.

One of the main ways discussed here for looking at data integrity was with the lie factor. The lie factor measures the degree to which a visualization distorts or misrepresents the data values being plotted. It is calculated in the following way:

\text{lie factor} = \frac{\Delta\text{visual}/\text{visual}_{\text{start}}}{\Delta\text{data}/\text{data}_{\text{start}}}

The delta symbol (\Delta) stands for difference or change. In words, the lie factor is the relative change shown in the graphic divided by the actual relative change in the data. Ideally, the lie factor should be 1: any other value means that there is some mismatch in the ratio of depicted change to actual change.

Lie Factor in the Video

The largest image covers 79,000 pixels and the smallest covers 16,500, while the values they represent are 27% and 12%. So, the lie factor is calculated as:

\text{lie factor} = \frac{(79000-16500)/16500}{(27-12)/12} \approx 3.03
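
The same calculation can be sketched in Python. The function and parameter names here are illustrative choices, not part of the lesson. The sketch treats the smaller value as the starting point, as the example does.

```python
def relative_change(start, end):
    """Return the change from start to end as a fraction of start."""
    if start == 0:
        raise ValueError("relative change is undefined when start is 0")
    return (end - start) / start


def lie_factor(visual_start, visual_end, data_start, data_end):
    """Relative change shown in the graphic divided by relative change in the data."""
    data_change = relative_change(data_start, data_end)
    if data_change == 0:
        raise ValueError("lie factor is undefined when the data does not change")
    return relative_change(visual_start, visual_end) / data_change


# Values from the example: pixel counts 16,500 -> 79,000, data 12% -> 27%.
print(round(lie_factor(16500, 79000, 12, 27), 2))  # prints 3.03
```

A graphic whose depicted change matches the data, such as `lie_factor(1, 2, 1, 2)`, gives exactly 1.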

Examples

Data Integrity problem. The green slice here looks larger than the purple slice due to the 3D nature of this plot, but the percentages suggest otherwise.

Updated Visual

The same data presented in the image above is recorded in the spreadsheet here. This data was used to create the following visual; the next question asks how this plot could be improved.

Personal preferences always play a role, but the visual below is a good visualization of this data from a design standpoint, following the principles of:

  1. reducing chart junk,

  2. maintaining a high data-ink ratio,

  3. maintaining data integrity, and

  4. using length to show changes and differences rather than areas.

Color

Color can both help and hurt a data visualization. Here are three tips for using color effectively:

  1. Before adding color to a visualization, start with black and white.

  2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.

  3. Color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

Color blindness

To be sensitive to those with colorblindness, avoid color palettes that move from red to green unless another element, such as shape, position, or lightness, also distinguishes the change. Both of these colors appear with a yellow tint to individuals with the most common types of colorblindness. Instead, use colors on a blue-to-orange palette.
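
One way to build a blue-to-orange palette is to interpolate between two endpoint colors through a light midpoint, so that lightness changes along with hue. The sketch below uses only the Python standard library. The three hex colors and the function names are assumptions made for illustration, not recommendations from the lesson.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of integers to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def interpolate(start_hex, end_hex, t):
    """Linearly blend two colors; t=0 gives start_hex, t=1 gives end_hex."""
    start, end = hex_to_rgb(start_hex), hex_to_rgb(end_hex)
    return rgb_to_hex(tuple(round(a + (b - a) * t) for a, b in zip(start, end)))


def diverging_palette(low_hex, mid_hex, high_hex, steps_per_side):
    """Return colors running low -> mid -> high, with the midpoint listed once."""
    low_side = [interpolate(low_hex, mid_hex, i / steps_per_side)
                for i in range(steps_per_side)]
    high_side = [interpolate(mid_hex, high_hex, i / steps_per_side)
                 for i in range(steps_per_side + 1)]
    return low_side + high_side


# Hypothetical endpoint and midpoint colors chosen for this example.
BLUE, LIGHT_GRAY, ORANGE = "#1f77b4", "#f7f7f7", "#ff7f0e"
palette = diverging_palette(BLUE, LIGHT_GRAY, ORANGE, 3)
print(palette)
```

Blending in RGB is the simplest approach but is not perceptually uniform, so a palette built this way should still be checked visually before use.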

Shape, size and other tools

As seen earlier in the lesson, we typically try to use position on the x- and y-axes to encode, or depict, the values of variables. If we have more than two variables, however, we have to start considering other visual encodings for the additional variables.

In general, color and shape are best for categorical variables, while the size of marker can assist in adding additional quantitative data, as we demonstrated here.

Only use these additional encodings when absolutely necessary. Often, overuse of these additional encodings suggests you are providing too much information in a single plot. Instead, it might be better to break the information into multiple individual messages, so the audience can understand every aspect of your message. You can also build in each aspect one at a time, which you saw in the previous lesson with Hans Rosling. This feels less overwhelming than seeing the whole plot at once.
