# Data Visualization

### Overview

There are two main reasons for creating visuals using data:

1. **Exploratory** analysis is done when you are searching for insights. These visualizations don't need to be perfect or aesthetically appealing: you are the consumer of these plots, and you only need to be able to answer your own questions from them.<br>
2. **Explanatory** analysis is done when you are presenting your results to others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

1. **Extract** - Obtain the data from a spreadsheet, SQL, the web, etc.<br>
2. **Clean** - Here we could use expl**or**atory visuals.<br>
3. **Explore** - Here we use expl**or**atory visuals.<br>
4. **Analyze** - Here we might use either expl**or**atory or expl**an**atory visuals.<br>
5. **Share** - Here is where expl**an**atory visuals live.

### Measurement

**Qualitative or categorical types (non-numeric types)**

* **1. Nominal data**: pure labels without inherent order (no label is intrinsically greater or less than any other)
* **2. Ordinal data**: labels with an intrinsic order or ranking (comparison operations can be made between values, but the magnitude of differences is not well-defined)

**Quantitative or numeric types**

* **3. Interval data**: numeric values where absolute differences are meaningful (addition and subtraction operations can be made)
* **4. Ratio data**: numeric values where relative differences are meaningful (multiplication and division operations can be made)
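
The four levels build on each other: each one keeps the operations of the levels before it and adds one more. A minimal Python sketch of that idea follows. The names `Level`, `NEW_OPERATIONS`, and `meaningful_operations` are made up for illustration, and equality comparison for nominal data is an assumption added here, since the list above only says nominal labels have no order.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of measurement, from least to most structure."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations that first become meaningful at each level.
NEW_OPERATIONS = {
    Level.NOMINAL: ["equality (==, !=)"],  # assumption: labels can still be matched
    Level.ORDINAL: ["ordering (<, >)"],
    Level.INTERVAL: ["addition and subtraction (+, -)"],
    Level.RATIO: ["multiplication and division (*, /)"],
}


def meaningful_operations(level):
    """Return every operation that is meaningful at the given level."""
    ops = []
    for lower in Level:  # iterates in definition order
        if lower <= level:
            ops.extend(NEW_OPERATIONS[lower])
    return ops


print(meaningful_operations(Level.ORDINAL))
```

For example, temperature in degrees Celsius is usually treated as interval data: a 10-degree difference is meaningful, but saying 20 °C is "twice as hot" as 10 °C is not.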

All quantitative-type variables also come in one of two varieties: **discrete** and **continuous**.

* **Discrete** quantitative variables can only take on a specific set of values at some maximum level of precision.
* **Continuous** quantitative variables can (hypothetically) take on values to any level of precision.

![](/files/-Lhq_jU-6btDndTJjSfV)

### Visuals

Experts and researchers have determined the types of visual patterns that allow humans to best understand certain information. In general, humans are able to *best* understand data encoded with **positional changes** (differences in x- and y-position as we see with scatterplots) and **length changes** (differences in box heights as we see with bar charts and histograms).

Alternatively, humans *struggle* with understanding data encoded with **color hue changes** (unfortunately common as an additional variable encoding in scatter plots; we'll study this in upcoming concepts) and **area changes** (as we see in pie charts, which is why they are often not the best plot choice).
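
Area encodings are also easy to get wrong when drawing them. As a rough illustration (not from the course material), the sketch below compares two ways of sizing circular markers for two hypothetical data values, one twice the other. Scaling the radius with the value makes the area grow four times, while scaling the radius with the square root of the value keeps the area proportional to the data.

```python
import math


def circle_area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2


def radius_for_value(value):
    """Radius of a circle whose area equals `value`."""
    return math.sqrt(value / math.pi)


small, large = 10, 20  # hypothetical data values; large is 2x small

# Naive scaling: radius proportional to the value.
naive_ratio = circle_area(large) / circle_area(small)

# Area-proportional scaling: radius proportional to sqrt(value).
fair_ratio = circle_area(radius_for_value(large)) / circle_area(radius_for_value(small))

print(f"naive area ratio: {naive_ratio:.1f}")  # about 4.0
print(f"fair area ratio: {fair_ratio:.1f}")    # about 2.0
```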

**Chart junk**

Chart junk refers to all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph or that distract the viewer from this information.

Examples of chart junk include:

1. Heavy grid lines
2. Unnecessary text
3. Pictures surrounding the visual
4. Shading or 3D components
5. Ornamented chart axes

**Data-ink ratio**

The **data-ink ratio**, credited to Edward Tufte, is directly related to the idea of chart junk. The more of the ink in your visual that is related to conveying the message in the data, the better.

Limiting chart junk increases the data-ink ratio.
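
Written out, Tufte's ratio is usually stated as the share of a graphic's ink that presents data. The formula below is a restatement of that definition, not something shown on this page:

$$
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
$$

A ratio close to 1 means almost all of the ink carries information; removing chart junk shrinks the denominator without touching the numerator.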

#### Design Integrity Notes <a href="#design-integrity-notes" id="design-integrity-notes"></a>

When you build plots, it is key to maintain the integrity of the underlying data.

One of the main ways to assess data integrity is the **lie factor**. The lie factor measures the degree to which a visualization distorts or misrepresents the data values being plotted. It is calculated in the following way:

![](/files/-LhqfltT3xLDEEdQWAov)

The delta symbol (Δ) stands for difference or change. In words, the lie factor is the relative change shown in the graphic divided by the actual relative change in the data. Ideally, the lie factor should be 1: any other value means that there is some mismatch in the ratio of depicted change to actual change.

#### Lie Factor in the Video <a href="#lie-factor-in-the-video" id="lie-factor-in-the-video"></a>

The lie factor shown in the video compared the largest doctor image to the smallest, measured in pixels.

[![](https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ac937a_pasted-image-0/pasted-image-0.png)](https://classroom.udacity.com/nanodegrees/nd002/parts/9f7e8991-8bfb-4103-8307-3b6f93f0ecc7/modules/1dc09d28-5703-493c-aab5-a418b8bfa3e1/lessons/a755a07d-345f-4e57-91c2-3b3b9ae7cc28/concepts/362c4186-b019-49c7-9c6f-eaeecfe64f56#)

The largest image covers about 79,000 pixels and the smallest about 16,500. The data values they represent are 27% and 12%. So, the lie factor is calculated as:

$$
\text{lie factor} = \frac{(79000-16500)/16500}{(27-12)/12} = 3.03
$$
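
The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function names `relative_change` and `lie_factor` are chosen here for illustration, and the smaller value is used as the baseline in both the graphic and the data, as in the worked example.

```python
def relative_change(baseline, other):
    """Change from `baseline` to `other`, as a fraction of `baseline`."""
    return (other - baseline) / baseline


def lie_factor(shown_baseline, shown_other, data_baseline, data_other):
    """Relative change shown in the graphic divided by relative change in the data."""
    return (relative_change(shown_baseline, shown_other)
            / relative_change(data_baseline, data_other))


# Doctor example: pixel counts of the images vs. the percentages they depict.
doctors = lie_factor(16_500, 79_000, 12, 27)
print(round(doctors, 2))  # 3.03
```

A graphic that scales exactly with its data, such as `lie_factor(10, 20, 10, 20)`, gives a lie factor of 1.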

#### Further Reading <a href="#further-reading" id="further-reading"></a>

* Flowing Data: [How to Spot Visualization Lies](https://flowingdata.com/2017/02/09/how-to-spot-visualization-lies/)

### Examples

![](/files/-LhqgFEzK6EYeUSYcWjZ)

Data Integrity problem. The green slice here looks larger than the purple slice due to the 3D nature of this plot, but the percentages suggest otherwise.

![](/files/-LhqgMmyRCyazkpo2HE1)

#### Updated Visual <a href="#updated-visual" id="updated-visual"></a>

The same data presented in the image above is recorded in the spreadsheet [here](https://docs.google.com/spreadsheets/d/1mFx04Ejvl8Wq5Os1CEyix8HvDx0Ud8xnXFeYNujENz4/edit#gid=1668843239). This data was used to create the following visual; the next question asks how this plot could be improved.

![](/files/-LhqgRvJTnCjcgro4a4n)

![](/files/-LhqgUCurIjFdxmIqqaq)

There are always personal preferences, but the visual below is a **good visualization** of this data from a design standpoint, following the principles of:

1. reducing chart junk,
2. maintaining a high data-ink ratio,
3. maintaining data integrity, and
4. using length to show changes and differences rather than areas.

![](/files/-LhqgajAQKnAurcvCnNn)

### Color

Color can both help and hurt a data visualization. Three tips for using color effectively:

1. Before adding color to a visualization, start with black and white.<br>
2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.<br>
3. Color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

**Color blindness**

To be sensitive to those with colorblindness, you should use color palettes that **do not move from red to green** without using another element, such as shape, position, or lightness, to distinguish the change. Both of these colors appear in a yellow tint to individuals with the most common types of colorblindness. Instead, **use colors on a blue to orange palette**.
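
As a rough sketch of what a blue to orange palette could look like in practice, the snippet below blends linearly between two hex colors. The endpoint colors and the helper names (`hex_to_rgb`, `rgb_to_hex`, `blend`) are illustrative assumptions, not values from the course or from any particular plotting library's API.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of ints to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def blend(start, end, steps):
    """Return `steps` colors evenly spaced from `start` to `end`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    palette = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the start color, 1.0 at the end color
        mixed = tuple(round(x + (y - x) * t) for x, y in zip(a, b))
        palette.append(rgb_to_hex(mixed))
    return palette


BLUE, ORANGE = "#1f77b4", "#ff7f0e"  # illustrative endpoints, chosen for this sketch
palette = blend(BLUE, ORANGE, 5)
print(palette)
```

Plain RGB blending is the simplest option; the intermediate colors come out fairly muted, so it is worth checking the result by eye or with a colorblindness simulator before relying on it.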

**Further Reading**

* Tableau Blog: [5 tips on designing colorblind-friendly visualizations](https://www.tableau.com/about/blog/2016/4/examining-data-viz-rules-dont-use-red-green-together-53463)

### Shape, size and other tools

As seen earlier in the lesson, we typically try to use position on the x- and y-axes to encode, or depict, the value of variables. If we have more than two variables, however, we have to start considering other visual encodings for the additional variables.

In general, **color and shape** are best for **categorical** variables, while the **size of marker** can assist in adding additional **quantitative data**, as we demonstrated here.

Only use these additional encodings when absolutely necessary. Often, overuse of these additional encodings suggests you are providing too much information in a single plot. **Instead, it might be better to break the information into multiple individual messages**, so the audience can understand every aspect of your message. You can also build in each aspect one at a time, which you saw in the previous lesson with [Hans Rosling](https://classroom.udacity.com/nanodegrees/nd098/parts/05a8ba39-eb63-426e-9b98-755883bc81d6/modules/f922e0f7-d718-4d00-a0e7-13b8498cf7d3/lessons/4535e649-760e-466d-a3e3-18440486dbd1/concepts/6719321b-ecaa-4342-bbed-63bd404408e4). This feels less overwhelming than if you just saw this plot all at once.
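
The rule of thumb in this section can be sketched as a small helper that hands out encodings in order of preference. This is only an illustration under stated assumptions: the function `assign_encodings`, the variable names in the example, and the choice to raise an error once the encodings run out are all hypothetical.

```python
def assign_encodings(variables):
    """Map each (name, kind) pair to a visual encoding.

    `kind` is "categorical" or "quantitative". The first two variables take
    the x and y positions; later ones fall back to color, shape, or size.
    """
    positions = ["x position", "y position"]
    extras = {
        "categorical": ["color", "shape"],
        "quantitative": ["marker size"],
    }
    encodings = {}
    for name, kind in variables:
        if kind not in extras:
            raise ValueError(f"unknown kind: {kind!r}")
        if positions:
            encodings[name] = positions.pop(0)
        elif extras[kind]:
            encodings[name] = extras[kind].pop(0)
        else:
            # Nothing suitable left: the plot is probably carrying too much.
            raise ValueError(f"no encoding left for {name!r}; consider splitting the plot")
    return encodings


# Hypothetical variables for a single scatter plot.
plan = assign_encodings([
    ("income", "quantitative"),
    ("life expectancy", "quantitative"),
    ("region", "categorical"),
    ("population", "quantitative"),
])
print(plan)
```

Raising an error when no encoding is left mirrors the advice above: at that point, splitting the message across several plots is likely the better choice.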


