Bar Chart
Last updated
Bar charts are for qualitative variables.
For nominal data, sort the bars by count. For ordinal data, it's best to order the bars by category.
The horizontal bar chart can be more convenient if we have a lot of categories or if the category names are long.
One thing that we might want to do with a bar chart is to sort the data in some way. Let's use value_counts for this.
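As a minimal sketch of sorting bars by count, the example below builds a small made-up DataFrame (the column name cat_var and its values are assumptions, not data from this lesson), takes the level order from value_counts, and hands it to countplot through the "order" parameter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical nominal data; "cat_var" is an assumed column name.
df = pd.DataFrame({"cat_var": ["b", "a", "c", "a", "b", "a"]})

# value_counts() returns the levels sorted from most to least frequent.
level_order = df["cat_var"].value_counts().index

# Passing that order to countplot draws the bars by decreasing count.
ax = sns.countplot(data=df, x="cat_var", order=level_order)
```

Without the "order" argument, the bars for this object-typed column would follow the order in which the values first appear.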
For ordinal data, we probably want to sort the bars in the order of the variable's levels. While we could sort the levels by frequency as above, we usually care about whether the most frequent values are at high levels, low levels, and so on. The best approach in this case is to convert the column into an ordered categorical data type. By default, pandas reads in string data as object types, and the bars will be plotted in the order in which the unique values were seen. By converting the data into an ordered type, the order of categories becomes innate to the feature, and we won't need to specify an "order" parameter each time it's required in a plot.
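The conversion to an ordered categorical type might look like the sketch below. The column name rating and its three levels are made up for illustration; the pandas calls (CategoricalDtype and astype) are standard.

```python
import pandas as pd

# Hypothetical ordinal data; the column name and level names are assumptions.
df = pd.DataFrame({"rating": ["low", "high", "medium", "low", "high", "low"]})

# List the levels from lowest to highest and mark the dtype as ordered.
rating_levels = ["low", "medium", "high"]
ordered_type = pd.CategoricalDtype(categories=rating_levels, ordered=True)
df["rating"] = df["rating"].astype(ordered_type)

# Sorting by index now follows the declared level order, not the alphabet.
counts = df["rating"].value_counts().sort_index()
```

Once the column carries this dtype, plotting functions that respect categorical order should draw the bars as low, medium, high without any extra argument.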
If your data is in a pandas Series, 1-d NumPy array, or list, you can also pass it directly to the countplot function, as with a Series named data_var.
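A small sketch of plotting a standalone Series follows. The values in data_var are invented, and the Series is passed by keyword as x=, on the assumption that recent seaborn releases expect vector data to be given that way.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical values; only the name data_var comes from the text.
data_var = pd.Series(["x", "y", "x", "z", "x", "y"])

# No DataFrame is needed: the Series itself supplies the levels to count.
ax = sns.countplot(x=data_var)
```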
Horizontal bar chart
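One way to get a horizontal bar chart, sketched here with made-up data, is to assign the categorical column to the "y" parameter of countplot instead of "x". The column name cat_var and the level names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical data with long level names; "cat_var" is an assumed column name.
df = pd.DataFrame({"cat_var": ["a fairly long category name",
                               "another long category name",
                               "a fairly long category name",
                               "yet another long name",
                               "a fairly long category name"]})

# Putting the variable on y makes the bars run horizontally,
# so the counts are read along the x-axis.
ax = sns.countplot(data=df, y="cat_var")
```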
Alternatively, you can use matplotlib's xticks function and its "rotation" parameter to change the orientation in which the labels will be depicted (as degrees counter-clockwise from horizontal).
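A minimal sketch of the rotation approach, again on invented data (cat_var and its levels are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data with long level names; "cat_var" is an assumed column name.
df = pd.DataFrame({"cat_var": ["first long category name",
                               "second long category name",
                               "first long category name"]})

ax = sns.countplot(data=df, x="cat_var")

# Rotate the x tick labels 90 degrees counter-clockwise from horizontal.
locs, labels = plt.xticks(rotation=90)
```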
By default, seaborn's countplot function will summarize and plot the data in terms of absolute frequency, or pure counts. In certain cases, you might want to understand the distribution of data or want to compare levels in terms of proportions of the whole. In this case, you will want to plot the data in terms of relative frequency, where the height indicates the proportion of data taking each level, rather than the absolute count.
One method of plotting the data in terms of relative frequency on a bar chart is to just relabel the counts axis in terms of proportions. The underlying data will be the same; only the scale of the axis ticks will change.
The xticks and yticks functions aren't only about rotating the tick labels. You can also get and set their locations and labels. The first argument takes the tick locations: in this case, the tick proportions multiplied back to be on the scale of counts. The second argument takes the tick names: in this case, the tick proportions formatted as strings to two decimal places.
I've also added a ylabel call to make it clear that we're no longer working with straight counts.
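The relabeling described above can be sketched as follows. The data, the column name cat_var, the 0.05 tick spacing, and the label text "proportion" are all illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 10 points spread over three levels.
df = pd.DataFrame({"cat_var": ["a"] * 5 + ["b"] * 3 + ["c"] * 2})

n_points = df.shape[0]
max_count = df["cat_var"].value_counts().max()
max_prop = max_count / n_points  # proportion taken by the tallest bar

# Tick proportions from 0 up to the tallest bar; the small epsilon
# keeps the endpoint itself in the range.
tick_props = np.arange(0, max_prop + 1e-9, 0.05)
tick_names = ["{:0.2f}".format(v) for v in tick_props]

ax = sns.countplot(data=df, x="cat_var")

# Locations are on the count scale; labels show the matching proportions.
plt.yticks(tick_props * n_points, tick_names)
plt.ylabel("proportion")
```

The bars themselves are unchanged; only the tick positions and their text differ from the default count axis.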
Rather than plotting the data on a relative frequency scale, you might use text annotations to label the frequencies on bars instead. This requires writing a loop over the tick locations and labels and adding one text element for each bar.
I use the .get_text() method to obtain the category name, so I can get the count of each category level. At the end, I use the text function to print each percentage, with the x-position, y-position, and string as the three main parameters to the function.
(Documentation: Text objects)
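A sketch of the annotation loop is shown below. The data and the column name cat_var are made up, the vertical offset of 0.1 is an arbitrary choice, and the explicit canvas draw is a precaution I have added so that the tick label text is filled in before it is read.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: 10 points spread over three levels.
df = pd.DataFrame({"cat_var": ["a"] * 5 + ["b"] * 3 + ["c"] * 2})

ax = sns.countplot(data=df, x="cat_var")
ax.figure.canvas.draw()  # make sure tick label text is populated

n_points = df.shape[0]
cat_counts = df["cat_var"].value_counts()

# Current tick locations and label objects on the x-axis.
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    # The label text is the category name, which indexes the counts.
    count = cat_counts[label.get_text()]
    pct_string = "{:0.1f}%".format(100 * count / n_points)
    # Place the percentage just above the top of the bar, centered on it.
    plt.text(loc, count + 0.1, pct_string, ha="center")
```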
One interesting way we can apply bar charts is through the visualization of missing data. We can use pandas functions to create a table with the number of missing values in each column.
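For instance, a table of missing-value counts can be built with isna followed by sum. The DataFrame below is a made-up example, not data from this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, 8.0, 9.0]})

# isna() marks each missing cell as True; sum() counts them per column.
na_counts = df.isna().sum()
```

The result is a Series indexed by column name, which is already in summarized form.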
What if we want to visualize these missing value counts? We could treat the variable names as levels of a categorical variable, and create a resulting bar plot. However, since the data is not in its tidy, unsummarized form, we need to make use of a different plotting function. Seaborn's barplot function is built to depict a summary of one quantitative variable against levels of a second, qualitative variable, but can be used here.
The x-values passed to the function are the column names, and the y-values are our counts.
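A sketch of plotting already-summarized counts with barplot follows. The DataFrame is invented for illustration, and the values are passed by keyword as x= and y=, on the assumption that recent seaborn releases expect that form.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, 8.0, 9.0]})

# One missing-value count per column.
na_counts = df.isna().sum()

# Column names act as the category levels; the counts set the bar heights.
ax = sns.barplot(x=na_counts.index.values, y=na_counts.values)
```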
As a general note, this is a useful function to keep in mind if your data is summarized and you still want to build a bar chart. If your data is not yet summarized, however, just use the countplot function so that you don't need to do extra summarization work. In addition, you'll see what barplot's main purpose is in the next lesson, when we discuss adaptations of univariate plots for plotting bivariate data.