Philosophy
AlgebraOfGraphics aims to be a declarative, question-driven language for data visualizations. This section describes its main guiding principles.
From question to plot
When analyzing a dataset, we often think in abstract, declarative terms. We have questions concerning our data, which can be answered by appropriate visualizations. For instance, we could ask whether a discrete variable :x affects the distribution of a continuous variable :y. We would then like to generate a visualization that answers this question.
In imperative programming, this would be implemented via the following steps.
- Pick the dataset.
- Divide the dataset into subgroups according to the values of :x.
- Compute the density of :y on each subgroup.
- Choose a plot attribute to distinguish subgroups, for instance color.
- Select as many distinguishable colors as there are unique values of :x.
- Plot all the density curves on top of each other.
- Create a legend, describing how unique values of :x are associated with colors.
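For comparison, the imperative steps above could be sketched roughly as follows. This is only an illustration and not part of AlgebraOfGraphics: it assumes the CairoMakie and KernelDensity packages are installed, and the toy dataset `df` is made up for the example.

```julia
using CairoMakie      # plotting backend (assumed to be installed)
using KernelDensity   # provides `kde` (assumed to be installed)

# 1. Pick the dataset (hypothetical toy data: discrete :x, continuous :y)
df = (x = repeat(["a", "b", "c"], 100), y = randn(300))

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "y", ylabel = "density")

# 2. Find the subgroups defined by the unique values of :x
groups = sort(unique(df.x))

# 4.-5. Use color to distinguish subgroups and pick one color per group
palette = Makie.wong_colors()

for (i, g) in enumerate(groups)
    ys = df.y[df.x .== g]   # values of :y in this subgroup
    k = kde(ys)             # 3. density of :y on the subgroup
    # 6. overlay the density curve, labeled for the legend
    lines!(ax, k.x, k.density; color = palette[i], label = g)
end

# 7. Create the legend that maps values of :x to colors
axislegend(ax)
fig
```

Every step is spelled out by hand, and each additional variable or attribute would add more bookkeeping of the same kind.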
While the above procedure is certainly feasible, it introduces cognitive overhead, especially when more variables and attributes are involved.
In a declarative framework, the user only needs to express the question, and the library takes care of creating the visualization. Let us solve the above problem on a toy dataset.
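using AlgebraOfGraphics, CairoMakie
df = (x = rand(["a", "b", "c"], 300), y = randn(300)) # toy dataset, assumed here for illustration
plt = data(df) # declare the dataset
plt *= AlgebraOfGraphics.density() # declare the analysis (qualified, as Makie also exports `density`)
plt *= mapping(:y) # declare the arguments of the analysis
plt *= mapping(color = :x) # declare the grouping and the respective visual attribute
draw(plt) # draw the visualization and its legend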
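Since each `*=` step multiplies one more factor into the specification, the same plot can also be written as a single product. The sketch below assumes a made-up toy dataset `df` and the CairoMakie backend; it is meant as an illustration rather than the canonical form.

```julia
using AlgebraOfGraphics, CairoMakie

# Hypothetical toy dataset, assumed only for illustration
df = (x = repeat(["a", "b", "c"], 100), y = randn(300))

# One product instead of successive `*=` updates: the positional argument
# and the keyword argument are passed in a single `mapping` call.
# `density` is qualified because Makie also exports a function of that name.
plt = data(df) * AlgebraOfGraphics.density() * mapping(:y, color = :x)

draw(plt)
```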
No mind reading
Plotting packages require the user to specify a large number of settings. The temptation is then to engineer a plotting library in such a way that it guesses what the user actually wanted. AlgebraOfGraphics follows a different approach, based on algebraic manipulations of plot descriptors.
The key intuition is that a large fraction of the "clutter" in a plot specification comes from repeating the same information over and over. Different layers of the same plot will share some but not all information, and the user should be able to distinguish settings that are private to a layer from those that are shared across layers.
We achieve this goal using the distributive properties of addition and multiplication. This is best explained by example. Let us assume that we wish to visually inspect whether a discrete variable :x affects the joint distribution of two continuous variables, :y and :z.
We would like to have two layers, one with the raw data, the other with an analysis (kernel density estimation).
Naturally, the axes should represent the same variables (:y and :z) for both layers. Only the density layer should be a contour plot, whereas only the scatter layer should have some transparency and be grouped (according to :x) in different subplots.
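# toy dataset, assumed here for illustration: discrete :x, continuous :y and :z
df = (x = rand(["a", "b", "c"], 300), y = randn(300), z = randn(300))
plt = data(df) *
    (
        visual(Scatter, alpha = 0.3) * mapping(layout = :x) +
        # `density` is qualified, as Makie also exports a function of that name
        AlgebraOfGraphics.density() * visual(Contour, colormap = Reverse(:grays))
    ) *
    mapping(:y, :z)
draw(plt)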
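To make the role of distributivity concrete, the factored specification can be compared with its expanded form, in which the shared settings are repeated for each layer. The sketch below assumes a made-up toy dataset `df` and the CairoMakie backend; the variable names `scatter_only`, `contour_only` and `shared_axes` are introduced only for this illustration.

```julia
using AlgebraOfGraphics, CairoMakie

# Hypothetical toy dataset, assumed only for illustration
df = (x = repeat(["a", "b", "c"], 100), y = randn(300), z = randn(300))

# Settings private to each layer
scatter_only = visual(Scatter, alpha = 0.3) * mapping(layout = :x)
contour_only = AlgebraOfGraphics.density() * visual(Contour, colormap = Reverse(:grays))

# Settings shared by both layers
shared_axes = mapping(:y, :z)

# Factored form: shared settings are written once
factored = data(df) * (scatter_only + contour_only) * shared_axes

# Expanded form: the product is distributed over the sum,
# so the dataset and the axis mapping appear in every layer
expanded =
    data(df) * scatter_only * shared_axes +
    data(df) * contour_only * shared_axes

draw(expanded)
```

Both forms describe the same two layers. The factored one avoids the repetition, which becomes more valuable as the number of layers grows.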