Hey! I worked on the plotnine guide (https://plotnine.org/guide/). Always interested to hear what people find hard to understand about plotnine, or what they wish there were more of (e.g. examples, guide pages, api reference docs).
(Both has2k1 and I work for Posit, which supports plotnine work, but authoring its guide was mostly an act of passion for me :)
If you already use plotnine, or if this has piqued your interest, the next release (v0.16.0) will bring nice capabilities.
You can get a sneak peek by installing the pre-release:
pip install --pre plotnine
Details here: https://github.com/has2k1/plotnine/issues/1031
Disclaimer: I'm the author.
Minor nit, the "Installing" link on the linked page leads to the general documentation.
Love your work on this, thanks for bringing the ggplot syntax to Python!
Plotnine has been great in my usage, but I see violin plots on the front page. Just say no to violin plots.
In almost any situation you either want to talk about the actual distribution (in which case plotting the distribution on one side of the line arranged horizontally is significantly superior to plotting it vertically on both sides of the line for some reason as a violin plot does[1]) or you want to talk about the quartiles etc in which case a boxplot is better.
A violin plot tries to do both and as a result does them both badly.
Extended anti-violin plot rant here https://www.youtube.com/watch?v=_0QMKFzW9fw
[1] I remember in one meeting before I knew better, producing some violin plots and putting them on a slide and I knew I had gone wrong when that slide came up and everyone in the room had this confused expression on their faces and was leaning their head over to the side to try to see the distribution better. When your visualization produces obvious confusion like that, you can be completely certain it has failed.
Violin plots have an interesting reputation (https://xkcd.com/1967/, https://www.reddit.com/r/labrats/comments/91ex4u/is_it_just_..., https://jabde.com/2022/12/22/banned-violin-plots/).
For showing distributions, I much prefer strip plots (https://seaborn.pydata.org/generated/seaborn.stripplot.html), perhaps with opacity, or swarm plots (https://seaborn.pydata.org/generated/seaborn.swarmplot.html) - no averaging with an unknown kernel, no hiding distributions behind a box plot, and the data is directly visible. We also directly see whether it is based on 5, 100, or many more points.
When using histograms, binning is usually more straightforward than kernels. And in any case, the mirror reflection of a histogram is not needed.
Another alternative is a raincloud plot (depending on the data).
The arguments in the video are mostly misdirected, conflations of related but distinct issues, or (hypocritically) just opinions. Here are the unpacked issues as I see it:
1. Whether to summarize data by a handful of summary statistics or a full density. Obviously, some statistics reported in isolation can misrepresent the underlying distribution, but these considerations ultimately depend on what specific point one seeks to make with a plot. There's no reason a priori that visually annotating summaries/quantiles on a distribution plot can't be helpful (quite the contrary).
2. Whether to "smooth the data" (read: perform kernel density estimation). In some sense this is a long-solved problem: there are mathematically grounded methods for estimating the optimal KDE bandwidth (with varying degrees of assumption on the underlying distribution), which are what's used by any serious plotting library. And whether authors adequately describe what they're actually plotting is a separate matter. That said, there are many reasons not to show a KDE over, say, a binned histogram, especially with raw data and/or small sample sizes, but these are entirely orthogonal to the choice of displaying a KDE as a violin plot versus something else.
3. How to normalize densities. With raw data, you probably want to compare frequencies (a proper pdf). If displaying just a single distribution, there's obviously no reason not to show the density (it's only a trivial rescaling of the axis tick values). When comparing multiple, the decision again depends on the point of the plot. Losing dynamic range for broader densities when compared with narrower ones can be counterproductive. E.g., in Bayesian parameter inference (where the data are MCMC samples), we almost never compute the actual normalization factor, but rather want to compare relative probabilities (i.e., within a single distribution) of different parameter values across different distributions. Of course, nothing forces one normalization over another for violin plots.
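To make points 2 and 3 concrete, here is a minimal pure-Python sketch. It assumes Silverman's rule of thumb as the bandwidth selector (one common default, not necessarily what any particular plotting library uses) and a Gaussian kernel. The function names and sample values are made up for illustration.

```python
import math
import statistics

def silverman_bandwidth(samples):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid`; the result is a proper pdf (area ~ 1)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

def peak_normalize(density):
    """Rescale so the maximum is 1: only relative heights within one curve matter."""
    top = max(density)
    return [d / top for d in density]

# Illustrative data only.
samples = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 4.0, 5.5]
step = 0.05
grid = [-10 + i * step for i in range(501)]  # covers -10 .. 15

bw = silverman_bandwidth(samples)
density = gaussian_kde(samples, grid, bw)   # area-normalized
area = sum(density) * step                  # Riemann-sum check of the area
relative = peak_normalize(density)          # peak-normalized
```

Either normalization can be drawn as a violin, a ridge, or a plain curve; the choice of normalization and the choice of display are independent.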
All of those are separate (and rectifiable!) issues from the defining characteristics of a violin plot:
1. Distributions displayed vertically rather than horizontally (both being harder to interpret and inappropriately suggestive). We almost exclusively visualize functions plotted vertically across a horizontal coordinate. I think this is the only valid, specific criticism of the common violin style itself, but the fix is of course trivial.
2. Horizontal violins are then only different from a ridge plot by a) not overlapping (which to me is a major improvement over standard ridge plots, but also trivially fixed) and b) being displayed symmetrically. I find it slightly easier to compare relative heights in the symmetric version, especially when comparing many distributions (such that each is relatively narrow). Even if not, the difference is so superficial/trivial that I don't find it worth arguing about.
Beyond this, the video's main argument (repeated every minute) seems to be that "it's bad, it's just bad", but there are only so many ways to make a 5 minute argument fill a 42 minute video. (This style of video is so grating to me.)
You couldn't pay me enough to be this opinionated about something so banal.
I've always liked the ggplot2 and the Grammar of Graphics approach to plotting so much so that I wrote my own DSL based on it - it is standalone, written in Rust, has WASM bindings (as you can see on the website) and more:
https://williamcotton.github.io/algraf
It pairs well with a related data translation DSL:
https://williamcotton.github.io/pdl
And you can see the two working together here:
https://williamcotton.github.io/datafarm-studio
There's LSPs for both, LSP clients for VS Code, and even language diagnostics for standalone Monaco editors in the browser.
Of note is that the same language diagnostics are exposed via the WASM as via the LSP interface allowing for the same friendly red squiggles to look and work the same in both your browser with Monaco and your editor with the LSP!
A year ago, I added R to the pipeline (with multiple complications) just to use ggplot2 - even though Python was the main tech.
https://quesma.com/blog/sandboxing-ai-generated-code-why-we-...
Good that ggplot2 can run inside WASM; see https://github.com/QuesmaOrg/webr-ggplot-playground
A big part of the motivation was that something like this...
...is just very slow. Booting R just to run ggplot2 was not cutting it compared to a custom DSL written in Rust!
BTW, that "R on the command line" tool was inspired by:
https://datascienceatthecommandline.com
Nifty! What motivated you to create these tools?
I really like to build things that build other things!
It is great, but I completely gave up on non-interactive plots a while ago.
You get so much more information in plots using bokeh (or I assume plotly).
Tooltips, zooming, interaction.
And the LLM helps a lot when the plot is a bit more complex.
And it comes with a tidyverse-like cheatsheet[1], which I confused with ggplot2's when I first discovered plotnine
[1]: https://github.com/rstudio/cheatsheets/blob/main/plotnine.pd...
Sorry for the confusion. Though, it is a mango tree in a mango garden! The continued development and maintenance of plotnine is supported by Posit, PBC, the same company behind the Tidyverse.
Disclaimer: I am the author.
While it is good to have plotnine for Python, ggplot2 in R has a whole ecosystem of packages/extensions that augment it, such as ggalluvial and ggrepel. More at this link: https://exts.ggplot2.tidyverse.org/gallery/ Having something like that would make it really compelling to dip my toes in the Python world. Regardless, I am liking this "ggplotification" of Python. All we need is more "tidification" of the Python data frame world.
If this was for JS and not Python I would be all over it for the upcoming NFL/NBA seasons and my stat visualizations
matplotlib is what most folks reach for by default, but I'd argue that's habit and popularity talking, not a real fit advantage. For a lot of plotting work, the grammar-of-graphics libraries — Altair, plotnine — may be the better tool for many people (and Agents); they just don't have the muscle memory behind them.
(Disclosure: I'm at Posit, which supports plotnine.)
What would you say are the benefits of the grammar-of-graphics approach? I've been working with plotting for more than a decade now and have never heard of it. Right now I'm looking through the gallery and can't really grasp what makes this approach better than the one in matplotlib.
PS. It took someone in the comments writing "import plotnine as p9" for me to understand it isn't plotLine.
It has to do with the structure of the information/inputs. It's more a DSL with a structure rather than a jumble of API arguments. So the first few times out it may take some getting used to, but as that structure becomes clear it becomes more intuitive. That's also why even with way more matplotlib examples in the world, LLM agents do better "guessing" how to use plotnine... or at least it was that way a few months ago when we were testing them head to head.
> What would you say are the benefits of the grammar-of-graphics approach?
When a mathematical formalism exists, just use that. Other approaches just reinvent the wheel on an ad-hoc/piecemeal basis and end up making all sorts of unnecessary compromises.
It supports plotnine via plot display window, do you mean?
Nice package! A side-by-side comparison with seaborn would be very nice to see.
I have used neither in quite a while now, but there is an alternative from JetBrains that I started using because it shares the same ergonomics and had better (?) documentation.
https://lets-plot.org/python/
It seems like the major difference between plotnine and lets-plot is that plotnine wraps Matplotlib (and thus works everywhere Matplotlib is available, but doesn't offer much interactivity), while lets-plot is written in Kotlin and seems to provide interactive plots.
Very cool will have to check this out more.
Semi related -- I made a little d3.js AI wrapper that works pretty well for making quick charts -- https://prompt2chart.com/share/e998a3f6-9482-4c18-931f-a4513...; https://prompt2chart.com/;
Loved plotnine when I switched over to Python, and it's great to see the project develop! But I have to admit I ended up switching to Altair after all, which has been my go-to in Python since.
Is the selling point for this vs e.g. Plotly just the ggplot style semantics?
The selling point is the grammar of graphics. See Wilkinson on the subject.
I always like to see Hadley Wickham’s masterpiece frameworks emerging around.
For another grammar-of-graphics-based visualization library (flexibly compose charts rather than simply pick a template), check out Altair https://altair-viz.github.io.
Nice one. I'm aware of plotnine because ggplot is more intuitive IMHO than say matplotlib.
2 things that would be awesome are interactive plots (hover + text box) and choropleth (tiled map) plots.
On closer look you have already nailed the latter!
There is this new and very promising package that adds interactivity to plotnine graphs https://y-sunflower.github.io/ninejs/.
Disclaimer: I am the author of plotnine.
See also Gribouille: A Grammar of Graphics for Typst, discussed here a week ago https://news.ycombinator.com/item?id=48541062.
After plotnine, with a solid & performant (more than the R versions) Python version of Purrr and Dplyr I might never reach for R again!
`from plotnine import *`
... I love the idea of a new python plotting library, but why is this anti-pattern so common with plotting libs?
While it’s generally considered to be bad practice to import everything into the global namespace, I think it’s fine to do this in an ad-hoc environment such as a notebook as it makes using the many functions plotnine provides more convenient. An additional advantage is that the resulting code more closely resembles the original ggplot2 code. Alternatively, it’s quite common to `import plotnine as p9` and prefix every function with `p9`.
Disclaimer: I made the plotnine homepage and cheatsheet.
The issue I find with the star-import pattern in docs/tutorials is that it drops the prefix, and the prefix makes it very obvious which functions are from this library.
It's particularly worthwhile when looking at the bigger examples that might involve another library, or stdlib functions/libs I haven't dealt with before
To each his own. Some drawbacks here are that it means that if you want to copy the code in, you have to add all the “p9”s. And, if you want to make a more complicated demo, with maybe a second import, you now have multiple conventions in your demo codebase.
> a new python plotting library
Whilst it's still not yet at 1.0.0, it's not that new: the first (0.1.0) release was in 2017: https://pypi.org/project/plotnine/#history
matplotlib's first release was in 2003, making it more than twice as old.
Because most of the time this is used not as part of a software development project but for producing publication plots in a script or plots in a notebook. It's not what you would want to do when incorporating it into a web app.
Even in a notebook it's a pain... import plotnine as p9 would be nicer.
Because it's aimed at data scientists who would rather be using R...
Using operator overloading of "+" to configure the plot is... a choice.
Plotnine is heavily inspired by the ggplot2 library, which uses the + operator in the same way: https://ggplot2.tidyverse.org/#usage
It is! And that's kinda the point.
Like others mention, this is inspired by ggplot2, a Grammar of Graphics library. The whole idea is graphics are composed by adding "layers", not like layers on a canvas, but like pouring paint into a pot, then the library understands the content and paints it to the canvas.
Layers might be pure data, geometry (lines, points, ...), annotations, styles, axis, etc.
When you get familiar with it, it's a much more natural way of describing plots, with better composition and easier exploration
And a side note, R actually includes a pipe operator (as in Unix `|`), and ggplot2's author has said that if he had known about it at the time, he would never have used the `+` operator
How is `plot(.. + ggsize(700, 300) + ..)` superior to a keyword parameter `plot(.. , size = (700, 300), ..)`?
Because, in the latter case, you have to declare a function argument for /every possible option/ that you want your graphics API to expose, and you need to do this every time you add a new option.
On the other hand, declaring the options through composition means that the API for "plot" remains static, and adding/removing options can be done trivially without an API change.
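A rough Python sketch of that point, using made-up names (`Plot`, `Size`, `Title` and `apply_to` are hypothetical and are not plotnine's or ggplot2's API): the plot object only knows how to accept "something that can apply itself", so a new option is a new class and the plot's own signature never changes.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Plot:
    """Hypothetical plot spec; it knows nothing about individual options."""
    settings: dict = field(default_factory=dict)

    def __add__(self, option):
        # Delegate to the option, so adding new option types needs no change here.
        return option.apply_to(self)

@dataclass(frozen=True)
class Size:
    width: int
    height: int

    def apply_to(self, plot):
        # Return a new Plot rather than mutating the old one.
        return Plot({**plot.settings, "size": (self.width, self.height)})

@dataclass(frozen=True)
class Title:
    text: str

    def apply_to(self, plot):
        return Plot({**plot.settings, "title": self.text})

base = Plot()
p = base + Size(700, 300) + Title("demo")
```

Here `Title` could have been added long after `Plot` and `Size` were written, without touching either of them.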
Composition (rather than parameters) is also more flexible. Let's say you want to divide your plot into three sub-plots, two of which are 200x200, and another which is 200x400. How do you express this as a keyword parameter? In composition, you could do something like:
plot( ggsubplot(ggvsplit(ggsize(200,400), gghsplit(ggsize(200,200), ggsize(200,200)))) )
And how do you express that as arithmetic?
I can get as far as `plot() / 3` but then no idea how to proceed. I don't think overloading arithmetic is a very good way to express this.
Isn't this somewhat the point of python's override capability? You can exactly define what addition means here, just like you have to define what addition means for a matrix or another data structure.
Separately you might find this to be smelly design, but then you should remember that this at least has precedent ¯\_(ツ)_/¯
Honestly operator overloading is almost always a bad choice IMO. There are cases (e.g. matrix math) where it's the right choice because of semantic clarity, but the indirection it creates in exchange for readability is costly. It's always obvious when a function is being called, and how to navigate to the implementation, but the same is not true of overloaded operators.
Well, it's like most tools. For this particular use, does it give more than it takes? If so, use it.
Matrix multiplication? Yeah, everybody knows there's a function being called. And, if it was implemented right, the users almost never have to look at the implementation.
Many other possible uses? Nope. Just nope. Not worth it.
kwargs exists, and is rather more pythonic. Just pass the kwargs dict to a standard formatting function, et voila.
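For comparison, a minimal sketch of what that kwargs approach might look like. The names (`plot`, `format_plot`, `DEFAULTS`) are hypothetical and not taken from any real library:

```python
# Hypothetical option table; a real library would have many more entries.
DEFAULTS = {"size": (640, 480), "title": ""}

def format_plot(**kwargs):
    """Merge user options over the defaults, rejecting unknown keys."""
    unknown = set(kwargs) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    return {**DEFAULTS, **kwargs}

def plot(data, **kwargs):
    """The signature stays short, but every option must still be known centrally."""
    return {"data": list(data), "options": format_plot(**kwargs)}

result = plot([1, 2, 3], size=(700, 300))
```

The function signature stays static this way too, but the set of valid options still lives in one central table, which is the trade-off against the composition style discussed above.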
Sadly, this drives towards the kind of API design that matplotlib has; the raison d'être of this library is its unusual (imho, great) API
Back when I did a lot of data stuff I used ggplot in R because it seemed to be popular, but I was just copy/pasting examples. Then one day I finally started to "get it" and actually read the manual. Learning the grammar of graphics was like a super power. I got to the point I could open pretty much anything people sent me and visualise it in a matter of seconds.
Although I've used Python professionally a lot more than R, I still felt like R was better at this. Somehow opening files in Python always feels a bit more "heavy". I don't really know why, though.
How does it handle realtime plotting?
finally a mature matplotlib alternative?
Seems like plotnine renders plots using matplotlib and has matplotlib as a dependency: https://github.com/has2k1/plotnine/blob/f6f5cb424f38329c5267...
Plotnine is my favorite python viz package, because ggplot2 is just so good.
Altair and Bokeh are also quite good for interactive graphs, but plotnine is so ergonomic.
Wonderful!!!