I Analyzed 14 Years of BMW Sales Data Here's What I Found

For my third data analyst portfolio project, I picked a dataset that felt more real-world than most beginner projects: BMW global sales data spanning from 2010 to 2024. Over 10,000 records covering models, regions, fuel types, pricing, mileage, and sales volume.

Here's a full walkthrough of what I did, what I found, and what I learned along the way.

Why BMW Sales Data?

I wanted a dataset with a mix of numeric and categorical columns so I could practice different chart types properly — not just slap a line chart on everything. BMW sales data gave me exactly that: categories like Region, Fuel Type, and Transmission sitting alongside numbers like Price, Mileage, and Engine Size.

Tools Used

Python — core language
Pandas — data loading, cleaning, groupby analysis
Matplotlib — all charts and the final dashboard
Seaborn — heatmaps
Jupyter Notebook — development environment

Step 1 — Loading & Checking the Data

First things first — load the data and understand what we're working with.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("bmw_sales_data.csv")
print("Shape:", data.shape)
data.head()

Then I checked for missing values two ways — with code and with a heatmap:

data.isnull().sum()

sns.heatmap(data.isnull())
plt.title("Missing Values Heatmap")
plt.show()

Result: Zero missing values. The dataset was completely clean, so I moved straight to analysis.

Step 2 — Descriptive Statistics

Before making any charts, I ran .describe() on both numeric and categorical columns to understand the ranges and distributions.

data.describe().round(2)           # numeric columns
data.describe(include="object")    # categorical columns

One important thing I learned: always check .info() first to confirm data types. A Year column stored as a string instead of an integer would break a time-series chart silently.

Step 3 — Correlation Heatmap

This was a new addition I added to this project. A correlation heatmap shows how all numeric columns relate to each other at once.

sns.heatmap(
    data.select_dtypes(include="number").corr().round(2),
    annot=True, cmap="coolwarm", linewidths=0.5
)
plt.title("Correlation Heatmap - Numeric Columns")
plt.show()

Finding: No strong correlations between any of the numeric columns. Price, Mileage, Engine Size, and Sales Volume all behave independently in this dataset.

Step 4 — Value Counts & Distribution Charts

This is where I learned one of the most important rules in data visualization: match the chart type to the data type.

Here's the rule I now follow:

Column Type	Chart to Use
Categorical (Model, Region, Fuel Type…)	Bar chart
Numeric (Price, Mileage, Engine Size…)	Histogram
Time-ordered (Year)	Line chart

In my earlier projects I was putting line charts on categorical columns, which doesn't make sense — a line implies order and continuity between points, and categories like "Diesel" and "Electric" have no natural order.

Example — correct chart for a categorical column:

value_count_Fuel_Type = data["Fuel_Type"].value_counts()

plt.bar(value_count_Fuel_Type.index, value_count_Fuel_Type.values,
        color="steelblue", edgecolor="black")
plt.title("Fuel Type Count Distribution")
plt.xlabel("Fuel Type")
plt.ylabel("Count")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Example — correct chart for a numeric column:

plt.hist(data["Price_USD"], bins=30, color="steelblue", edgecolor="black")
plt.title("Price Distribution")
plt.xlabel("Price (USD)")
plt.ylabel("Count")
plt.show()

Step 5 — Groupby Analysis

This is where the real insights come from. I ran 8 groupby analyses comparing key column pairs.

Groupby example — Average Price by Region:

Price_by_region = data.groupby("Region")["Price_USD"].mean().round(2)

plt.bar(Price_by_region.index, Price_by_region.values,
        color="steelblue", edgecolor="black")
plt.title("Average Price by Region")
plt.xlabel("Region")
plt.ylabel("Average Price (USD)")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Groupby example — Sales Volume trend over time:

sales_by_year = data.groupby("Year")["Sales_Volume"].mean().round(2)

plt.plot(sales_by_year.index, sales_by_year.values,
         marker="o", linestyle="--", color="green")
plt.title("Average Sales Volume by Year (2010–2024)")
plt.xlabel("Year")
plt.ylabel("Average Sales Volume")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Step 6 — Advanced Charts

Beyond groupby, I added three more chart types that tell a different story:

Scatter Plot — Mileage vs Price: Shows whether high mileage cars are cheaper. Spoiler: not really.

plt.scatter(data["Mileage_KM"], data["Price_USD"], alpha=0.1, color="purple")
plt.title("Mileage vs Price")
plt.xlabel("Mileage (KM)")
plt.ylabel("Price (USD)")
plt.show()

Boxplot — Price by Transmission: Shows the spread and outliers of price within each transmission type.

data.boxplot(column="Price_USD", by="Transmission", grid=False)
plt.title("Price Distribution by Transmission")
plt.suptitle("")
plt.ylabel("Price (USD)")
plt.show()

Crosstab Heatmap — Region vs Fuel Type: Counts how many cars of each fuel type exist in each region. This was the most visually interesting chart in the whole project.

ct = pd.crosstab(data["Region"], data["Fuel_Type"])
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")
plt.title("Region vs Fuel Type (Count)")
plt.show()

Step 7 — Final Dashboard

I combined the 9 most important charts into one 3×3 dashboard using plt.subplots.

fig, axes = plt.subplots(3, 3, figsize=(29, 24))
fig.suptitle("BMW DATA ANALYSIS REPORT (2010-2024)", fontsize=20, fontweight="bold")

# ... all 9 charts plotted on axes[row, col]

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

One bug I fixed here: when using sns.heatmap inside a subplot grid, you must pass ax=axes[row, col] — otherwise Seaborn draws the heatmap on a new figure instead of the correct subplot.

# Wrong:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")

# Correct:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu", ax=axes[2, 2])

Key Findings

Sales volume was stable across all 14 years — no big spikes or crashes, suggesting BMW maintained consistent global demand
All regions show similar average prices — BMW holds its pricing globally without major regional discounts
Electric vehicles are a growing share of the inventory, reflecting BMW's push into EVs
Mileage and Price show no strong relationship — price is driven by model and classification, not mileage alone
Automatic and Manual transmission cars are priced similarly on average

What I Learned

Chart type selection matters more than I thought — wrong chart type = misleading visualization
Always pass ax= to Seaborn when using subplots
Histograms are the correct tool for continuous numeric columns, not bar charts of value counts
A correlation heatmap at the start saves a lot of time by showing you which relationships are worth investigating

What's Next

This was Project 3 of my data analyst portfolio. Project 4 is something I'm really excited about — I designed and launched my own survey collecting original data on Student Career Readiness in Nepal. No Kaggle dataset, real responses from real students.

Follow along on:

Thanks for reading. If you're also building a data analyst portfolio, drop your project link in the comments — I'd love to see what you're working on.

I Analyzed 14 Years of BMW Sales Data Here's What I Found

Why BMW Sales Data?

Tools Used

Step 1 — Loading & Checking the Data

Step 2 — Descriptive Statistics

Step 3 — Correlation Heatmap

Step 4 — Value Counts & Distribution Charts

Step 5 — Groupby Analysis

Step 6 — Advanced Charts

Step 7 — Final Dashboard

Key Findings

What I Learned

What's Next

Comments

More from this blog

Titanic Survival Analysis — What the Data Reveals About Who Lived and Who Died

From Raw Data to Real Insights — My First HR Data Analysis Project

How Python code really Runs : Understanding Bytecode and The Python Virtual Machine

Command Palette

Why BMW Sales Data?

Tools Used

Step 1 — Loading & Checking the Data

Step 2 — Descriptive Statistics

Step 3 — Correlation Heatmap

Step 4 — Value Counts & Distribution Charts

Step 5 — Groupby Analysis

Step 6 — Advanced Charts

Step 7 — Final Dashboard

Key Findings

What I Learned

What's Next

Comments

More from this blog