I Analyzed 14 Years of BMW Sales Data Here's What I Found

I’m Sandip Subedi, a science student focused on building real, long-term skills in Python, Machine Learning, and modern web technologies. I believe in learning fundamentals deeply rather than rushing through shortcuts or tutorials. My approach is simple: understand how systems work, practice consistently, and document the journey through writing. I’m steadily working toward becoming an industry-ready engineer who solves real problems with clarity and intent.
For my third data analyst portfolio project, I picked a dataset that felt more real-world than most beginner projects: BMW global sales data spanning from 2010 to 2024. Over 10,000 records covering models, regions, fuel types, pricing, mileage, and sales volume.
Here's a full walkthrough of what I did, what I found, and what I learned along the way.
Why BMW Sales Data?
I wanted a dataset with a mix of numeric and categorical columns so I could practice different chart types properly — not just slap a line chart on everything. BMW sales data gave me exactly that: categories like Region, Fuel Type, and Transmission sitting alongside numbers like Price, Mileage, and Engine Size.
Tools Used
Python — core language
Pandas — data loading, cleaning, groupby analysis
Matplotlib — all charts and the final dashboard
Seaborn — heatmaps
Jupyter Notebook — development environment
Step 1 — Loading & Checking the Data
First things first — load the data and understand what we're working with.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv("bmw_sales_data.csv")
print("Shape:", data.shape)
data.head()
Then I checked for missing values two ways — with code and with a heatmap:
data.isnull().sum()
sns.heatmap(data.isnull())
plt.title("Missing Values Heatmap")
plt.show()
Result: Zero missing values. The dataset was completely clean, so I moved straight to analysis.
Step 2 — Descriptive Statistics
Before making any charts, I ran .describe() on both numeric and categorical columns to understand the ranges and distributions.
data.describe().round(2) # numeric columns
data.describe(include="object") # categorical columns
One important thing I learned: always check .info() first to confirm data types. A Year column stored as a string instead of an integer would break a time-series chart silently.
Step 3 — Correlation Heatmap
This was a new addition I added to this project. A correlation heatmap shows how all numeric columns relate to each other at once.
sns.heatmap(
data.select_dtypes(include="number").corr().round(2),
annot=True, cmap="coolwarm", linewidths=0.5
)
plt.title("Correlation Heatmap - Numeric Columns")
plt.show()
Finding: No strong correlations between any of the numeric columns. Price, Mileage, Engine Size, and Sales Volume all behave independently in this dataset.
Step 4 — Value Counts & Distribution Charts
This is where I learned one of the most important rules in data visualization: match the chart type to the data type.
Here's the rule I now follow:
| Column Type | Chart to Use |
|---|---|
| Categorical (Model, Region, Fuel Type…) | Bar chart |
| Numeric (Price, Mileage, Engine Size…) | Histogram |
| Time-ordered (Year) | Line chart |
In my earlier projects I was putting line charts on categorical columns, which doesn't make sense — a line implies order and continuity between points, and categories like "Diesel" and "Electric" have no natural order.
Example — correct chart for a categorical column:
value_count_Fuel_Type = data["Fuel_Type"].value_counts()
plt.bar(value_count_Fuel_Type.index, value_count_Fuel_Type.values,
color="steelblue", edgecolor="black")
plt.title("Fuel Type Count Distribution")
plt.xlabel("Fuel Type")
plt.ylabel("Count")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
Example — correct chart for a numeric column:
plt.hist(data["Price_USD"], bins=30, color="steelblue", edgecolor="black")
plt.title("Price Distribution")
plt.xlabel("Price (USD)")
plt.ylabel("Count")
plt.show()
Step 5 — Groupby Analysis
This is where the real insights come from. I ran 8 groupby analyses comparing key column pairs.
Groupby example — Average Price by Region:
Price_by_region = data.groupby("Region")["Price_USD"].mean().round(2)
plt.bar(Price_by_region.index, Price_by_region.values,
color="steelblue", edgecolor="black")
plt.title("Average Price by Region")
plt.xlabel("Region")
plt.ylabel("Average Price (USD)")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
Groupby example — Sales Volume trend over time:
sales_by_year = data.groupby("Year")["Sales_Volume"].mean().round(2)
plt.plot(sales_by_year.index, sales_by_year.values,
marker="o", linestyle="--", color="green")
plt.title("Average Sales Volume by Year (2010–2024)")
plt.xlabel("Year")
plt.ylabel("Average Sales Volume")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
Step 6 — Advanced Charts
Beyond groupby, I added three more chart types that tell a different story:
Scatter Plot — Mileage vs Price: Shows whether high mileage cars are cheaper. Spoiler: not really.
plt.scatter(data["Mileage_KM"], data["Price_USD"], alpha=0.1, color="purple")
plt.title("Mileage vs Price")
plt.xlabel("Mileage (KM)")
plt.ylabel("Price (USD)")
plt.show()
Boxplot — Price by Transmission: Shows the spread and outliers of price within each transmission type.
data.boxplot(column="Price_USD", by="Transmission", grid=False)
plt.title("Price Distribution by Transmission")
plt.suptitle("")
plt.ylabel("Price (USD)")
plt.show()
Crosstab Heatmap — Region vs Fuel Type: Counts how many cars of each fuel type exist in each region. This was the most visually interesting chart in the whole project.
ct = pd.crosstab(data["Region"], data["Fuel_Type"])
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")
plt.title("Region vs Fuel Type (Count)")
plt.show()
Step 7 — Final Dashboard
I combined the 9 most important charts into one 3×3 dashboard using plt.subplots.
fig, axes = plt.subplots(3, 3, figsize=(29, 24))
fig.suptitle("BMW DATA ANALYSIS REPORT (2010-2024)", fontsize=20, fontweight="bold")
# ... all 9 charts plotted on axes[row, col]
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
One bug I fixed here: when using sns.heatmap inside a subplot grid, you must pass ax=axes[row, col] — otherwise Seaborn draws the heatmap on a new figure instead of the correct subplot.
# Wrong:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")
# Correct:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu", ax=axes[2, 2])
Key Findings
Sales volume was stable across all 14 years — no big spikes or crashes, suggesting BMW maintained consistent global demand
All regions show similar average prices — BMW holds its pricing globally without major regional discounts
Electric vehicles are a growing share of the inventory, reflecting BMW's push into EVs
Mileage and Price show no strong relationship — price is driven by model and classification, not mileage alone
Automatic and Manual transmission cars are priced similarly on average
What I Learned
Chart type selection matters more than I thought — wrong chart type = misleading visualization
Always pass
ax=to Seaborn when using subplotsHistograms are the correct tool for continuous numeric columns, not bar charts of value counts
A correlation heatmap at the start saves a lot of time by showing you which relationships are worth investigating
What's Next
This was Project 3 of my data analyst portfolio. Project 4 is something I'm really excited about — I designed and launched my own survey collecting original data on Student Career Readiness in Nepal. No Kaggle dataset, real responses from real students.
Follow along on:
Thanks for reading. If you're also building a data analyst portfolio, drop your project link in the comments — I'd love to see what you're working on.

