Skip to main content

Command Palette

Search for a command to run...

I Analyzed 14 Years of BMW Sales Data Here's What I Found

Updated
6 min read

I Analyzed 14 Years of BMW Sales Data  Here's What I Found
S

I’m Sandip Subedi, a science student focused on building real, long-term skills in Python, Machine Learning, and modern web technologies. I believe in learning fundamentals deeply rather than rushing through shortcuts or tutorials. My approach is simple: understand how systems work, practice consistently, and document the journey through writing. I’m steadily working toward becoming an industry-ready engineer who solves real problems with clarity and intent.

For my third data analyst portfolio project, I picked a dataset that felt more real-world than most beginner projects: BMW global sales data spanning from 2010 to 2024. Over 10,000 records covering models, regions, fuel types, pricing, mileage, and sales volume.

Here's a full walkthrough of what I did, what I found, and what I learned along the way.


Why BMW Sales Data?

I wanted a dataset with a mix of numeric and categorical columns so I could practice different chart types properly — not just slap a line chart on everything. BMW sales data gave me exactly that: categories like Region, Fuel Type, and Transmission sitting alongside numbers like Price, Mileage, and Engine Size.


Tools Used

  • Python — core language

  • Pandas — data loading, cleaning, groupby analysis

  • Matplotlib — all charts and the final dashboard

  • Seaborn — heatmaps

  • Jupyter Notebook — development environment


Step 1 — Loading & Checking the Data

First things first — load the data and understand what we're working with.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("bmw_sales_data.csv")
print("Shape:", data.shape)
data.head()

Then I checked for missing values two ways — with code and with a heatmap:

data.isnull().sum()
sns.heatmap(data.isnull())
plt.title("Missing Values Heatmap")
plt.show()

Result: Zero missing values. The dataset was completely clean, so I moved straight to analysis.


Step 2 — Descriptive Statistics

Before making any charts, I ran .describe() on both numeric and categorical columns to understand the ranges and distributions.

data.describe().round(2)           # numeric columns
data.describe(include="object")    # categorical columns

One important thing I learned: always check .info() first to confirm data types. A Year column stored as a string instead of an integer would break a time-series chart silently.


Step 3 — Correlation Heatmap

This was a new addition I added to this project. A correlation heatmap shows how all numeric columns relate to each other at once.

sns.heatmap(
    data.select_dtypes(include="number").corr().round(2),
    annot=True, cmap="coolwarm", linewidths=0.5
)
plt.title("Correlation Heatmap - Numeric Columns")
plt.show()

Finding: No strong correlations between any of the numeric columns. Price, Mileage, Engine Size, and Sales Volume all behave independently in this dataset.


Step 4 — Value Counts & Distribution Charts

This is where I learned one of the most important rules in data visualization: match the chart type to the data type.

Here's the rule I now follow:

Column Type Chart to Use
Categorical (Model, Region, Fuel Type…) Bar chart
Numeric (Price, Mileage, Engine Size…) Histogram
Time-ordered (Year) Line chart

In my earlier projects I was putting line charts on categorical columns, which doesn't make sense — a line implies order and continuity between points, and categories like "Diesel" and "Electric" have no natural order.

Example — correct chart for a categorical column:

value_count_Fuel_Type = data["Fuel_Type"].value_counts()

plt.bar(value_count_Fuel_Type.index, value_count_Fuel_Type.values,
        color="steelblue", edgecolor="black")
plt.title("Fuel Type Count Distribution")
plt.xlabel("Fuel Type")
plt.ylabel("Count")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Example — correct chart for a numeric column:

plt.hist(data["Price_USD"], bins=30, color="steelblue", edgecolor="black")
plt.title("Price Distribution")
plt.xlabel("Price (USD)")
plt.ylabel("Count")
plt.show()

Step 5 — Groupby Analysis

This is where the real insights come from. I ran 8 groupby analyses comparing key column pairs.

Groupby example — Average Price by Region:

Price_by_region = data.groupby("Region")["Price_USD"].mean().round(2)

plt.bar(Price_by_region.index, Price_by_region.values,
        color="steelblue", edgecolor="black")
plt.title("Average Price by Region")
plt.xlabel("Region")
plt.ylabel("Average Price (USD)")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Groupby example — Sales Volume trend over time:

sales_by_year = data.groupby("Year")["Sales_Volume"].mean().round(2)

plt.plot(sales_by_year.index, sales_by_year.values,
         marker="o", linestyle="--", color="green")
plt.title("Average Sales Volume by Year (2010–2024)")
plt.xlabel("Year")
plt.ylabel("Average Sales Volume")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

Step 6 — Advanced Charts

Beyond groupby, I added three more chart types that tell a different story:

Scatter Plot — Mileage vs Price: Shows whether high mileage cars are cheaper. Spoiler: not really.

plt.scatter(data["Mileage_KM"], data["Price_USD"], alpha=0.1, color="purple")
plt.title("Mileage vs Price")
plt.xlabel("Mileage (KM)")
plt.ylabel("Price (USD)")
plt.show()

Boxplot — Price by Transmission: Shows the spread and outliers of price within each transmission type.

data.boxplot(column="Price_USD", by="Transmission", grid=False)
plt.title("Price Distribution by Transmission")
plt.suptitle("")
plt.ylabel("Price (USD)")
plt.show()

Crosstab Heatmap — Region vs Fuel Type: Counts how many cars of each fuel type exist in each region. This was the most visually interesting chart in the whole project.

ct = pd.crosstab(data["Region"], data["Fuel_Type"])
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")
plt.title("Region vs Fuel Type (Count)")
plt.show()

Step 7 — Final Dashboard

I combined the 9 most important charts into one 3×3 dashboard using plt.subplots.

fig, axes = plt.subplots(3, 3, figsize=(29, 24))
fig.suptitle("BMW DATA ANALYSIS REPORT (2010-2024)", fontsize=20, fontweight="bold")

# ... all 9 charts plotted on axes[row, col]

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

One bug I fixed here: when using sns.heatmap inside a subplot grid, you must pass ax=axes[row, col] — otherwise Seaborn draws the heatmap on a new figure instead of the correct subplot.

# Wrong:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu")

# Correct:
sns.heatmap(ct, annot=True, fmt="d", cmap="YlGnBu", ax=axes[2, 2])

Key Findings

  • Sales volume was stable across all 14 years — no big spikes or crashes, suggesting BMW maintained consistent global demand

  • All regions show similar average prices — BMW holds its pricing globally without major regional discounts

  • Electric vehicles are a growing share of the inventory, reflecting BMW's push into EVs

  • Mileage and Price show no strong relationship — price is driven by model and classification, not mileage alone

  • Automatic and Manual transmission cars are priced similarly on average


What I Learned

  1. Chart type selection matters more than I thought — wrong chart type = misleading visualization

  2. Always pass ax= to Seaborn when using subplots

  3. Histograms are the correct tool for continuous numeric columns, not bar charts of value counts

  4. A correlation heatmap at the start saves a lot of time by showing you which relationships are worth investigating


What's Next

This was Project 3 of my data analyst portfolio. Project 4 is something I'm really excited about — I designed and launched my own survey collecting original data on Student Career Readiness in Nepal. No Kaggle dataset, real responses from real students.

Follow along on:


Thanks for reading. If you're also building a data analyst portfolio, drop your project link in the comments — I'd love to see what you're working on.