Python Lecture 14: Working with Python Libraries - NumPy and Pandas
Welcome to an exciting lecture that opens the door to data science and scientific computing! Python's true power lies not just in the language itself, but in its vast ecosystem of libraries. Today we're exploring two of the most important libraries: NumPy for numerical computing and Pandas for data analysis. These libraries are fundamental tools used by data scientists, analysts, and developers worldwide.
Understanding Python libraries transforms you from writing everything from scratch to leveraging powerful, battle-tested tools built by experts. NumPy provides efficient array operations essential for scientific computing, machine learning, and data processing. Pandas offers intuitive data structures for working with structured data - the kind you'd find in spreadsheets and databases. Together, they form the foundation of Python's data science stack.
By the end of this comprehensive lecture, you'll understand how to install and import libraries, work with NumPy arrays for efficient numerical computations, use Pandas DataFrames for data manipulation and analysis, and apply these tools to solve real-world data problems. These skills are highly valuable in today's data-driven world. Let's dive in!
Understanding Python Libraries
A library is a collection of pre-written code that provides specific functionality. Instead of writing complex algorithms yourself, you import a library and use its functions. This is code reuse at scale - thousands of developers have contributed to these libraries, and you benefit from their work instantly.
Why Libraries Matter: Python's standard library is extensive, but third-party libraries expand capabilities enormously. Want to create websites? Use Django or Flask. Process images? Use Pillow. Perform machine learning? Use scikit-learn or TensorFlow. Libraries save time, provide tested code, and enable you to focus on your specific problem rather than reinventing common solutions.
Installing Libraries with pip: Python's package manager, pip, makes installing libraries simple. From your command line or terminal, run pip install library_name. For example: pip install numpy pandas. This downloads the libraries from PyPI (the Python Package Index) and installs them in your current Python environment. If the pip command is not on your PATH, python -m pip install numpy pandas does the same thing using the interpreter you name. Understanding pip is essential for working with Python professionally.
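To confirm that an install worked, you can ask Python itself which version it sees. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for this example and is not part of pip, NumPy, or Pandas.

```python
import importlib.util
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; assumes the import name and the
    PyPI distribution name match, which holds for numpy and pandas.
    """
    if importlib.util.find_spec(package_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("numpy", "pandas"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed - run: pip install {name}")
    else:
        print(f"{name}: {version}")
```

If a library shows as not installed even though pip reported success, pip and python are probably pointing at different environments.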
Introduction to NumPy - Numerical Python
NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides a powerful N-dimensional array object and tools for working with these arrays. NumPy arrays are much faster and more memory-efficient than Python lists for numerical operations.
Why NumPy Over Lists: Python lists are flexible but slow for numerical operations. NumPy arrays are homogeneous (all elements have the same type), stored contiguously in memory, and processed by compiled C code. For large arrays this often makes NumPy operations one to two orders of magnitude faster than the equivalent Python loop over a list, though the exact speedup depends on the operation, the array size, and the hardware. For data science and numerical computing, NumPy is essential.
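The homogeneous, contiguous layout can be observed directly. The sketch below compares the raw storage of an array with a rough estimate for an equivalent list; the variable names are illustrative, and the list figure varies by platform and Python version.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# Every element of the array shares one dtype and a fixed width in bytes.
print(arr.dtype, arr.itemsize)   # int64 8
print(arr.nbytes)                # 8000 bytes of raw data

# A list stores references to separate Python int objects, so a rough
# estimate adds the list itself and each object it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes)                # platform-dependent, larger than arr.nbytes

# Mixed numeric input is coerced to a single common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)               # float64
```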
The ndarray Object: NumPy's core is the ndarray (n-dimensional array) - a table of elements all of the same type, indexed by a tuple of non-negative integers. 1D arrays are like lists, 2D arrays are like matrices, and 3D or higher-dimensional arrays hold stacked data such as a batch of matrices. Understanding ndarrays is fundamental to numerical Python.
# First, install: pip install numpy
import numpy as np
# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5])
print("1D array:", arr1)
# 2D array (matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array:\n", arr2)
# Array properties
print("Shape:", arr2.shape) # (2, 3) - 2 rows, 3 columns
print("Size:", arr2.size) # 6 - total elements
print("Data type:", arr2.dtype) # int64 or int32
# Creating special arrays
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
identity = np.eye(3) # 3x3 identity matrix
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
print("Zeros:\n", zeros)
print("Range array:", range_arr)
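Indexing "by a tuple of integers" means you pass one index per dimension. Here is a short sketch of that idea on a 2D and a 3D array; the names grid and cube are chosen only for this example.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)     # 2D: 3 rows, 4 columns, values 0..11
print(grid.ndim)         # 2
print(grid[1, 2])        # 6  (row 1, column 2)
print(grid[:, 0])        # [0 4 8]  (every row, first column)

cube = np.arange(24).reshape(2, 3, 4)  # 3D: 2 blocks, each 3x4
print(cube.shape)        # (2, 3, 4)
print(cube[1, 0, 3])     # 15 (block 1, row 0, column 3)
```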
NumPy Array Operations
NumPy's power comes from vectorized operations - operations applied to entire arrays at once without explicit loops. This is both faster and more readable than iterating through lists.
import numpy as np
# Arithmetic operations (vectorized)
arr = np.array([1, 2, 3, 4, 5])
print("Original:", arr)
print("Add 10:", arr + 10)
print("Multiply by 2:", arr * 2)
print("Square:", arr ** 2)
# Array-to-array operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("arr1 + arr2:", arr1 + arr2)
print("arr1 * arr2:", arr1 * arr2)
# Statistical operations
data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Std deviation:", np.std(data))
print("Sum:", np.sum(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
# Boolean operations
arr = np.array([1, 2, 3, 4, 5])
print("Greater than 3:", arr > 3)
print("Elements > 3:", arr[arr > 3]) # Filtering!
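Vectorization Power: Instead of writing for i in range(len(arr)): arr[i] = arr[i] * 2, NumPy lets you write arr * 2, which returns a new array (use arr *= 2 to update the array in place, as the loop does). This is not just shorter - it is usually much faster for large arrays, because the loop runs in compiled code rather than in the Python interpreter. Prefer vectorized operations over explicit loops when working with NumPy.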
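You can measure the difference yourself. The sketch below times both approaches with the standard timeit module; the function names are made up for this example, and the timings you see will depend on your machine, so no specific numbers are claimed here.

```python
import timeit
import numpy as np

n = 100_000
arr = np.arange(n)


def double_with_loop(a):
    """Double each element one at a time in a Python loop."""
    out = a.copy()
    for i in range(len(out)):
        out[i] = out[i] * 2
    return out


def double_vectorized(a):
    """Double every element with a single vectorized expression."""
    return a * 2


# Both approaches must agree before comparing their speed.
assert np.array_equal(double_with_loop(arr), double_vectorized(arr))

loop_time = timeit.timeit(lambda: double_with_loop(arr), number=5)
vec_time = timeit.timeit(lambda: double_vectorized(arr), number=5)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```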
Introduction to Pandas - Data Analysis Library
Pandas is built on NumPy and provides high-level data structures and analysis tools. The two primary structures are Series (1-dimensional) and DataFrame (2-dimensional). DataFrames are like Excel spreadsheets or SQL tables - labeled rows and columns of potentially different types.
Why Pandas: While NumPy handles numerical arrays well, real-world data is messy - missing values, mixed types, labeled columns. Pandas handles this gracefully. It's designed for practical data analysis: loading data from files, cleaning messy data, filtering and grouping, merging datasets, and statistical analysis.
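As a small illustration of the "messy data" point, here is a sketch of how missing values can be counted, dropped, or filled. The data is invented for the example, and filling age with the column mean is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, np.nan, 35],
    "city": ["New York", None, "Paris"],
})

# Count missing entries in each column.
print(raw.isna().sum())

# Option 1: keep only rows with no missing values.
complete_rows = raw.dropna()

# Option 2: fill gaps with a per-column default.
filled = raw.fillna({"age": raw["age"].mean(), "city": "Unknown"})
print(filled)
```

Neither call modifies raw; both return new DataFrames.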
Series vs DataFrame: A Series is a single column with an index. A DataFrame is a table - multiple columns, each potentially a different type. Think of a DataFrame as a dictionary of Series objects, or a spreadsheet where each column is a Series. Understanding this relationship helps you work effectively with Pandas.
# First, install: pip install pandas
import pandas as pd
import numpy as np
# Creating a Series
series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Series:")
print(series)
print("\nAccess 'b':", series['b'])
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
    'Salary': [70000, 80000, 75000, 85000]
}
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)
# DataFrame properties
print("\nShape:", df.shape)
print("Columns:", df.columns.tolist())
print("Data types:\n", df.dtypes)
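The "dictionary of Series" view of a DataFrame can be made concrete. In this sketch, with invented names and labels, two Series that share an index are combined into a DataFrame, and selecting a column gives a Series back.

```python
import pandas as pd

ages = pd.Series([25, 30], index=["alice", "bob"])
cities = pd.Series(["New York", "London"], index=["alice", "bob"])

# Each dictionary key becomes a column; the shared index labels the rows.
people = pd.DataFrame({"Age": ages, "City": cities})
print(people)

age_column = people["Age"]
print(type(age_column).__name__)   # Series
print(people.loc["bob"])           # a single row is also returned as a Series
```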
Working with DataFrames
DataFrames support a wide range of operations for data exploration, cleaning, and analysis. Understanding these operations is key to effective data manipulation.
import pandas as pd
# Creating sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 60000, 75000, 65000, 68000]
})
# Selecting columns
print("Just names:", df['Name'])
print("\nMultiple columns:")
print(df[['Name', 'Salary']])
# Selecting the first rows with head()
print("\nFirst 3 rows:")
print(df.head(3))
# Filtering data
it_employees = df[df['Department'] == 'IT']
print("\nIT Employees:")
print(it_employees)
high_earners = df[df['Salary'] > 65000]
print("\nHigh earners:")
print(high_earners)
# Statistical summary
print("\nStatistics:")
print(df.describe())
# Group by operations
dept_avg_salary = df.groupby('Department')['Salary'].mean()
print("\nAverage salary by department:")
print(dept_avg_salary)
# Adding new column
df['Bonus'] = df['Salary'] * 0.1
print("\nWith bonus:")
print(df)
Real-World Application - Sales Analysis: Companies use Pandas to analyze sales data. Load transactions from CSV, group by product, region, or time period, calculate totals and averages, identify trends, find top performers, and generate reports. Steps that would each take a hand-written loop and several helper structures in pure Python are often a single chained expression in Pandas.
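As a rough sketch of the grouping step described above, the following computes a total, an average, and an order count per region in one chained expression. The transactions data and the column names are invented for the example.

```python
import pandas as pd

transactions = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "amount": [1000, 25, 1100, 900, 30],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
summary = (
    transactions
    .groupby("region")
    .agg(
        total=("amount", "sum"),
        average=("amount", "mean"),
        orders=("amount", "count"),
    )
    .sort_values("total", ascending=False)
)
print(summary)
```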
Reading and Writing Data
Pandas excels at reading data from various formats and writing results back. This is essential for practical data analysis workflows.
import pandas as pd
# Reading CSV files
# df = pd.read_csv('data.csv')
# Reading Excel files (needs an Excel engine such as openpyxl: pip install openpyxl)
# df = pd.read_excel('data.xlsx')
# Creating sample data
df = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [999, 25, 75, 350],
    'Quantity': [10, 50, 30, 15]
})
# Writing to CSV
df.to_csv('products.csv', index=False)
# Writing to Excel (needs an Excel engine such as openpyxl: pip install openpyxl)
df.to_excel('products.xlsx', index=False)
# Reading JSON
# df = pd.read_json('data.json')
# Writing to JSON
df.to_json('products.json', orient='records')
print("Data saved successfully!")
# Reading back
df_loaded = pd.read_csv('products.csv')
print("\nLoaded data:")
print(df_loaded)
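If you want to experiment with CSV reading and writing without creating files on disk, the same functions accept in-memory text buffers. This is a minimal sketch of that round trip; the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"Product": ["Laptop", "Mouse"], "Price": [999, 25]})

# Write CSV text into an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The index=False argument matters here as well: without it, the row index is written as an extra unnamed column and comes back as data when the text is read.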
Practical Data Analysis Example
import pandas as pd
import numpy as np
# Creating sample sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100)
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor'] * 25
sales_data = pd.DataFrame({
    'Date': dates,
    'Product': products,
    'Quantity': np.random.randint(1, 20, 100),
    'Price': np.random.randint(20, 1000, 100)
})
sales_data['Total'] = sales_data['Quantity'] * sales_data['Price']
print("=== Sales Analysis ===\n")
# Total revenue
total_revenue = sales_data['Total'].sum()
print(f"Total Revenue: ${total_revenue:,.2f}\n")
# Revenue by product
product_revenue = sales_data.groupby('Product')['Total'].sum().sort_values(ascending=False)
print("Revenue by Product:")
print(product_revenue)
print()
# Top product by revenue (not by units sold)
best_product = product_revenue.idxmax()
print(f"Top product by revenue: {best_product}\n")
# Average transaction value
avg_transaction = sales_data['Total'].mean()
print(f"Average transaction: ${avg_transaction:.2f}\n")
# Monthly summary
sales_data['Month'] = sales_data['Date'].dt.month
monthly_sales = sales_data.groupby('Month')['Total'].sum()
print("Monthly Sales:")
print(monthly_sales)
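One caveat about the monthly summary: grouping by the month number works here because the sample data stays within a single year. When data spans several years, the same month number from different years is merged. The sketch below, using invented dates, shows one way to keep them apart with dt.to_period("M").

```python
import pandas as pd

sales = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2023-12-30", "2024-01-05", "2024-01-20", "2024-12-02"]
    ),
    "Total": [100, 200, 300, 400],
})

# Month number only: December 2023 and December 2024 land in one group.
by_month_number = sales.groupby(sales["Date"].dt.month)["Total"].sum()
print(by_month_number)

# Year-month periods: each calendar month stays separate.
by_year_month = sales.groupby(sales["Date"].dt.to_period("M"))["Total"].sum()
print(by_year_month)
```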
Summary
Python libraries extend capabilities dramatically. You've learned:
✓ Installing libraries with pip
✓ NumPy arrays for numerical computing
✓ Vectorized operations for efficiency
✓ Pandas Series and DataFrames
✓ Data manipulation and analysis
✓ Reading and writing data files
✓ Real-world data analysis workflows
Practice Challenge: Create a student grade analysis system. Store student data in a DataFrame (names, subjects, scores). Calculate averages per student and per subject, identify top performers, find students needing help, generate statistical summaries, and save results to CSV. This combines NumPy and Pandas skills!

