SQL-Pandas: A better love story than Twilight

blogging
jupyter
Python
Back to Basics
Author

Kashish Mukheja

Published

Saturday, 27 January 2024

A Report by …

@Author: Kashish Mukheja

Introduction

mysqldatabase.jpg

In this report, we analyze NYC flight datasets and the project explores the translation of SQL queries into pandas DataFrame operations, aiming to understand the capability and equivalence of pandas in performing tasks traditionally handled by SQL.

Import Libraries and Setting up the SQLite Database Connection

We’ll start by importing the necessary libraries and establishing a connection with the SQLite database

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sqlite3
# Establishing a connection with a new SQLite database
conn = sqlite3.connect('nycflights13.db')

Exporting CSV Files to the Database

Next, we’ll export each CSV file to the SQLite database, creating a separate table for each dataset

# Define file paths
csv_files = {
    'flights': 'nycflights13_flights.csv',
    'airlines': 'nycflights13_airlines.csv',
    'airports': 'nycflights13_airports.csv',
    'planes': 'nycflights13_planes.csv',
    'weather': 'nycflights13_weather.csv'
}

# Exporting CSV files to SQLite database
for table_name, file_path in csv_files.items():
    # Read CSV file into a pandas DataFrame
    df = pd.read_csv(file_path)
    
    # Export DataFrame to SQLite database
    df.to_sql(table_name, conn, index=False, if_exists='replace')

By executing the code above, we have successfully established a connection with a new SQLite database named nycflights13.db and exported all the provided CSV files into separate tables within this database.

# Read all CSV files into pandas DataFrames
flights_df = pd.read_csv('nycflights13_flights.csv')
airlines_df = pd.read_csv('nycflights13_airlines.csv')
airports_df = pd.read_csv('nycflights13_airports.csv')
planes_df = pd.read_csv('nycflights13_planes.csv')
weather_df = pd.read_csv('nycflights13_weather.csv')
weather_df.head(1) # 3rd column is month
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
0 EWR 2013 1 1 0 37.04 21.92 53.97 230.0 10.35702 11.918651 0.0 1013.9 10.0 2013-01-01 1:00:00
weather_df.iloc[:,3]
0         1
1         1
2         1
3         1
4         1
         ..
26125    30
26126    30
26127    30
26128    30
26129    30
Name: day, Length: 26130, dtype: int64
for x in range(0,10,2):
    print(x)
0
2
4
6
8
my_tuple = (1,)
my_tuple, len(my_tuple)
((1,), 1)
my_tuple_2 = tuple([1])
my_tuple_2, len(my_tuple_2)
((1,), 1)
my_tuple_3 = (1)

Performing Data Analysis in Pandas Using SQL Columns

Reading Unique Engine Value from Planes

# Selecting distinct values of the "engine" column from the "planes" DataFrame
task_sql_distinct_engine = pd.read_sql_query("SELECT DISTINCT engine FROM planes;", conn)

# Pandas equivalent
task_my_distinct_engine = planes_df['engine'].unique()
task_my_distinct_engine = pd.DataFrame(task_my_distinct_engine, columns=['engine'])

# Check if the results match with the SQL query
if task_sql_distinct_engine.equals(task_my_distinct_engine):
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!
  • The SQL query selects distinct values of the “engine” column from the “planes” table, ensuring that each engine type appears only once in the result.
  • The pandas equivalent achieves the same by using the unique() function on the “engine” column of the planes_df DataFrame.
  • The comparison checks if the results obtained from the SQL query and the pandas equivalent are identical. If they are, it confirms that the pandas solution produces the same results as the SQL query, and if not, it indicates a discrepancy.

Reading Unique Type and Engine from Planes

# Selecting distinct values of the "type" and "engine" columns from the "planes" DataFrame
task_sql_distinct_type_engine = pd.read_sql_query("SELECT DISTINCT type, engine FROM planes;", conn)

# Pandas equivalent
task_my_distinct_type_engine = planes_df[['type', 'engine']].drop_duplicates()

# Reset index of both DataFrames
task_sql_distinct_type_engine_reset = task_sql_distinct_type_engine.reset_index(drop=True)
task_my_distinct_type_engine_reset = task_my_distinct_type_engine.reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_distinct_type_engine_reset, task_my_distinct_type_engine_reset) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

SQL Query and Pandas Equivalent:

The SQL query selects distinct combinations of values from the type and engine columns of the planes table. The pandas equivalent achieves the same by selecting the same columns from the planes_df DataFrame and then dropping duplicate rows.

Resetting Index: The index of both DataFrames is reset to ensure alignment for accurate comparison.

Comparison: The results obtained from the SQL query and pandas are compared using pd.testing.assert_frame_equal(). If the DataFrames are equal, it prints “Results match with the SQL query!”; otherwise, it prints “Results do not match with the SQL query.”

Count Occurrences of Each Engine Type

# Perform group by on 'engine' column and count the occurrences
task_sql_count_engine = pd.read_sql_query("SELECT COUNT(*), engine FROM planes GROUP BY engine;", conn)

# Pandas equivalent with column renaming and reordering
task_my_count_engine = planes_df.groupby('engine').size().reset_index(name='COUNT(*)')
task_my_count_engine = task_my_count_engine[['COUNT(*)', 'engine']]  # Reorder columns to match SQL query result

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_count_engine, task_my_count_engine) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Performing GroupBy and Counting Occurrences:

The SQL query selects the count of occurrences for each unique value in the “engine” column from the “planes” table after grouping by the “engine” column. The pandas equivalent achieves the same result by performing a GroupBy operation on the “engine” column of the planes_df DataFrame and then using the size() function to count the occurrences within each group. The result is stored in a new DataFrame with the column renamed to ’COUNT(*)’.

Column Renaming and Reordering: After renaming the column to ’COUNT(*)’ in the pandas DataFrame, we reorder the columns to match the order in the SQL query result. This ensures consistency in column names and order between the SQL query result and the pandas DataFrame.

Comparison: Finally, we compare the results obtained from the SQL query and the pandas operation using pd.testing.assert_frame_equal(). If the DataFrames are equal, it prints “Results match with the SQL query!”; otherwise, it prints “Results do not match with the SQL query.”

Count Occurrences of Each Engine and Type Combination

# Perform group by on 'engine' and 'type' columns and count the occurrences
task_sql_count_engine_type = pd.read_sql_query("SELECT COUNT(*), engine, type FROM planes GROUP BY engine, type;", conn)

# Pandas equivalent with column renaming and reordering
task_my_count_engine_type = planes_df.groupby(['engine', 'type']).size().reset_index(name='COUNT(*)')
task_my_count_engine_type = task_my_count_engine_type[['COUNT(*)', 'engine', 'type']]  # Reorder columns to match SQL query result

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_count_engine_type, task_my_count_engine_type) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Performing GroupBy and Counting Occurrences: The SQL query selects the count of occurrences for each unique combination of values in the “engine” and “type” columns from the “planes” table after grouping by both columns. The pandas equivalent achieves the same result by performing a GroupBy operation on both the “engine” and “type” columns of the planes_df DataFrame and then using the size() function to count the occurrences within each group. The result is stored in a new DataFrame with the column renamed to ’COUNT(*)’.

Column Renaming and Reordering: After renaming the column to ’COUNT(*)’ in the pandas DataFrame, we reorder the columns to match the order in the SQL query result. This ensures consistency in column names and order between the SQL query result and the pandas DataFrame.

Comparison: We compare the results obtained from the SQL query and the pandas operation using pd.testing.assert_frame_equal(). If the DataFrames are equal, it prints “Results match with the SQL query!”; otherwise, it prints “Results do not match with the SQL query.”

Aggregate Year Statistics by Engine and Manufacturer

# Perform group by on 'engine' and 'manufacturer' columns and calculate minimum, average, and maximum year
task_sql_engine_manufacturer_year = pd.read_sql_query("""
    SELECT MIN(year), AVG(year), MAX(year), engine, manufacturer 
    FROM planes 
    GROUP BY engine, manufacturer;
""", conn)

# Pandas equivalent with groupby and aggregation, and column reordering
task_my_engine_manufacturer_year = planes_df.groupby(['engine', 'manufacturer']).agg(
    min_year=('year', 'min'),
    avg_year=('year', 'mean'),
    max_year=('year', 'max')
).reset_index()

# Reorder columns to match desired order
task_my_engine_manufacturer_year = task_my_engine_manufacturer_year[['min_year', 'avg_year', 'max_year', 'engine', 'manufacturer']]

task_my_engine_manufacturer_year.columns = ['MIN(year)', 'AVG(year)', 'MAX(year)', 'engine', 'manufacturer']

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_engine_manufacturer_year, task_my_engine_manufacturer_year) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Performing GroupBy and Aggregation:

The SQL query groups the data by the engine and manufacturer columns and calculates the minimum, average, and maximum values of the year column within each group. The pandas equivalent achieves the same result by performing a GroupBy operation on the engine and manufacturer columns of the planes_df DataFrame. Then, the agg() function is used to specify the aggregation functions for the year column, including ‘min’, ‘mean’, and ‘max’. The result is stored in a new DataFrame with the specified column names for the aggregated values.

Column Reordering and Renaming: After computing the aggregated statistics, the columns in the pandas DataFrame are reordered to match the desired order of “MIN(year)”, “AVG(year)”, “MAX(year)”, “engine”, and “manufacturer”. Additionally, the column names are renamed to match the specified names.

Comparison: We compare the results obtained from the SQL query and the pandas operation using pd.testing.assert_frame_equal(). If the DataFrames are equal, it prints “Results match with the SQL query!”; otherwise, it prints “Results do not match with the SQL query.”

Filtering Rows with Non-null Speed Values

The code aims to retrieve rows from a dataset where the ‘speed’ column is not null

# Selecting rows where 'speed' column is not null
task_sql_speed_not_null = pd.read_sql_query("SELECT * FROM planes WHERE speed IS NOT NULL;", conn)

# Pandas equivalent
task_my_speed_not_null = planes_df[planes_df['speed'].notnull()].reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_speed_not_null, task_my_speed_not_null) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

This code snippet demonstrates how to filter rows with non-null values in a specific column, both in SQL and pandas, and verifies that the results match.

Summary:

  • The SQL query selects all columns from the “planes” table where the ‘speed’ column is not null.
  • The pandas equivalent achieves the same result by using boolean indexing to filter the DataFrame based on the condition that the ‘speed’ column is not null. The resulting DataFrame’s index is reset to ensure consistency.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

Filtering Planes by Seat Capacity and Year

The code aims to retrieve the tail numbers (‘tailnum’) of planes from a dataset where the seat capacity (‘seats’) falls within the range of 150 to 210 and the manufacturing year (‘year’) is greater than or equal to 2011.

# Selecting 'tailnum' from planes_df where 'seats' are between 150 and 210 and 'year' is greater than or equal to 2011
task_sql_filtered_planes = pd.read_sql_query("""
    SELECT tailnum FROM planes
    WHERE seats BETWEEN 150 AND 210 AND year >= 2011;
""", conn)

# Pandas equivalent
task_my_filtered_planes = planes_df[(planes_df['seats'].between(150, 210)) & (planes_df['year'] >= 2011)][['tailnum']].reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_filtered_planes, task_my_filtered_planes) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query selects the ‘tailnum’ column from the “planes” table where the ‘seats’ column falls within the range of 150 to 210 and the ‘year’ column is greater than or equal to 2011.
  • The pandas equivalent achieves the same result by applying boolean indexing to the planes_df DataFrame. It filters rows where the ‘seats’ column falls within the specified range and where the ‘year’ column is greater than or equal to 2011. Only the ‘tailnum’ column is selected in the resulting DataFrame, and the index is reset for consistency.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to filter rows based on multiple conditions, both in SQL and pandas, and verifies that the results match.

Filtering Planes by Manufacturer and Seat Capacity

The code aims to retrieve specific information about planes from a dataset, including the tail number (‘tailnum’), manufacturer, and seat capacity (‘seats’). The filtering criteria include selecting planes manufactured by “BOEING”, “AIRBUS”, or “EMBRAER”, with a seat capacity greater than 390.

# Selecting 'tailnum', 'manufacturer', and 'seats' from planes_df where 'manufacturer' is in ("BOEING", "AIRBUS", "EMBRAER") and 'seats' is greater than 390
task_sql_filtered_planes = pd.read_sql_query("""
    SELECT tailnum, manufacturer, seats FROM planes
    WHERE manufacturer IN ("BOEING", "AIRBUS", "EMBRAER") AND seats > 390;
""", conn)

# Pandas equivalent
manufacturers = ["BOEING", "AIRBUS", "EMBRAER"]
task_my_filtered_planes = planes_df[(planes_df['manufacturer'].isin(manufacturers)) & (planes_df['seats'] > 390)][['tailnum', 'manufacturer', 'seats']].reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_filtered_planes, task_my_filtered_planes) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary: - The SQL query selects the ‘tailnum’, ‘manufacturer’, and ‘seats’ columns from the “planes” table where the ‘manufacturer’ column is in (“BOEING”, “AIRBUS”, “EMBRAER”) and the ‘seats’ column is greater than 390. - The pandas equivalent achieves the same result by using boolean indexing with the isin() method to filter rows where the ‘manufacturer’ column is in the specified list of manufacturers and the ‘seats’ column is greater than 390. Only the selected columns (‘tailnum’, ‘manufacturer’, and ‘seats’) are retained in the resulting DataFrame, and the index is reset for consistency. - Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to filter rows based on multiple conditions involving categorical values and numeric values, both in SQL and pandas, and verifies that the results match.

Selecting Distinct Year and Seats with Ordering

The code aims to retrieve distinct combinations of ‘year’ and ‘seats’ from a dataset of planes, where the ‘year’ is greater than or equal to 2012. Additionally, it orders the results by ‘year’ in ascending order and ‘seats’ in descending order.

# Selecting distinct 'year' and 'seats' from planes_df where 'year' is greater than or equal to 2012
task_sql_distinct_year_seats = pd.read_sql_query("""
    SELECT DISTINCT year, seats FROM planes
    WHERE year >= 2012 ORDER BY year ASC, seats DESC;
""", conn)

# Pandas equivalent
task_my_distinct_year_seats = planes_df[planes_df['year'] >= 2012][['year', 'seats']].drop_duplicates().sort_values(by=['year', 'seats'], ascending=[True, False]).reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_distinct_year_seats, task_my_distinct_year_seats) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query selects distinct ‘year’ and ‘seats’ values from the “planes” table where the ‘year’ is greater than or equal to 2012. It orders the results by ‘year’ in ascending order and ‘seats’ in descending order.
  • The pandas equivalent achieves the same result by first filtering the planes_df DataFrame to include only rows where the ‘year’ is greater than or equal to 2012. Then, it selects only the ‘year’ and ‘seats’ columns, drops duplicate rows, and sorts the DataFrame by ‘year’ in ascending order and ‘seats’ in descending order. The index is reset for consistency.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to select distinct values of multiple columns, filter rows based on a condition, and order the results using both SQL and pandas. It also verifies that the results obtained from pandas match those from the SQL query.

Selecting Distinct Year and Seats with Ordering

The code aims to retrieve distinct combinations of ‘year’ and ‘seats’ from a dataset of planes, where the ‘year’ is greater than or equal to 2012. Additionally, it orders the results by ‘seats’ in descending order and ‘year’ in ascending order.

# Selecting distinct 'year' and 'seats' from planes_df where 'year' is greater than or equal to 2012
task_sql_distinct_year_seats = pd.read_sql_query("""
    SELECT DISTINCT year, seats FROM planes
    WHERE year >= 2012 ORDER BY seats DESC, year ASC;
""", conn)

# Pandas equivalent
task_my_distinct_year_seats = planes_df[planes_df['year'] >= 2012][['year', 'seats']].drop_duplicates().sort_values(by=['seats', 'year'], ascending=[False, True]).reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_distinct_year_seats, task_my_distinct_year_seats) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query selects distinct ‘year’ and ‘seats’ values from the “planes” table where the ‘year’ is greater than or equal to 2012. It orders the results by ‘seats’ in descending order and ‘year’ in ascending order.
  • The pandas equivalent achieves the same result by first filtering the planes_df DataFrame to include only rows where the ‘year’ is greater than or equal to 2012. Then, it selects only the ‘year’ and ‘seats’ columns, drops duplicate rows, and sorts the DataFrame by ‘seats’ in descending order and ‘year’ in ascending order. The index is reset for consistency.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to select distinct values of multiple columns, filter rows based on a condition, and order the results using both SQL and pandas. It also verifies that the results obtained from pandas match those from the SQL query.

Counting Planes by Manufacturer with Seat Capacity Condition

The code aims to count the number of planes manufactured by each manufacturer where the seat capacity is greater than 200.

# Performing group by on 'manufacturer' and counting the occurrences where 'seats' is greater than 200
task_sql_count_manufacturer = pd.read_sql_query("""
    SELECT manufacturer, COUNT(*) FROM planes
    WHERE seats > 200 GROUP BY manufacturer;
""", conn)

# Pandas equivalent
task_my_count_manufacturer = planes_df[planes_df['seats'] > 200].groupby('manufacturer').size().reset_index(name='COUNT(*)')

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_count_manufacturer, task_my_count_manufacturer) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query groups the planes by ‘manufacturer’ and counts the occurrences where the seat capacity (‘seats’) is greater than 200.
  • The pandas equivalent achieves the same result by first filtering the planes_df DataFrame to include only rows where the seat capacity is greater than 200. Then, it performs a group by on the ‘manufacturer’ column and calculates the size of each group to get the count. The result is stored in a DataFrame with columns ‘manufacturer’ and ’COUNT(*)’.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to perform grouping and aggregation operations in both SQL and pandas to obtain the count of planes by manufacturer, considering a specific condition on the seat capacity. It also verifies that the results obtained from pandas match those from the SQL query.

Counting Planes by Manufacturer with Count Condition

The code aims to count the number of planes manufactured by each manufacturer and filter the results to include only those manufacturers with a count greater than 10.

# Performing group by on 'manufacturer', counting the occurrences, and filtering where count is greater than 10
task_sql_count_manufacturer = pd.read_sql_query("""
    SELECT manufacturer, COUNT(*) FROM planes
    GROUP BY manufacturer HAVING COUNT(*) > 10;
""", conn)

# Pandas equivalent
manufacturer_counts = planes_df.groupby('manufacturer').size().reset_index(name='COUNT(*)')
task_my_count_manufacturer = manufacturer_counts[manufacturer_counts['COUNT(*)'] > 10].reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_count_manufacturer, task_my_count_manufacturer) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query groups the planes by ‘manufacturer’, counts the occurrences for each group, and filters the results to include only those where the count is greater than 10.
  • The pandas equivalent first performs a group by on the ‘manufacturer’ column and calculates the size of each group to get the count of planes for each manufacturer. Then, it filters the resulting DataFrame to include only those manufacturers where the count is greater than 10. The index is reset for consistency.
  • Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to perform grouping, aggregation, and filtering operations in both SQL and pandas to obtain the count of planes by manufacturer, considering a condition on the count. It also verifies that the results obtained from pandas match those from the SQL query.

Counting Planes by Manufacturer with Seat Capacity Condition and Count Condition

The code aims to count the number of planes manufactured by each manufacturer, considering only those with a seat capacity greater than 200, and then filter the results to include only manufacturers with a count greater than 10.

# Performing group by on 'manufacturer', counting the occurrences, and filtering where count is greater than 10
task_sql_count_manufacturer = pd.read_sql_query("""
    SELECT manufacturer, COUNT(*) FROM planes
    WHERE seats > 200 GROUP BY manufacturer HAVING COUNT(*) > 10;
""", conn)

# Pandas equivalent
manufacturer_counts = planes_df[planes_df['seats'] > 200].groupby('manufacturer').size().reset_index(name='COUNT(*)')
task_my_count_manufacturer = manufacturer_counts[manufacturer_counts['COUNT(*)'] > 10].reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_count_manufacturer, task_my_count_manufacturer) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • The SQL query first filters the planes where seat capacity (‘seats’) is greater than 200. Then, it groups the planes by ‘manufacturer’, counts the occurrences for each group, and filters the results to include only those where the count is greater than 10.
  • The pandas equivalent achieves the same result by first filtering the planes_df DataFrame to include only rows where the seat capacity is greater than 200. Then, it performs a group by on the ‘manufacturer’ column and calculates the size of each group to get the count of planes for each manufacturer. Finally, it filters the resulting DataFrame to include only those manufacturers where the count is greater than 10. The index is reset for consistency.
  • The code then compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to perform grouping, aggregation, and filtering operations in both SQL and pandas to obtain the count of planes by manufacturer, considering both a condition on seat capacity and a count condition. It also verifies that the results obtained from pandas match those from the SQL query.

Top 10 Manufacturers by Plane Count

The code aims to identify the top 10 manufacturers based on the count of planes they have produced.

# Performing group by on 'manufacturer', counting the occurrences, and sorting in descending order
task_sql_top_manufacturers = pd.read_sql_query("""
    SELECT manufacturer, COUNT(*) AS howmany 
    FROM planes
    GROUP BY manufacturer
    ORDER BY howmany DESC LIMIT 10;
""", conn)

# Pandas equivalent
manufacturer_counts = planes_df.groupby('manufacturer').size().reset_index(name='howmany')
task_my_top_manufacturers = manufacturer_counts.sort_values(by='howmany', ascending=False).head(10).reset_index(drop=True)

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_top_manufacturers, task_my_top_manufacturers) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary: - The SQL query groups the planes by ‘manufacturer’, counts the occurrences for each group, and sorts the results in descending order of count. It then selects the top 10 manufacturers based on this count. - The pandas equivalent achieves the same result by first performing a group by on the ‘manufacturer’ column and calculating the size of each group to get the count of planes for each manufacturer. The count column is renamed to ‘howmany’. Then, it sorts the DataFrame in descending order of the count and selects the top 10 rows. The index is reset for consistency. - Finally, the code compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to find the top 10 manufacturers by plane count using both SQL and pandas. It also verifies that the results obtained from pandas match those from the SQL query.

Merging Flights Data with Plane Information

The code aims to combine information from the flights table with additional data about planes, including year, speed, and seats, obtained from the planes table.

# SQL Query
task_sql_flights_planes = pd.read_sql_query("""
    SELECT
        flights.*,
        planes.year AS plane_year,
        planes.speed AS plane_speed,
        planes.seats AS plane_seats
    FROM flights
    LEFT JOIN planes ON flights.tailnum=planes.tailnum;
""", conn)

# Pandas Equivalent
planes_subset = planes_df[['tailnum', 'year', 'speed', 'seats']].copy()
planes_subset.columns = ['tailnum', 'plane_year', 'plane_speed', 'plane_seats']
task1_my_flights_planes = pd.merge(flights_df, planes_subset, on='tailnum', how='left')

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_flights_planes, task1_my_flights_planes) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary: - SQL Query: The SQL query selects all columns from the ‘flights’ table and adds columns for ‘year’, ‘speed’, and ‘seats’ from the ‘planes’ table using a LEFT JOIN operation. - Pandas Equivalent: The pandas equivalent first creates a subset of columns from the ‘planes_df’ DataFrame, including ‘tailnum’, ‘year’, ‘speed’, and ‘seats’. The column names are then renamed to match those specified in the SQL query. This subset is merged with the ‘flights_df’ DataFrame based on the common column ‘tailnum’, using a left join operation. - The code then compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to merge data from two tables using both SQL and pandas, and it verifies that the results obtained from pandas match those from the SQL query.

Merging Plane and Airline Data

The code aims to combine information from the planes and airlines tables using data from the flights table as a bridge. Additionally, it reorders the columns in the merged DataFrame to match a specified order.

# SQL Query
task_sql_planes_airlines = pd.read_sql_query("""
    SELECT planes.*, airlines.* FROM
    (SELECT DISTINCT carrier, tailnum FROM flights) AS cartail
    INNER JOIN planes ON cartail.tailnum=planes.tailnum
    INNER JOIN airlines ON cartail.carrier=airlines.carrier;
""", conn)

# Pandas Equivalent
cartail = flights_df[['carrier', 'tailnum']].drop_duplicates()
merged_data = pd.merge(cartail, planes_df, on='tailnum', how='inner')
task_my_planes_airlines = pd.merge(merged_data, airlines_df, on='carrier', how='inner')

# Sorting by tailnum
task_my_planes_airlines = task_my_planes_airlines.sort_values(by=['tailnum', 'carrier']).reset_index(drop=True)

# Reordering columns in task_my_planes_airlines DataFrame
task_my_planes_airlines = task_my_planes_airlines[['tailnum', 'year', 'type', 'manufacturer', 'model', 'engines', 'seats', 'speed', 'engine', 'carrier', 'name']]

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_planes_airlines, task_my_planes_airlines) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • SQL Query: The SQL query selects all columns from the ‘planes’ and ‘airlines’ tables, inner joining them with a subquery that selects distinct combinations of ‘carrier’ and ‘tailnum’ from the ‘flights’ table. The join conditions ensure that ‘tailnum’ matches between ‘cartail’ (the result of the subquery) and ‘planes’, and ‘carrier’ matches between ‘cartail’ and ‘airlines’.
  • Pandas Equivalent: The pandas equivalent first creates a subset of unique combinations of ‘carrier’ and ‘tailnum’ from the ‘flights_df’ DataFrame. This subset is merged with the ‘planes_df’ DataFrame based on ‘tailnum’ and then with the ‘airlines_df’ DataFrame based on ‘carrier’. The resulting DataFrame is sorted by ‘tailnum’ and ‘carrier’. Finally, the columns are reordered to match the specified order.
  • The code then compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to merge data from multiple tables using both SQL and pandas, reorder the columns in the resulting DataFrame, and verify the equivalence of results obtained from pandas and SQL queries.

Merging Flight and Weather Data for EWR Airport

The purpose of this code is to combine flight information from the flights table with weather data from the weather table for flights originating from EWR airport, and calculating the mean temperature and humidity, and finally comparing the results from the SQL dataframe and pandas dataframe.

# SQL Query
task_sql_flight_weather = pd.read_sql_query("""
    SELECT flights2.*, atemp, ahumid
    FROM (
        SELECT * FROM flights WHERE origin='EWR'
    ) AS flights2 
    LEFT JOIN (
        SELECT year, month, day, AVG(temp) AS atemp, AVG(humid) AS ahumid
        FROM weather
        WHERE origin='EWR'
        GROUP BY year, month, day
    ) AS weather2
    ON flights2.year=weather2.year AND flights2.month=weather2.month AND flights2.day=weather2.day
""", conn)

# Pandas Equivalent with column renaming
ewr_flights = flights_df[flights_df['origin'] == 'EWR'].copy()
weather_ewr = weather_df[weather_df['origin'] == 'EWR'].groupby(['year', 'month', 'day']).agg({'temp': 'mean', 'humid': 'mean'}).reset_index()
weather_ewr.rename(columns={'temp': 'atemp', 'humid': 'ahumid'}, inplace=True)  # Rename columns
merged_data = pd.merge(ewr_flights, weather_ewr, on=['year', 'month', 'day'], how='left')

# Check if the results match with the SQL query
if pd.testing.assert_frame_equal(task_sql_flight_weather, merged_data) is None:
    print("Results match with the SQL query!")
else:
    print("Results do not match with the SQL query.")
Results match with the SQL query!

Summary:

  • SQL Query: The SQL query selects flight data for flights originating from EWR and left joins it with aggregated weather data for EWR airport, averaging temperature and humidity values by year, month, and day. The resulting columns are named ‘atemp’ and ‘ahumid’.
  • Pandas Equivalent with Column Renaming: In the pandas equivalent, flights and weather data are filtered for EWR airport, and the temperature and humidity columns are averaged by year, month, and day. These columns are then renamed to ‘atemp’ and ‘ahumid’, respectively, using the rename() method. The dataframes are merged based on year, month, and day using a left join operation.
  • The code then compares the results obtained from pandas with those from the SQL query using pd.testing.assert_frame_equal() to ensure their equivalence.

This code snippet demonstrates how to merge flight and weather data using pandas, specifically for flights originating from EWR airport, while renaming the columns, and it verifies the equivalence of results obtained from pandas and SQL queries.

Conclusion

The translation of SQL queries to pandas DataFrame operations provides valuable insights into pandas’ effectiveness in data manipulation tasks. By ensuring equivalence between SQL and pandas results, we validate pandas as a robust tool for data analysis, offering flexibility and ease of use comparable to traditional SQL queries. This project underscores the importance of understanding pandas’ capabilities for proficient data analysis and reinforces its role as a powerful tool in the data science toolkit.

Thank You for Reading the Report

Back to top