Arc Diagrams in D3.js: Visualizing Taxi Pickup and Dropoff Data

[Figure: arc diagram of ride-hailing pickups and dropoffs between NYC boroughs and airports]

An arc diagram is a special type of network graph.

A graph is made up of nodes that represent some type of entity or object, and the links that connect them.

In an arc diagram, the nodes all lie along one axis, and the links between them are drawn as arcs.

Ride Hailing Apps in NYC

This arc diagram visualizes rides from the ride-hailing apps Uber, Lyft, Via, and Gett/Juno in NYC.

  • Each ride starts in a borough or at an airport, and ends in a borough or at an airport.
  • The nodes in this graph are the boroughs and airports, which are the circles in the middle of the diagram.
  • The links connect two nodes, from the pickup location to the dropoff location.
  • Link thickness is based on the count of rides between the two nodes. You can see that the link from Manhattan to Brooklyn is thicker than the link from Manhattan to the Bronx, because there were more rides between Manhattan and Brooklyn.

We will build this arc diagram in two parts.

Part One: Data Preparation

In this post I'm going to go over data collection and preparation using Python and Pandas.

import pandas as pd

There are three data sources that we need for this project. We will do some data wrangling with them and ultimately end up with a dataset representing the nodes and links that will make up the graph.

Then we will output the data in JSON format and write it to a file.

Part Two: D3.js Visualization

In the next post, we will take the JSON data and actually create the arc diagram in D3.js.

Stay tuned for that!


Three Data Sources

We will be using three data sources for this project.

  1. For Hire Vehicle trip data

    • Each row in this dataset represents an individual ride.
    • The full dataset has around 41 million rows.
  2. Taxi zones

    • NYC is divided up into numbered geographic zones, and they are used in the first dataset to indicate pickup and dropoff locations for a ride.
  3. Base dispatch license numbers

    • These indicate the company, such as Uber or Lyft, and we are using them as a filter on the first dataset to only fetch the ride data for the ride-hailing apps.

For Hire Vehicle Trip Data

Uber, Lyft, and other apps fall under the For Hire Vehicle (FHV) umbrella, and there is data for these trips on NYC Open Data.

This dataset has approximately 41 million rows with one for each trip recorded, and about 20 million of them are for the big ride hailing apps that are covered in this visualization.

Aside from the ride hailing apps, there are a bunch of smaller dispatch companies that make up the rest of the 41 million rows.

It does not include yellow or green taxi trip data.

To create this graphic, I fetched all 20 million of the relevant rows from the API, but for this post, we will fetch a much smaller number.

There are six columns in this dataset:

  • Dispatching base license number
  • Pickup datetime
  • Dropoff datetime
  • Pickup location ID - corresponds to a taxi zone.
  • Dropoff location ID - also corresponds to a taxi zone.
  • Shared ride flag - indicates a shared ride, which is a feature in some apps.

As mentioned, each company, e.g. Uber, has one or more base license numbers, which we will use to filter the data and only fetch rows corresponding to one of the apps from the API.

So we need the base license numbers for Uber, Lyft, Via, and Gett/Juno, which I will talk about in a bit.

The pickup and dropoff location IDs are integers, and they correspond to taxi zones.


Taxi Zones

Basically the city is divided into numbered zones that correspond to different geographic areas, and this data provides more information like the borough or airport name for each of the numbered zones.

In the FHV trip data, the pickup and dropoff location IDs correspond to a numbered taxi zone, so we will use the zone number as a key to map the borough or airport name back to the trip data.

The data for these zones can be found on NYC Open Data.

You can download it in CSV format and then load it into a Pandas DataFrame.

zones = pd.read_csv(path_to_downloaded_file)

We are only concerned with three columns: LocationID, zone, and borough.

zones[['LocationID', 'zone', 'borough']].head()

   LocationID                     zone        borough
0  1           Newark Airport           EWR          
1  2           Jamaica Bay              Queens       
2  3           Allerton/Pelham Gardens  Bronx        
3  4           Alphabet City            Manhattan    
4  5           Arden Heights            Staten Island
  • The location ID is a number that represents a geographic area.
  • Zones: you can see here one of the zones is Newark Airport, and there are zones for LaGuardia and JFK airports as well. The rest are NYC neighborhoods, which I am not concerned with for this visualization.
  • The borough column is mostly self-explanatory - Newark Airport has EWR (the airport code) in that column, while the other two airports are in Queens and have that as their borough.

Create a new column with Pandas apply()

I'm going to create a new column using the Pandas apply function that contains either the borough the zone is in, or the zone name itself if the zone is one of the three airports.

def zone_location(row):
    airports = ['JFK Airport', 'LaGuardia Airport', 'Newark Airport']
    if row['zone'] in airports:
        return row['zone']
    return row['borough']


zones['borough_airport'] = zones.apply(zone_location, axis=1)
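If the zones table were much larger, row-wise apply could get slow. For reference, here is an equivalent vectorized version using Series.where - just a sketch of an alternative style that produces the same column:

airports = ['JFK Airport', 'LaGuardia Airport', 'Newark Airport']
# Keep the zone name where it is an airport, otherwise fall back to the borough.
zones['borough_airport'] = zones['zone'].where(zones['zone'].isin(airports), zones['borough'])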

You can see that this new column has either the airport name or the borough name.

zones['borough_airport'].head()

0    Newark Airport
1    Queens        
2    Bronx         
3    Manhattan     
4    Staten Island 
Name: borough_airport, dtype: object

Mapping the borough or airport back to the FHV trip records

Later we will want to map this column to the pickup and dropoff location IDs in the FHV trip data, so that we have the borough or airport information for each ride's pickup and dropoff.

So I am going to create a Python dictionary with the location ID column as keys and the borough/airport as values.

zones_dict = dict(zip(zones.LocationID, zones.borough_airport))
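You can peek at a few entries to sanity-check it - they should match the rows we saw above:

list(zones_dict.items())[:5]

[(1, 'Newark Airport'), (2, 'Queens'), (3, 'Bronx'), (4, 'Manhattan'), (5, 'Staten Island')]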

We will come back to this after we've fetched the FHV trip data from NYC Open Data.


Base Dispatch License Numbers

As mentioned, we will be filtering the API request for FHV trip data by the base license numbers.

I found the data for the base license numbers for the various ride hailing apps in a PDF file.

uber = ['B02764', 'B02879', 'B02765', 'B02598', 'B02404', 'B02870', 'B02864',
        'B02512', 'B02865', 'B02617', 'B02875', 'B02682', 'B02869', 'B02866',
        'B02865', 'B02395', 'B02876', 'B02869', 'B02882', 'B02866', 'B02877',
        'B02875', 'B02883', 'B02864', 'B02888', 'B02871', 'B02889', 'B02835',
        'B02884', 'B02836', 'B02872', 'B02887', 'B02878', 'B02867', 'B02879',
        'B02870', 'B02880']
via = ['B02800', 'B03136']
gett_juno = ['B02907', 'B02908', 'B02914', 'B03035']
lyft = ['B02510', 'B02844']

base_nums = []

for b in [uber,via,gett_juno,lyft]:
    base_nums.extend(b)

I kept the companies separate just in case I want to do further analysis at some point, and then created a list for all of them in base_nums which we will use to fetch the trip data from NYC Open Data.

As mentioned earlier, we are going to use the list base_nums as a filter so that we only fetch the trip data for the ride hailing apps.
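One note: the uber list contains a few duplicate license numbers (B02865 and B02879, for example, each appear twice). Duplicates are harmless in an IN filter, but if you want a tidier query string you can dedupe the list while preserving its order:

# dict.fromkeys keeps the first occurrence of each key, so this
# removes duplicates without reordering the list.
base_nums = list(dict.fromkeys(base_nums))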

Fetching the FHV trip data

Now we are ready to fetch the trip records from the API, which we will do with Python requests and load it into a Pandas DataFrame.

First you need an API key (an app token), which you can get from NYC Open Data.

import requests

NYC_OPEN_DATA_API_KEY='your_api_key_here'
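Hard-coding the token is fine for a quick script, but if you plan to share your code, one option is to read it from an environment variable instead (the variable name here is just an assumption):

import os

# Read the app token from the environment rather than committing it to code.
NYC_OPEN_DATA_API_KEY = os.environ['NYC_OPEN_DATA_API_KEY']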

Building a SoQL query with the IN operator

  • Note, if you google SoQL you will probably find results about Salesforce, but the Socrata Query Language is also called SoQL and that is what we are using here.

We are only going to fetch the records with base license numbers in the base_nums list we compiled earlier.

To do this we will build a SoQL query using the IN operator, which works like the IN operator in SQL, or like checking whether an item is in a list in Python.

Format the query string

First we have to format the list base_nums into a string that will create a valid API query.

def format_dispatch_base_num(base_nums):
    # Build a template like '"{}","{}",...,"{}"' with one slot per license
    # number, then fill the slots to get a comma-separated, quoted list.
    base_num_str = '"{}",' * (len(base_nums) - 1) + '"{}"'
    base_num_string = base_num_str.format(*base_nums)
    return base_num_string

dispatch_base_num = format_dispatch_base_num(base_nums)

Just pass in the base_nums list to the function, and the resulting string looks like this:

'"B02764","B02879","B02765","B02598","B02404","B02870","B02864","B02512","B02865","B02617","B02875","B02682","B02869","B02866","B02865","B02395","B02876","B02869","B02882","B02866","B02877","B02875","B02883","B02864","B02888","B02871","B02889","B02835","B02884","B02836","B02872","B02887","B02878","B02867","B02879","B02870","B02880","B02800","B03136","B02907","B02908","B02914","B03035","B02510","B02844"'

And then we will just use Python string formatting to pop that into the API request endpoint.

The column name for the taxi base numbers is dispatching_base_num.

endpoint = 'https://data.cityofnewyork.us/resource/u6nh-b56h.json?$where=dispatching_base_num IN({})&$limit=50000'.format(dispatch_base_num)

We're fetching 50,000 records in this post with $limit=50000, but feel free to tweak this to fetch the full 20 million-ish rows if you'd like!
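If you do go after the full dataset, you will want to page through the API rather than request everything at once. The SODA API supports $offset alongside $limit for exactly this. Here is a rough paging sketch, assuming base_endpoint is the URL above without the $limit parameter (adding an $order clause is also a good idea so that the pages are deterministic):

def fetch_all(base_endpoint, headers, page_size=50000):
    # Request one page at a time until a short (or empty) page signals the end.
    frames = []
    offset = 0
    while True:
        url = '{}&$limit={}&$offset={}'.format(base_endpoint, page_size, offset)
        page = requests.get(url, headers=headers).json()
        if not page:
            break
        frames.append(pd.DataFrame(page))
        if len(page) < page_size:
            break
        offset += page_size
    return pd.concat(frames, ignore_index=True)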

And now we are ready to send the request to the API and load the JSON response into a DataFrame.

headers = {'X-App-Token': NYC_OPEN_DATA_API_KEY}
result = requests.get(endpoint, headers=headers)
result.raise_for_status()  # fail loudly if the request was rejected
df = pd.DataFrame(result.json())

Look at the first few rows.

df.head()

  dispatching_base_num dolocationid         dropoff_datetime          pickup_datetime pulocationid sr_flag
0  B02395               48           2019-01-01T00:48:41.000  2019-01-01T00:15:25.000  90           NaN   
1  B02395               144          2019-01-01T01:16:07.000  2019-01-01T00:50:26.000  48           1     
2  B02395               144          2019-01-01T01:16:07.000  2019-01-01T00:51:20.000  48           2     
3  B02395               112          2019-01-01T00:23:34.000  2019-01-01T00:17:02.000  112          NaN   
4  B02395               145          2019-01-01T00:37:54.000  2019-01-01T00:28:24.000  112          NaN   

I'm going to drop any rows that have NaN values in the dispatching_base_num, dolocationid, or pulocationid columns.

df.dropna(subset=['dolocationid', 'pulocationid', 'dispatching_base_num'],inplace=True)

This is where the zones_dict from earlier comes in.

I'm going to create a couple of new columns for the pickup and dropoff locations that hold the borough or airport name, looked up in zones_dict by the corresponding location ID.

First convert the two location ID columns to type integer.

df['dolocationid'] = df['dolocationid'].astype('int')
df['pulocationid'] = df['pulocationid'].astype('int')

The new columns are named pu_name and do_name for the pickup and dropoff locations respectively, and we map the values from zones_dict to them based on the location ID.

df['do_name'] = df['dolocationid'].map(zones_dict)
df['pu_name'] = df['pulocationid'].map(zones_dict)
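One thing to watch for: map returns NaN for any location ID that is not a key in zones_dict, so it is worth a quick check for rows that didn't match before grouping:

# Rows where the pickup or dropoff ID didn't map to a borough/airport name
df[df['pu_name'].isna() | df['do_name'].isna()]

If any turn up, you can drop them the same way we dropped the other NaN rows earlier.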

Grouping the data to prepare it for the arc diagram

We're going to do a groupby on the pu_name and do_name columns and get the count in each group, which is the number of rides.

df.groupby(['pu_name', 'do_name'])['pu_name'].count()

The first few groups look like this:

pu_name            do_name          
Bronx              Bronx                8801 
                   Brooklyn             103  
                   JFK Airport          43   
                   LaGuardia Airport    64   
                   Manhattan            2177 
                   Newark Airport       2    
                   Queens               269  
                   Staten Island        1    
Brooklyn           Bronx                61   
                   Brooklyn             9030 
                   JFK Airport          235  
                   LaGuardia Airport    103  
                   Manhattan            1296 
                   Newark Airport       25   
                   Queens               821  
                   Staten Island        40   
JFK Airport        Bronx                21   
                   Brooklyn             147  
                   JFK Airport          7    
                   LaGuardia Airport    19   
                   Manhattan            272  
                   Newark Airport       3    
                   Queens               105  
                   Staten Island        1 

For each pickup-dropoff pair, we have a count of the number of rides.

Matrix graph representation

Let's use unstack to pivot the inner index do_name to columns, and just fill in any NaN values with 0.

df.groupby(['pu_name', 'do_name'])['pu_name'].count().unstack(fill_value=0)

do_name            Bronx  Brooklyn  JFK Airport  LaGuardia Airport  Manhattan  Newark Airport  Queens  Staten Island
pu_name                                                                                                             
Bronx              8801   103       43           64                 2177       2               269     1            
Brooklyn           61     9030      235          103                1296       25              821     40           
JFK Airport        21     147       7            19                 272        3               105     1            
LaGuardia Airport  17     55        4            2                  192        1               50      0            
Manhattan          1609   1007      416          390                11701      183             693     15           
Queens             178    885       246          219                942        8               5761    2            
Staten Island      1      25        3            0                  9          3               2       233  

This is our graph, represented as a matrix, that we will be visualizing as an arc diagram.

Pandas has a to_json method that we can use to easily convert the data and then write it to a file - check out the Pandas to_json documentation for the details.

We will also reset the index before generating the JSON.

Putting it all together

I used unstack just to show you the matrix, but we don't need to do that.

df_pu_counts = df.groupby(['pu_name', 'do_name'])['pu_name'].count().reset_index(name="count")
df_pu_counts.to_json(path_or_buf='taxi_data.json', orient="records")

There are a few ways to orient the JSON, which you can read about in the docs, but here we are using records to get the columns and their values in the format that we will use with D3.js.

We wrote the JSON data to a file by passing a file path to the path_or_buf argument.
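As a quick sanity check, you can read the file back with the json module:

import json

with open('taxi_data.json') as f:
    links = json.load(f)

links[:3]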

Here are a few of the JSON objects - I didn't copy and paste all of it here.

[
{"pu_name":"Bronx","do_name":"Bronx","count":8801},
{"pu_name":"Bronx","do_name":"Brooklyn","count":103},
{"pu_name":"Bronx","do_name":"JFK Airport","count":43}
...
]

These are the links in the graph - the link from pickup to dropoff and then the ride count, which will be used to set the thickness of the arcs.
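The nodes themselves are just the unique borough and airport names, which we can derive from these links in the D3 step. If you want them as an explicit list now, one quick way (purely illustrative - we won't write this to the file):

# Unique borough/airport names appearing as either a pickup or a dropoff
nodes = sorted(set(df_pu_counts['pu_name']) | set(df_pu_counts['do_name']))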

Thanks for reading!

Stay tuned for Part II where we will actually create the arc diagram in D3.js.

Let me know if you have any questions or comments either in the comments below, or on Twitter @LVNGD.
