Accessing NYC Open Data with Python + the Socrata Open Data API

nyctaxisecond.png

If the walls in NYC could talk, they would likely tell you a similar story as one you can glean from 311 complaints.

This story can be found on NYC Open Data where you can find datasets for everything from 311 complaints, to taxi information, to subway entrance locations, to motor vehicle collisions, and much more.


Shameless plug here for my semi-neglected side project called What's Wrong With My Building that uses a bunch of NYC Open Data sources.


You can either export a dataset as a file, or you can access it through the Socrata Open Data API, or SODA.

Data is available in a variety of formats like CSV, XML, JSON, as well as GeoJSON and Shapefiles for map data.

You can use SODA to access a lot of other open data sources as well, not just for NYC!

Search for them here.

For this demo we will look at the data for 311 complaints in NYC.

Setup

First you need an App token, which you can get by registering a profile here.

The SODA API does not throttle requests that are made with an App token, but if they decide you're being a greedy jerk, they will throttle you and you will start getting HTTP 429 response errors.


We will be using Python and the requests library to make the API requests.

Let's create a new virtual environment for the project and install requests in it.

mkvirtualenv opendataenv
pip install requests

311 complaints

You can find the data for 311 complaints from 2010 to the present here.

Visit this page and click 'API'.

311datasetclickapi.png

You can view the API docs for this dataset, and also get the API endpoint link - note the endpoint link because we will need it next.

In a new file, go ahead and import the requests library, as well as your app token and the endpoint link.

import requests
NYC_OPEN_DATA_API_KEY = 'xxxxxxxxxxx'
endpoint = 'https://data.cityofnewyork.us/resource/erm2-nwe9.json'

As always, I would caution everyone to keep their App tokens(or any api keys) in some kind of configuration file or as an environment variable outside of your program. I like Python-decouple for this.

Now we can make a basic GET request for the data, adding the App token to the request as a header.

headers={'X-App-Token': NYC_OPEN_DATA_API_KEY}
result = requests.get(endpoint, headers=headers)

Notice at the end of the API endpoint, I've used the .json extension, so the returned results will be in JSON format.

The requests library has a built-in JSON decoder, so you can simply call result.json() to access the data.

Let's look at the first row.

result.json()[0]

{'unique_key': '45451683',
 'created_date': '2020-01-23T01:59:26.000',
 'agency': 'NYPD',
 'agency_name': 'New York City Police Department',
 'complaint_type': 'Noise - Residential',
 'descriptor': 'Banging/Pounding',
 'location_type': 'Residential Building/House',
 'incident_zip': '10470',
 'incident_address': 'XXX EAST  XXX STREET',
 'street_name': 'EAST  XXX STREET',
 'cross_street_1': 'XXX AVENUE',
 'cross_street_2': 'XXX AVENUE',
 'intersection_street_1': 'XXX AVENUE',
 'intersection_street_2': 'XXX AVENUE',
 'city': 'BRONX',
 'landmark': 'EAST  XXX STREET',
 'status': 'In Progress',
 'community_board': '12 BRONX',
 'bbl': '2033820050',
 'borough': 'BRONX',
 'x_coordinate_state_plane': '1021274',
 'y_coordinate_state_plane': '266111',
 'open_data_channel_type': 'MOBILE',
 'park_facility_name': 'Unspecified',
 'park_borough': 'BRONX',
 'latitude': '40.89700924172494',
 'longitude': '-73.86607891380385',
 'location': {'latitude': '40.89700924172494',
  'longitude': '-73.86607891380385',
  'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
 ':@computed_region_efsh_h5xi': '11608',
 ':@computed_region_f5dn_yrer': '29',
 ':@computed_region_yeji_bk3q': '5',
 ':@computed_region_92fq_4b7q': '40',
 ':@computed_region_sbqj_enih': '30'}

Note: I've edited this data a bit - specifically street addresses and cross streets - just to make it a bit less easily identifiable.

In the docs you can find information about each of these fields and which data type each field is.

For example, the created_date field is a floating timestamp, and the agency_name field is a text field.

If you wanted to get this data in another format, such as XML, you could change the ending of the endpoint to '.xml'.

endpoint = 'https://data.cityofnewyork.us/resource/erm2-nwe9.xml'

And then parse the response in your preferred way.


Filtering data

You can filter data either by using Simple Filters or SoQL Queries.

Simple equality filters

To perform a simple equality filter, you just use the field name and value you want to filter by.

endpoint = 'https://data.cityofnewyork.us/resource/erm2-nwe9.json?complaint_type={}'.format("HEAT/HOT WATER")

The part after the '?' is the filter:

complaint_type=HEAT/HOT WATER

Here I am querying for complaint_type and am looking for complaints about heat and hot water.

Look at the data to see what the value looks like for whatever you're trying to filter by, so that you can match it correctly.

SoQL Queries

Socrata Query Language, or SoQL, is similar to SQL and allows you to create a query using parameters such as $select, $where, $order, $group, $having and others that you might be familiar with if you have used SQL.

The $q parameter can be used to perform a full text search for a value.

Check out the docs for more.

A couple of other parameters available are $limit and $offset which we will use in the next section on pagination.

Heat and hot water complaints in the last five days

If I wanted to filter for heat and hot water complaints that were created within the past five days, I would use a SoQL query.

from datetime import datetime,timedelta

five_days_ago = datetime.date(datetime.now()) - timedelta(5)

endpoint = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$select=created_date,complaint_type,city,status&$where=created_date>='{}'&complaint_type={}".format(str(five_days_ago),"HEAT/HOT WATER")

The SoQL query is the part after the '?' in the URL.

$select=created_date,complaint_type,city,status&$where=created_date>='2020-01-20'&complaint_type=HEAT/HOT WATER

I used the timedelta function in Python's datetime module to find the date for 5 days earlier than datetime.now().

The $select clause tells it to return only the columns for created_date,complaint_type,city and status.

If you leave out the $select clause, it will just return all of the columns.

This $where clause tells it to return rows if the date is greater than or equal to January 20, 2020. I'm writing this on January 25th, so that is five days prior, as well as the complaint_type of heat and hot water.

Here is part of the result:

[...
{'created_date': '2020-01-20T17:23:19.000',
  'complaint_type': 'HEAT/HOT WATER',
  'city': 'BROOKLYN',
  'status': 'Closed'},
 {'created_date': '2020-01-20T17:23:56.000',
  'complaint_type': 'HEAT/HOT WATER',
  'city': 'BRONX',
  'status': 'Closed'},
 {'created_date': '2020-01-20T17:23:58.000',
  'complaint_type': 'HEAT/HOT WATER',
  'city': 'BROOKLYN',
  'status': 'Open'}]

Notice that I have the single quotes around the date in the URL. I used string formatting, but I need the actual quote marks in there as well.

If you try the request without the quotes around the date, you will get a 400 error for 'Invalid SoQL query'.

Pagination

I mentioned the $limit and $offset parameters previously, and we will use them here for pagination.

The $limit defaults to 1000 results.

For the query that we just did, we got back 1000 results, but that's not all of the results.

There used to be a limit of 50,000 records for the $limit clause, but as of version 2.1 of the API there is no limit and they advise you to pick a limit that is appropriate for your internet connection so that the request will not time out.

From the docs, when paging through data, you want to use the $order clause, which guarantees that the order of your results will remain stable as you page through. Otherwise the order of query results are not ordered in a particular way.

If you don't care about the order otherwise, you could just use $order=:id.

To get the next 1000 results, you will use an $offset of 1000, and as you page through, increase the offset by 1000 with each page you request.

$select=created_date,complaint_type,city,status&$where=created_date>='2020-01-20'&complaint_type=HEAT/HOT WATER&$order=:id&$offset=1000

The offset clause just lets the API know that you've already seen that number of results, and to return the next results starting after that.

You can read more about paging through data here.


GIS data

Lastly, I just wanted to mention that GeoJSON is another one of the formats that you can get back from the API.

I won't go into that here, but you can read the docs for more.

Thanks for reading

Let me know if you've used open data(NYC or otherwise) in any of your projects so I can check them out!

And of course, if you have any questions or comments, write them below or reach out to me on Twitter @LVNGD.

blog comments powered by Disqus

Recent Posts

mortonzcurve.png
Computing Morton Codes with a WebGPU Compute Shader
May 29, 2024

Starting out with general purpose computing on the GPU, we are going to write a WebGPU compute shader to compute Morton Codes from an array of 3-D coordinates. This is the first step to detecting collisions between pairs of points.

Read More
webgpuCollide.png
WebGPU: Building a Particle Simulation with Collision Detection
May 13, 2024

In this post, I am dipping my toes into the world of compute shaders in WebGPU. This is the first of a series on building a particle simulation with collision detection using the GPU.

Read More
abstract_tree.png
Solving the Lowest Common Ancestor Problem in Python
May 9, 2023

Finding the Lowest Common Ancestor of a pair of nodes in a tree can be helpful in a variety of problems in areas such as information retrieval, where it is used with suffix trees for string matching. Read on for the basics of this in Python.

Read More
Get the latest posts as soon as they come out!