Exploring NYC - Analysis of Crime Data in New York City

Using NYPD Complaint Data to analyze crime events in different boroughs.

Jeff, Lu Chia-Ching
9 min read · Nov 7, 2019
Photo by Mark Asthoff on Unsplash

Introduction

When I moved to New York City, I kept hearing the same advice from different people: watch out for your personal safety. Even though NYC is a much safer place than it used to be (see the 1970s “troll” tourist guide covered in ‘Welcome to Fear City’ – the inside story of New York’s civil war, 40 years on), the city still projects a fun but somewhat frightening image to locals and to people around the world.
For example, here is the total crime count by year from 1985 to 2014, showing the decline in NYC crime since the 1980s:

Data from UCR Statistics

Despite all this talk, I have never heard anyone (even native New Yorkers) cite data sources or statistics to back up their view of borough safety. As an analytics student, I had an itch to dig into the topic and find out what the data actually shows.

What is the distribution of Crime in each borough?

Since this is a broad topic that could be analyzed in various aspects, I decided to focus on my most pressing question: Do the different boroughs have significant differences in terms of crime level and crime type?

People in NYC always say that some boroughs are safer than others. However, the boroughs differ widely in population, with Brooklyn the most populous and Staten Island by far the least, so raw crime counts alone cannot tell us which borough is safer. With this in mind, I decided to explore the question using several analytical techniques.

New York City has five boroughs: Manhattan, Bronx, Brooklyn, Queens & Staten Island. Photo by nycmap360.com

Dataset Used

For this project, I used the NYPD Complaint Data from NYCOpenData. It contains the historical crime data reported by the NYPD through 2019; I extracted only the 2018 records, because the 2019 data was still incomplete. The exploratory analysis was conducted with Python and Tableau.
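The 2018 extraction could look roughly like the sketch below. It assumes the complaint start date is stored as an MM/DD/YYYY string in a column named CMPLNT_FR_DT; that column name and the tiny inline sample are assumptions for illustration, and the real file would be loaded with pd.read_csv instead.

```python
import pandas as pd

# Tiny stand-in for the complaint file (made-up rows, assumed column names).
raw = pd.DataFrame({
    "CMPLNT_FR_DT": ["12/31/2017", "01/01/2018", "07/04/2018", "01/01/2019", None],
    "BORO_NM": ["BRONX", "QUEENS", "BROOKLYN", "MANHATTAN", "QUEENS"],
})

# Parse the dates; missing or unparseable values become NaT instead of raising.
dates = pd.to_datetime(raw["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce")

# Keep only complaints whose start date falls in 2018.
df_2018 = raw[dates.dt.year == 2018].reset_index(drop=True)
print(len(df_2018))
```

Rows with a missing date are dropped by the year comparison, so no separate NaN filter is needed for this column.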

Data Pre-processing

First Look at the Data

  • number of observations: 452,997
  • number of variables: 37 (including date and time of occurrence, crime description, geographic data, suspect/victim data)

Re-categorize Crime Description to Crime Type

After taking a good look at this data and removing NaN values, I realized that the dataset is not ideal for analysis, even though it contains all kinds of variables, such as the time of each crime and even suspect data for each event. The main reason is that the crime type data is really confusing.

E.g. for a single event, the OFNS_DESC column could be “HARRASSMENT 2” while the PD_DESC column could be “HARASSMENT,SUBD 1,CIVILIAN”, which makes the data really hard to analyze.

The U.S. Department of Justice administers two statistical programs: the Uniform Crime Reporting (UCR) Program and the National Crime Victimization Survey (NCVS). However, the NYPD seems to have its own approach to recording crime events, so I couldn’t interpret the data with either established scheme. Additionally, multiple columns describe the same crime event with confusing descriptions.

Therefore, I decided to re-categorize the crime types based on these two systems and my own understanding. After this, I reduced the number of unique crime types to 21 (the original dataset had at least 59) and made them easier to understand.

# new_type_cat: list of lists of original OFNS_DESC values (one inner list per new category)
# new_type_name: list of new category names, aligned with new_type_cat
new_dic = {old_desc: str_name for old_cat, str_name in zip(new_type_cat, new_type_name) for old_desc in old_cat}
df['new_category'] = df['OFNS_DESC'].map(new_dic)
df['new_category'].nunique()  # 21
# The number of unique categories has been narrowed down to 21
# The detailed methodology and approach can be found on my GitHub
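To make the mapping step concrete, here is a small self-contained sketch of how such a dictionary is built from grouped descriptions. The groupings and labels below are hypothetical stand-ins, not the actual 21-category scheme, which lives in the GitHub repository.

```python
# Hypothetical grouping: each inner list holds original OFNS_DESC values that
# collapse into the label at the same position in new_type_name.
new_type_cat = [
    ["PETIT LARCENY", "GRAND LARCENY"],
    ["HARRASSMENT 2"],
]
new_type_name = ["Larceny_Theft", "Harassment"]

# Flatten into {original description: new category}.
new_dic = {
    old_desc: name
    for old_cat, name in zip(new_type_cat, new_type_name)
    for old_desc in old_cat
}

print(new_dic["GRAND LARCENY"])
# A description missing from the mapping returns None here; with Series.map
# it would become NaN, which makes unmapped categories easy to spot.
print(new_dic.get("ARSON"))
```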

Exploratory Analysis

Total Number of Crime: Not really what I expected

According to this visualization, Brooklyn has the highest total number of crime events, and Bronx, often perceived as a rougher area, has fewer crime events than Manhattan. This was a surprising finding for me.

Using the projected population of each borough (sourced from the Department of City Planning), I found that although Brooklyn has the most events, Manhattan and Bronx both have a higher rate of crime events per resident (Bronx has the highest, at 6,887 per 100k residents). Another interesting point is that Queens has the lowest rate of crime events (3,849 per 100k residents), even lower than Staten Island (4,269 per 100k residents).
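The per-100k figures come from dividing each borough's event count by its population and scaling by 100,000. A minimal sketch of that arithmetic, using made-up numbers rather than the actual counts or population projections:

```python
def rate_per_100k(events: int, population: int) -> float:
    """Crime events per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * 100_000 / population

# Made-up numbers purely to show the arithmetic (not the article's data).
example_events = 5_000
example_population = 125_000
print(rate_per_100k(example_events, example_population))
```

Normalizing this way is what allows a small borough and a large one to be compared on the same scale.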

But this isn’t the end. I would still like to do some more exploratory visualizations to get more insight from the data.

1. Level of Crime: Understanding the Distribution

Returning to the total count of events, I used the same visualization to look at the level of crime in each borough (New York State has three levels of offense: Violation, Misdemeanor and Felony). From the graph below, Misdemeanor, an offense for which a sentence of more than 15 days but not more than one year may be imposed, is the most common level in every borough, and it makes up a similar share in each (about 52% to 57%; Manhattan has the highest at 57.2%). The second most common is Felony, a serious offense for which the sentence can exceed one year, and the third is Violation, a lesser offense for which the sentence cannot exceed 15 days.

This gives us the following interesting information:

  1. Staten Island has a substantially lower percentage of Felony than the other 4 boroughs (which have about 31%). This could mean that Staten Island is a much more peaceful area, with not only fewer crime events overall but also less serious ones. (It also means that Violation-level crime makes up a much higher percentage in Staten Island than in the other boroughs.)
  2. To my surprise, Bronx also has a smaller percentage of Felony-level crime than Manhattan, Brooklyn and Queens. This could mean that most of the crime in Bronx is less serious, an insight that challenges the common belief about neighborhood safety there. I will examine this lead further.
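The per-borough shares discussed above can be computed with a row-normalized cross-tabulation. The sketch below uses toy records, and the column names BORO_NM and LAW_CAT_CD are assumptions about how borough and crime level are stored.

```python
import pandas as pd

# Toy records (made up); BORO_NM and LAW_CAT_CD are assumed column names.
toy = pd.DataFrame({
    "BORO_NM": ["BRONX", "BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS"],
    "LAW_CAT_CD": ["MISDEMEANOR", "MISDEMEANOR", "FELONY", "VIOLATION",
                   "FELONY", "MISDEMEANOR"],
})

# Rows = boroughs, columns = crime levels, cells = percent of that borough's events.
level_share = pd.crosstab(toy["BORO_NM"], toy["LAW_CAT_CD"], normalize="index") * 100
print(level_share.round(1))
```

Normalizing by row (normalize="index") means every borough sums to 100%, so the level mix can be compared regardless of how many events each borough has.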

Statistics: Are boroughs’ frequency of crime at each level different from each other?

To compare the distribution of crime levels across boroughs, a chi-squared test of independence can be used, with the null hypothesis that the level of a crime event is independent of the borough in which it occurs.

The test result (p-value reported as 0.00) shows a significant relationship between the variables, meaning the boroughs have different distributions of crime level. The table of standardized residuals shows that most of the observed counts differ distinctly from their expected values.

import scipy.stats as stats
import statsmodels.api as sm
# two_way_table2: borough x level-of-crime contingency table of counts
chi2, p, dof, expected = stats.chi2_contingency(observed=two_way_table2)
print('chi-square statistic :', chi2)
print('p-value :', p)
print('degrees of freedom :', dof)
# standardized residuals show which cells deviate most from independence
table2 = sm.stats.Table(two_way_table2)
table2.standardized_resids
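For readers who want to see what the library calls compute, here is a hand-rolled sketch of the expected counts, the Pearson chi-squared statistic, and the standardized (adjusted) residuals. It uses a made-up 2x3 table rather than the real borough data.

```python
import numpy as np

# Toy 2x3 contingency table (rows: boroughs, columns: crime levels); counts are made up.
observed = np.array([
    [30.0, 50.0, 20.0],
    [20.0, 50.0, 30.0],
])

n = observed.sum()
row_tot = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_tot = observed.sum(axis=0, keepdims=True)   # shape (1, 3)

# Expected counts under independence: row total * column total / grand total.
expected = row_tot @ col_tot / n

# Pearson chi-squared statistic and its degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Standardized residuals: Pearson residuals rescaled to roughly unit variance.
std_resid = (observed - expected) / np.sqrt(
    expected * (1 - row_tot / n) * (1 - col_tot / n)
)
print(chi2, dof)
print(std_resid.round(2))
```

A positive residual means a cell has more events than independence would predict, and a negative one means fewer.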

Therefore, I could further confirm a few things:

  1. Bronx does have a significantly lower share of felony-level crime events than Manhattan, Brooklyn and Queens. Most of its crime events are lesser offenses.
  2. Although Staten Island has fewer crime events overall, its proportion of violation-level crime events is significantly higher. Since this departs from the expected distribution, it could mean that some specific types of crime happen unusually often in the borough.

These scenarios would be further broken down in the crime event analysis.

2. Crime Type: Recognizing Prevalent Crime Types

Next, I looked into the crime types within each borough. I listed the top 5 crime types in each borough, and there seems to be no major difference among them. In every borough, the major crime types are Larceny_Theft, Harassment, Assault, Criminal_Mischief_Property and Offenses_against_Public_Order_Administration. The only exception is Staten Island, which has a significantly lower proportion of Larceny_Theft and Criminal_Mischief_Property.
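A top-N listing per borough can be produced with a grouped value count. The sketch below uses toy rows and takes the top 2 instead of the top 5; BORO_NM is an assumed column name, while new_category is the column created earlier.

```python
import pandas as pd

# Toy rows (made up) with an assumed borough column and the re-categorized type.
toy = pd.DataFrame({
    "BORO_NM": ["BRONX"] * 6 + ["QUEENS"] * 3,
    "new_category": ["Assault", "Assault", "Assault", "Harassment", "Harassment",
                     "Robbery", "Larceny_Theft", "Larceny_Theft", "Burglary"],
})

top_n = 2  # the article uses 5

# value_counts sorts each borough's types by count (descending);
# head(top_n) then keeps the first top_n rows within each borough.
top_types = (
    toy.groupby("BORO_NM")["new_category"]
       .value_counts()
       .groupby(level=0)
       .head(top_n)
)
print(top_types)
```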

Statistics: Are boroughs’ frequency of each crime type different from each other?

The chi-squared test on crime category (p-value reported as 0.00) shows a significant relationship between the variables, meaning the boroughs have different distributions of crime types.

I summarized the standardized residuals with a heatmap and list only the significant ones below.
P.S. A residual counts as significant if it is above 2.0 or below -2.0, and I also put the most significantly high or low category in each area in bold.

Brooklyn

  • Significantly high-frequency crime: Forgery, Burglary, Weapon Problem, Gambling, Robbery, Sex Crime & Social and Commercial-related Crime (Social_Commercial_related_Crime)
  • Significantly low-frequency crime: High-Value Theft, Assault, Frauds & Offenses against Public

Manhattan

  • Significantly high-frequency crime: High-Value Theft, Forgery, Sex Crime & Frauds
  • Significantly low-frequency crime: Harassment, Assault, Serious Assault, Arson, Property Mischief, Weapon Problem, Driving under the Influence, Motor Vehicle Theft, Offenses against Public, Robbery, Social and Commercial-related Crime & Traffic Laws Violations

Bronx

  • Significantly high-frequency crime: Serious Assault (Aggravated_Assault), Assault, Drug Problem, Weapon Problem, Harassment, Offenses against Public (Offenses_against_Public_Order_Administration) & Robbery
  • Significantly low-frequency crime: High-Value Theft (Larceny_Theft) & Burglary

Queens

  • Significantly high-frequency crime: Traffic Laws Violations, Assault, Burglary, Property Mischief, Harassment & Motor Vehicle Theft
  • Significantly low-frequency crime: Drug Problem, Weapon Problem, Forgery, Frauds, Gambling, High-Value Theft & Offenses against Public

Staten Island

  • Significantly high-frequency crime: Driving under the Influence, Property Mischief, Frauds, Harassment & Offenses against Public
  • Significantly low-frequency crime: High-Value Theft, Serious Assault, Assault, Burglary, Drug Problem, Weapon Problem, Gambling, Motor Vehicle Theft, Robbery, Sex Crime, Social and Commercial-related Crime & Traffic Laws Violations
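The ±2.0 cutoff used for the lists above can be applied mechanically to a table of standardized residuals. This is a minimal sketch with hypothetical residual values; in the actual analysis the table would come from the standardized residuals of the crime type test.

```python
import pandas as pd

# Hypothetical standardized residuals (rows: crime types, columns: boroughs).
resid = pd.DataFrame(
    {"BRONX": [3.1, -0.4, -2.6], "QUEENS": [-2.2, 1.9, 2.0]},
    index=["Robbery", "Arson", "Burglary"],
)

threshold = 2.0
flags = {}
for boro in resid.columns:
    flags[boro] = {
        # Strict comparisons: a residual of exactly 2.0 is not flagged.
        "high": resid.index[resid[boro] > threshold].tolist(),
        "low": resid.index[resid[boro] < -threshold].tolist(),
    }
    print(boro, flags[boro])
```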

Through this analysis, I have found some interesting insights:

  • Although Brooklyn has the most crime events, it has a lower-than-expected frequency of Frauds and High-Value Theft. High-Value Theft is also significantly low in Bronx, Queens and Staten Island.
  • The data from Bronx and Brooklyn shows high frequencies of a few crime categories of particular public concern, such as weapons, drugs, robbery and serious assault. This could be one reason why people have conflicting opinions about neighborhood safety in these places.
  • Manhattan’s most frequent crime types, except sex crime, are mostly not violent crimes but crimes involving personal or business property. This could be because of the high proportion of business and commercial districts in this borough.
  • Queens has a distinctly high frequency of crime related to vehicles and traffic law. This differs somewhat from neighboring boroughs such as Manhattan and Brooklyn. It could be that more Queens residents commute by car, or that Queens has more vehicle-related industrial areas than other boroughs.
  • Compared to the other boroughs, Staten Island has significantly more drunk driving and fewer of the violent crimes usually seen in areas with high population density.
Photo by Brandon Jacoby on Unsplash

Conclusion and Next Steps

From the exploratory analysis and the statistical tests, I can say that the boroughs do differ significantly from one another in both crime level and crime type.

Next, I would like to look into more of the data, such as the time of each crime and the victim and suspect data, and combine it with the findings above to explore more detailed topics (e.g. how the time of a crime correlates with the crime type or with the people involved).

Congrats, and thanks for reading! Feel free to check out my GitHub for the full code, or drop me a message at cl3883@columbia.edu

If this helped, clap as many times as you like :)
