Analyzing DataSF.org data

Since I'm a data analyst by profession, it's time for me to post what I do for my day job. Most of my work is corporate, so unfortunately I can't share those analyses here. But here's something I did in my free time: I spent a few hours looking at the city of San Francisco's public data on datasf.org. From that website, I grabbed all the San Francisco Police Department historical incident records for 2003-2012 and loaded the .csv files into R and Tableau. The raw data are logged police "incidents", or reported crimes, tagged by geo-location (cross-streets in neighborhoods but not specific addresses), time of day, category of crime, description, and resolution status. What I discovered could be titled "San Francisco Crime: urban myth vs. reality".

The first urban myth, one I've heard at Silicon Valley parties, is that "prostitution is a big problem in San Francisco". Actually, it's not. Looking at crimes by category city-wide, year after year, the most common categories are "larceny" (i.e. theft), "non-criminal", "other", and "assault". Prostitution isn't even in the top 20 crime categories by frequency. Larceny remains the most frequent crime every year. Vehicle theft dropped from the 3rd most frequent category back in 2003 to barely in the top 10 now. Overall crime has trended down (approx. 3%/year). Diving deeper, we see that the top crime months tend to be January, March, August and October, and the top crime days of the week are Fridays and Wednesdays.
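For readers who want to reproduce this kind of ranking, here is a minimal sketch in Python (not the R/Tableau workflow used for this post). The column names "Category" and "Date", the MM/DD/YYYY date format, and the sample rows are all assumptions made for illustration; check them against the actual .csv export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, invented for illustration; not real incident data.
# The column names and date format are assumptions about the export.
SAMPLE_CSV = """Category,Date
LARCENY/THEFT,01/05/2003
LARCENY/THEFT,03/12/2003
ASSAULT,03/15/2003
VEHICLE THEFT,08/02/2003
LARCENY/THEFT,10/20/2004
ASSAULT,01/09/2004
"""


def rank_categories(rows):
    """Return (category, count) pairs, most frequent first."""
    counts = Counter(row["Category"] for row in rows)
    return counts.most_common()


def counts_by_year(rows):
    """Return {year: Counter of categories}, reading the year from MM/DD/YYYY."""
    by_year = {}
    for row in rows:
        year = int(row["Date"].split("/")[2])
        by_year.setdefault(year, Counter())[row["Category"]] += 1
    return by_year


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ranking = rank_categories(rows)
yearly = counts_by_year(rows)

for category, count in ranking:
    print(category, count)
```

Swapping `io.StringIO(SAMPLE_CSV)` for an open file handle would run the same counts over a real download, assuming the columns match.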


Another urban myth I've heard is that "homicide happens more often in the wee hours of the night and morning than when regular people are out". Untrue. For this one I downloaded homicide incidents from the entire Bay Area (including Oakland) for the last 6 months. The data show that homicide happens at all hours, especially around 11am and in the early evening, 6-8pm. It almost looks like murder happens the most just before lunch and dinner!
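The hour-of-day view comes down to binning incident times into 24 buckets. A minimal Python sketch, assuming times arrive as "HH:MM" strings; the sample values are invented for illustration and are not the Bay Area data:

```python
from collections import Counter

# Hypothetical "HH:MM" timestamps, invented for illustration only.
sample_times = ["11:05", "11:40", "18:15", "19:30", "02:10", "11:59"]


def incidents_per_hour(times):
    """Count incidents in each hour of the day (index 0-23)."""
    counts = Counter(int(t.split(":")[0]) for t in times)
    return [counts.get(hour, 0) for hour in range(24)]


hourly = incidents_per_hour(sample_times)

# Hour with the most incidents (the earliest such hour if there is a tie).
peak_hour = max(range(24), key=lambda hour: hourly[hour])
print(peak_hour, hourly[peak_hour])
```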


What about neighborhoods, you ask? Here, it seems, some of the rumours you hear are true. While city-wide larceny is the main crime, once we delve into neighborhoods, we see distinct crime personalities. Most dramatic is the high proportion of "drug" crimes in the Tenderloin. Carjacking is proportionally high in Ingleside. But the Rincon Tower of crime here is larceny (or theft) in South of Market. There's almost twice as much theft happening in the Southern Police District (which includes the Ferry Building, Giants Ballpark, Caltrain station, and Folsom/11th night club area) as in any other neighborhood. While Bayview, the notoriously "bad gangs" neighborhood, has almost as much violent crime (e.g. assault) as larceny, the astute eye will see that Mission and Southern actually have, by quantity, more assault than Bayview. The chart below is split into two parts: the upper shows crime category profiles per neighborhood over all years, and the lower shows per-year frequencies of crimes per neighborhood. Overall, Southern has stood tall in larceny all this time, while drugs in the 'loin peaked around 2008. It should be noted that police reporting districts are close to, but don't exactly correspond to, the common names for local neighborhoods.


Southern has the most crime, closely followed by the Mission, but the Tenderloin, followed by the Mission, has the highest resolution rate. This means that if you report a crime in the Mission, you are more likely to get a police resolution than if you report a crime in Richmond, for example. Maybe that's because drug incidents are more easily "booked" and resolved than other types of crimes?
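A resolution rate here is just the share of incidents in a district whose resolution status is anything other than "unresolved". The sketch below is one way to compute it in Python. It assumes unresolved incidents are marked "NONE"; that marker, the other status strings, and the sample records are assumptions for illustration, not values taken from the analysis.

```python
from collections import defaultdict

# Hypothetical (district, resolution) pairs, invented for illustration.
incidents = [
    ("TENDERLOIN", "ARREST, BOOKED"),
    ("TENDERLOIN", "ARREST, BOOKED"),
    ("TENDERLOIN", "NONE"),
    ("RICHMOND", "NONE"),
    ("RICHMOND", "NONE"),
    ("RICHMOND", "ARREST, CITED"),
    ("RICHMOND", "NONE"),
]


def resolution_rates(records, unresolved="NONE"):
    """Return {district: fraction of incidents with a resolution}."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for district, resolution in records:
        totals[district] += 1
        if resolution != unresolved:
            resolved[district] += 1
    return {district: resolved[district] / totals[district] for district in totals}


rates = resolution_rates(incidents)
for district, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{district}: {rate:.0%}")
```

Computing the same rate per crime category within a district would be one way to test the guess about drug incidents.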


Next, what's interesting is to look at correlations between types of crime. In the chart below, the size and darkness of the circles indicate high positive correlation, meaning those two crimes tend to happen together at the same times and places. Size and redness of the circles indicate high negative correlation, meaning they usually didn't happen at the same time or place. The darkest, biggest circles are on the diagonal, since anything has a correlation of 1 with itself. The graph is symmetric, so you only need to look above or below the diagonal. Kidnapping and weapons appear highly correlated. Maybe that's expected? How about recovered vehicle with weapons and arson? Does it make sense that drugs are negatively correlated with runaways and vehicle theft? Maybe that's because runaways and car thieves don't go to the Tenderloin? Prostitution and pornography appear to be focused, connected crimes. "Forcible sex", i.e. rape, is correlated with assault, robbery, kidnapping, stolen property and trespass. Some crimes seem more broadly correlated with lots of other crimes. It's important to remember at this point that correlation does not imply causation.
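To make the idea concrete: if each crime category is counted per time-and-place bin, the correlation between two categories is the Pearson coefficient of their two count vectors. The post does not say exactly how the chart was produced, so the following Python is only a sketch of that general technique. The bins, category names, and counts are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical incident counts per (district, hour) bin, invented for illustration.
counts = {
    "DRUG/NARCOTIC": [9, 7, 5, 2, 1],
    "WARRANTS": [8, 6, 5, 3, 1],
    "VEHICLE THEFT": [1, 2, 4, 6, 8],
}


def correlation_matrix(series):
    """Return {(a, b): r} for every ordered pair of categories."""
    names = sorted(series)
    return {(a, b): pearson(series[a], series[b]) for a in names for b in names}


matrix = correlation_matrix(counts)
for (a, b), r in matrix.items():
    if a < b:  # the matrix is symmetric, so print each pair once
        print(f"{a} vs {b}: {r:+.2f}")
```

With these made-up counts, the first two categories rise and fall together while the third moves the opposite way, which is the pattern the dark and red circles encode.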


The next thing to do is plot crimes city-wide by time of day. This should show us crimes that are closely related by frequency and time but not necessarily location. We'd expect crimes that can travel to show up here. Indeed, the pair "warrants and drugs" that we saw in the correlations graph jumps out again here. "Larceny and vehicle theft" is a pair we didn't see earlier, though.

One visualization trick I've learned is to make a grid of pairwise X-Y line charts and look for straight lines; those pairs are likely to be fruitful regression variables. Looking at the grid of pairs, we can pick out the pairs "warrants and drugs" and "larceny and vehicle theft" that we found above. In addition, "larceny" and "warrants" look related to the largest number of other crimes. Running step-wise regression on this dataset would be a good way to pick out even deeper patterns that our naked eyes can't see.
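A numeric stand-in for "look for straight lines" is to fit a one-variable least-squares line to each pair and check R squared; values near 1 mean the points sit close to a line. The Python below is a sketch of that simpler check, not of step-wise regression, and the hourly counts are hypothetical numbers chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("y values are constant; R squared is undefined")
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical hourly counts, invented for illustration (perfectly linear on purpose).
larceny = [10, 20, 30, 40, 50]
vehicle_theft = [3, 5, 7, 9, 11]

slope, intercept, r2 = fit_line(larceny, vehicle_theft)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.3f}")
```

Running `fit_line` over every pair of categories and sorting by R squared would give a ranked list of the pairs worth a closer look.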


The next step would be to take some of these findings to the Police Department, find out what the field experts say, and ask whether knowing such things could help guide where the police focus their presence.

The step beyond that is to find out how San Francisco's crime profile compares to those of other large cities. I suspect there will be overall trends in common as well as distinct differences city-by-city, as we saw neighborhood-by-neighborhood.