Hacker News
by policepost 1074 days ago
The paper covers this quite explicitly in sections 3 and 4.1.

They mention that they have some measures to "compensate for non-representative sampling in the dataset":

> we are reweighting the Nexar data sample, which is sampled from a non-representative set of locations, so that it matches the locations where different demographic groups actually live. For example, to calculate the police deployment levels that Asian residents of New York City experience, we reweight the original data sample to upsample neighborhoods with larger Asian populations.

and

> For example, if vehicles are prohibited from driving near protest areas, which also have larger police presences, we will not have images of large police presences near protests. It is not possible to correct for this bias with the data we have because 1) the true distribution may differ from the Nexar sampling distribution along unobservable dimensions which we cannot reweight along and 2) we may simply have no Nexar images in some regions of the true distribution (e.g. if all vehicles are banned near protests). A second potential bias is that police vehicles represent only a subset of overall police activity: for example, they do not capture officers on foot. We return to both these points below.

Not sure whether this is what you meant by "explicitly", but I guess the answer is that they didn't correct for it.

> Before describing the details of the framework, the high-level intuition is that we are reweighting the Nexar data sample, which is sampled from a non-representative set of locations, so that it matches the locations where different demographic groups actually live. For example, to calculate the police deployment levels that Asian residents of New York City experience, we reweight the original data sample to upsample neighborhoods with larger Asian populations.

> Overall, our estimation procedure compensates for two types of potential bias. Equation 2 compensates for a data bias, reweighting the Nexar dataset (which is sampled from a set of locations which does not necessarily match the population distribution; Figure 1) to match the population distribution of demographic subgroups. This is conceptually similar to inverse propensity weighting procedures [4] which are used to compensate for non-representative data in other settings. Equation 3 compensates for imperfect model performance, and allows us to check that model performance is unbiased (i.e., calibrated) across demographic subgroups.

Section 4.1 goes into the math they use to correct for the non-representative sampling in the data set.
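
To make the reweighting intuition from the quote concrete, here is a toy sketch in Python. It is not the paper's Equation 2, just the basic idea under assumptions of my own: the function name `exposure_by_group`, the per-area dictionaries, and every number below are made up for illustration.

```python
def exposure_by_group(police_images, total_images, population):
    """Toy reweighting sketch (hypothetical, not the paper's estimator).

    police_images: dict area -> number of images showing a police vehicle
    total_images:  dict area -> total number of images sampled in that area
    population:    dict area -> dict group -> residents of that group
    Returns dict group -> police-image rate averaged over where the group lives.
    """
    # Per-area rate of police sightings. Areas with no images are skipped:
    # there is nothing to reweight there.
    rates = {
        area: police_images.get(area, 0) / n
        for area, n in total_images.items()
        if n > 0
    }

    groups = sorted({g for counts in population.values() for g in counts})
    result = {}
    for group in groups:
        # Residents of this group in each area that has image data.
        residents = {
            area: population[area].get(group, 0)
            for area in rates
            if area in population
        }
        total = sum(residents.values())
        if total == 0:
            continue
        # Weight each area's rate by the share of the group living there,
        # not by how many images happened to be sampled there.
        result[group] = sum(rates[a] * n / total for a, n in residents.items())
    return result


# Made-up numbers: area A is heavily oversampled compared to area B.
police = {"A": 10, "B": 2}
images = {"A": 1000, "B": 100}
people = {"A": {"x": 100, "y": 300}, "B": {"x": 300, "y": 100}}

print(exposure_by_group(police, images, people))
```

With these numbers, group x (mostly living in B) comes out at about 0.0175 and group y at about 0.0125, while naively pooling all images gives 12/1100, roughly 0.011, because the oversampled area A dominates. Note that an area with zero images just drops out of the calculation, which is the case the authors say they cannot correct for.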

Section 3.2 describes the data set and how it is geographically distributed.

> Data was provided to us by Nexar in two phases. Phase 1 consists of 3,987,835 images sampled prior to September 1 2020, and is extremely geographically and temporally skewed. Geographically, it is concentrated within the boroughs of Manhattan and Brooklyn, and does not contain data from the boroughs of Staten Island, Queens, and the Bronx at all; temporally, it overrepresents data from Thursday nights. Phase 2, which constitutes the majority of the dataset, consists of 20,816,019 images sampled after October 4 2020, and is much more geographically and temporally representative: it is sampled at all times of the day, on all days of the week, and also covers the entire geographic area of New York City.

> Because Phase 2 is much more representative than Phase 1, we conduct our primary analysis of disparities using only data from Phase 2. We additionally conduct numerous validations and bias corrections, described in §4.1, to compensate for non-representative sampling in the dataset. Geographic and temporal coverage during the Phase 2 period is very good. Specifically, 100% of hours during the Phase 2 period are covered; 99.6% of Census Block Groups (CBGs) have at least one image, with a mean of 168.2 images per CBG; 88% of roads contained within the borders of New York City are covered by at least one image, using data from OSMNX [6]. Figure 1 summarizes geographic data availability; Figure S1 summarizes temporal data availability.