I've been working on a project like this for some time now - and wrestling with whether to go with a community-based or a closed-source model.
The problems listed below are pretty exact: huge data sets, lots of cleaning and normalizing, and the snail mail/CD problem is real. Additionally, I'd note that ~40% of the states [somehow] charge for the data...it takes six figures to get a snapshot of all 50 states - and certain states (looking at you, FL) say that they do not store historical data, meaning you have to connect with the local BoEs to aggregate the data.
A part of me [now] wants to open source this because of the DNC's actions.
Not to pollute the thread, but I'm also really interested in this. I've worked at both the Clinton campaign and NGP VAN, and it seems like a very worthwhile pursuit. If you're adding people, my email is in my profile.
I've also wanted to get involved in an open voterfile project like yours for a very long time. I'd love to connect as well. My email's my username at gmail, if you're interested.
If you open it up with addresses/phone/email-addresses, beware that the main users may be commercial marketers (i.e. junk mail senders), not campaigns. Also note that many states license the data with a restriction that it only be used for election purposes.
If I define open as "your campaign would have to register and be verified", then it abides by the state/fed rules for these datasets. I can't just throw the data on github.
It's not obvious to me how open access to this data would be intrinsically bad, but I suppose that relies on the assumption that it's equally available to all parties (which may not, or definitely would not, be the case).
Any pointers to the statutory/regulatory guidelines would be appreciated.
The first question I ask people when talking about this project is, "Do you know your voter information is public?" About 85% are shocked and horrified that this information is available. Outside of a campaign, it is hard to say if the public would support it.
I've chatted with two different lawyers over here in Ohio...and both have advised strict security and election/campaign use only.
a good overview, yes. For example, NY says "Election purposes only", but fails to mention that each infraction is a misdemeanor. And, Ohio is wrong. Ohio is campaign/election use only, also with the misdemeanor kicker.
Not technically difficult but incredibly tedious. First, you have to go out and collect it from all 50 Secretaries of State, and in some cases county officials. Some states send you the data on a CD (no joke). You then have to clean the data, which is often not in great shape, and then normalize it.
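To make the "clean, then normalize" step concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the state codes "AA" and "BB", the source column names, and the target schema are made up, and real state files each have their own layouts and quirks.

```python
import csv
import io

# Hypothetical per-state column mappings (source column -> common field).
# Real state exports use their own, differing layouts.
COLUMN_MAPS = {
    "AA": {"ID": "voter_id", "LNAME": "last_name", "FNAME": "first_name", "ZIP": "zip"},
    "BB": {"VoterNo": "voter_id", "Surname": "last_name", "Given": "first_name", "Zip9": "zip"},
}


def normalize_row(state, row):
    """Map one raw row onto the common schema and clean the values."""
    out = {"state": state}
    for src, dst in COLUMN_MAPS[state].items():
        # Missing columns become empty strings; stray whitespace is trimmed.
        out[dst] = (row.get(src) or "").strip()
    # Normalize case so the same person matches across files.
    out["last_name"] = out["last_name"].upper()
    out["first_name"] = out["first_name"].upper()
    # Keep only the 5-digit ZIP, dropping any ZIP+4 suffix.
    out["zip"] = out["zip"][:5]
    return out


def normalize_file(state, text, delimiter=","):
    """Parse one state's delimited export and return normalized records."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [normalize_row(state, row) for row in reader]
```

The tedious part in practice is writing and maintaining one mapping per state (and sometimes per county), not the code itself.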
Even then, you only have a snapshot, because the states typically don't keep historical data. What this means is that your dataset won't be as good as that of someone who's been collecting this data for years, and who thereby knows things you won't, like where someone used to live, how often they voted there, who recently dropped off the registered voter rolls, etc.
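A rough sketch of why keeping your own history matters: with two snapshots in hand you can derive who dropped off the rolls and who moved, which a single snapshot can't tell you. The record shape (a dict keyed by voter ID with an "address" field) is an assumption for illustration, not any state's actual format.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict of voter_id -> record dict.

    Returns the IDs that disappeared, the IDs that are new, and the IDs
    whose address changed between the two snapshots.
    """
    old_ids, new_ids = set(old), set(new)
    dropped = sorted(old_ids - new_ids)
    added = sorted(new_ids - old_ids)
    moved = sorted(
        vid for vid in old_ids & new_ids
        if old[vid]["address"] != new[vid]["address"]
    )
    return {"dropped": dropped, "added": added, "moved": moved}
```

Someone who has been archiving snapshots for years can run this kind of comparison across every pair of them; someone starting today has nothing to compare against.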
In this case, even this data wouldn't be enough, because the Sanders team had made likely hundreds of thousands of contacts with voters, and recorded what issues they cared about and who they planned to vote for. This data, which they personally collected, is now inaccessible to them.
Except it wasn't just a list of voters. It also included "client scores". That means the Sanders campaign had access to modelling information regarding the Clinton campaign's list. It is pretty valuable knowing how the other campaign values and/or targets specific voters and that is something that obviously can't be found from public info.
My understanding is that the DNC contracts with VAN to manage the voter files for all fifty states. It's a shared database, with candidates able to build up their own data on top. All the campaigns can see the underlying voter data, but the additions they make are private to the individual campaign. The Sanders campaign staffers realized they were able to see Clinton campaign data they should not have access to. That's all that happened. The problem was fixed within a few hours. The Sanders people didn't abuse the bug in any significant way, in fact they reported it. But then the DNC cut off Sanders campaign access completely -- the nation-wide voter file, all their additions, inaccessible. At this point, the DNC response seems more punitive than security-related.
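As a toy illustration of the "shared base, private additions" model described above, here is a sketch in Python. It is not how VAN is actually built; the class, method names, and data shapes are all hypothetical. The point is only that every read of private data has to be scoped to the requesting campaign, and a bug in that scoping is what exposes one campaign's additions to another.

```python
class SharedVoterFile:
    """Toy model: one shared base file plus per-campaign private notes."""

    def __init__(self, base_records):
        self._base = dict(base_records)  # voter_id -> record, visible to all
        self._private = {}               # campaign -> {voter_id: [notes]}

    def add_note(self, campaign, voter_id, note):
        """Attach a private note to a voter on behalf of one campaign."""
        if voter_id not in self._base:
            raise KeyError(voter_id)
        notes = self._private.setdefault(campaign, {})
        notes.setdefault(voter_id, []).append(note)

    def view(self, campaign, voter_id):
        """Return the shared record plus only this campaign's own notes."""
        record = dict(self._base[voter_id])
        own = self._private.get(campaign, {})
        record["notes"] = list(own.get(voter_id, []))
        return record
```

In this sketch, two campaigns looking up the same voter get the same base record but never each other's notes, because `view` only ever reads the caller's own slice of `_private`.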
"The Sanders people didn't abuse the bug in any significant way"
It isn't clear if this is true.
"in fact they reported it"
VAN has not stated the issue was reported by the Sanders campaign. The claims that the Sanders campaign had reported an earlier issue are refuted in the OP, which states they had reported issues with another vendor's software.
It is possible the bug was abused:
"The database logs created by NGP VAN show that four accounts associated with the Sanders team took advantage of the Wednesday morning breach. Staffers conducted searches that would be especially advantageous to the campaign, including lists of its likeliest supporters in 10 early voting states, including Iowa and New Hampshire. Campaigns rent access to a master file of DNC voter information from the party, and update the files with their own data culled from field work and other investments.
After one Sanders account gained access to the Clinton data, the audits show, that user began sharing permissions with other Sanders users. The staffers who secured access to the Clinton data included Uretsky and his deputy, Russell Drapkin. The two other usernames that viewed Clinton information were "talani" and "csmith_bernie," created by Uretsky's account after the breach began.
The logs show that the Vermont senator’s team created at least 24 lists during the 40-minute breach, which started at 10:40 a.m., and saved those lists to their personal folders. The Sanders searches included New Hampshire lists related to likely voters, "HFA Turnout 60-100" and "HFA Support 50-100," that were conducted and saved by Uretsky. Drapkin's account searched for and saved lists including less likely Clinton voters, "HFA Support <30" in Iowa, and "HFA Turnout 30-70" in New Hampshire.
Despite audit logs, Weaver said at the news conference that NGP VAN has told the campaign that no Clinton data was printed or downloaded."
A fresh account, commenting on a new, extremely controversial issue should probably disclose affiliations before getting too embroiled in arguing interpretations and facts.
It seems unlikely that someone so disconnected from society as to be completely uninvolved and unopinionated on the topic at hand would have created an account for the specific purpose of commenting on the story.
We are social creatures; we have a deep need to align ourselves with groups of others. And the evidence points to our having a deep need to argue as well.
Hmm, that's much more specific, and seems to go beyond simply establishing that there's a permissions problem and they can see data they shouldn't. I still think cutting off the Sanders campaign from all their data, even after the bug was fixed, is over the top. Perhaps the staffers did act inappropriately or aggressively, but deal with them, and let the campaign continue with its daily business.
The Sanders campaign had in October reported an unrelated software issue in a non-VAN system.
They did not report THIS issue to the DNC or NGP VAN, but claim they were gathering information about the breach for the purposes of reporting. Based on my reading of the OP, the breach was discovered by NGP VAN employees.
Given that every search, export, and page access is logged, and that the Sanders people knew everything they did would be traceable, do you think they really would have tried to hoover up a bunch of Clinton data? The VAN CEO is emphatic that none of the exposed data was ever exportable.
My read on this is that very few people involved in the conversation have a very good understanding of technology and gp's speculation matches my own understanding.
The DNC/NGP voter file, however, is significantly enhanced - for one, it's got a lot of phone numbers, which most states don't include in their lists. There's a lot of other survey and consumer data associated.
For a national voter file that's good enough to use for, say, a City Council race anywhere in the country, where all you want to know is "who is likely to vote in this non-presidential election", it's doable but very expensive. For anything serious, it's pretty much outside the capabilities of anyone but the parties and some of the very big SuperPACs/very big orgs.
As I understand it, the data set contains proprietary information from the campaigns using it, so it is impossible to reconstruct it from any public source, or really any source unless the campaign has retained separate copies of all the data (which is probably impractical.)