by sethbannon 3838 days ago
For those who are not familiar with the space, campaigns typically use voter contact software to record the results of the conversations they have with potential voters on the phones, at the doors, and over the Internet. In this case, the voter contact software that both the Clinton and Sanders campaigns were using, NGP VAN, had a bug that allowed both campaigns to access each other's private, proprietary data (in this case, I believe, modeling data).

The Data Director on the Sanders campaign discovered the error and (he claims) was verifying and documenting the bug, which was then reported to the Democratic National Committee (DNC) and NGP VAN. The DNC claims these actions were not in good faith, and as a reaction cut the Sanders campaign off from the system.

This is a BIG deal for a campaign, so close to the first elections. Campaigns rely on that data to inform nearly everything they do, and rely on access to such tools to conduct their voter outreach program. Being cut off from the system is crippling for a campaign, likely why the Sanders campaign so quickly sued to get its access reinstated [1].

[1] - http://www.politico.com/story/2015/12/sanders-campaign-threa...

edit: typos


For an alternative perspective:

"The database logs created by NGP VAN show that four accounts associated with the Sanders team took advantage of the Wednesday morning breach. Staffers conducted searches that would be especially advantageous to the campaign, including lists of its likeliest supporters in 10 early voting states, including Iowa and New Hampshire. Campaigns rent access to a master file of DNC voter information from the party, and update the files with their own data culled from field work and other investments. After one Sanders account gained access to the Clinton data, the audits show, that user began sharing permissions with other Sanders users. The staffers who secured access to the Clinton data included Uretsky and his deputy, Russell Drapkin. The two other usernames that viewed Clinton information were “talani" and "csmith_bernie," created by Uretsky's account after the breach began. The logs show that the Vermont senator’s team created at least 24 lists during the 40-minute breach, which started at 10:40 a.m., and saved those lists to their personal folders. The Sanders searches included New Hampshire lists related to likely voters, "HFA Turnout 60-100" and "HFA Support 50-100," that were conducted and saved by Uretsky. Drapkin's account searched for and saved lists including less likely Clinton voters, "HFA Support <30" in Iowa, and "HFA Turnout 30-70"' in New Hampshire. Despite audit logs, Weaver said at the news conference that NGP VAN has told the campaign that no Clinton data was printed or downloaded."

http://www.bloomberg.com/politics/articles/2015-12-18/sander...

"Saving the list" entails creating a copy of the list on the VAN servers (technically, creating an SQL query). It does not mean copying any of the data locally where it could be kept.

It demonstrates the ability of the Sanders campaign to access the Clinton data without actually having the ability to use it once the breach was sealed, which, like the previous breach, it would inevitably be.

It's like making a copy of the personnel files left in the mailroom and sticking them in your mailbox. Lets you demonstrate they got left out in case VAN tries to say the breach wasn't serious.
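
The distinction drawn above, between saving a list on the server and copying data locally, can be sketched roughly as follows. This is only an illustration of the commenter's description, not how VAN is actually implemented; `SavedList`, `Server`, and the `access_open` flag are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SavedList:
    """A saved list modeled as a stored filter, not a copy of the rows."""
    name: str
    min_score: int
    max_score: int


@dataclass
class Server:
    # voter_id -> score; in this sketch the scores exist only on the server
    scores: Dict[int, int]
    # hypothetical switch: once False, saved filters can no longer be run
    access_open: bool = True
    # user -> saved lists in that user's folder
    folders: Dict[str, List[SavedList]] = field(default_factory=dict)

    def save_list(self, user: str, saved: SavedList) -> None:
        # Saving stores only the filter definition in the user's folder.
        self.folders.setdefault(user, []).append(saved)

    def run_list(self, user: str, name: str) -> List[int]:
        # The filter is re-evaluated against server-side data on every run.
        if not self.access_open:
            raise PermissionError("access to these scores has been revoked")
        saved = next(s for s in self.folders[user] if s.name == name)
        return sorted(
            voter_id
            for voter_id, score in self.scores.items()
            if saved.min_score <= score <= saved.max_score
        )
```

Under these assumptions, a saved list keeps working only while the server still grants access to the underlying scores; once access is closed, the saved definition remains in the folder but returns nothing.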

"Despite audit logs, Weaver said at the news conference that NGP VAN has told the campaign that no Clinton data was printed or downloaded."

The phrasing here strikes me as somewhat vague. Are they implying that Weaver's statements are in conflict with the audit logs, or are they (somewhat ineffectively) implying that "saving lists" merely equates to bookmarking a certain query?

NGP stated:

"So for voters that a user already had access to, that user was able to search by and view (but not export or save or act on) some attributes that came from another campaign."

What exactly do they mean by "view", let alone "act on"? If someone was truly dedicated to extracting data through their browser, are the terms truly mutually exclusive?

Here's an interview the Sanders campaign staffer who was fired gave explaining just what you describe: http://www.msnbc.com/thomas-roberts/watch/fired-sanders-camp...
Wow, that interview is incredibly hostile.
He is trying to pretend that they accessed the data for noble reasons rather than taking advantage of the breach while it lasted, when the latter obviously seems to be the case. They did something wrong, but the problem here is that the DNC's response is heavy-handed and unfair.

If this guy wants to help the campaign (which he obviously still does), he should stop giving interviews.

How so? It seemed completely fair to me - yes, he pushed the staffer a bit but would there have been a point to the interview if he hadn't?

I'm not quite sure if I believe the Sanders campaign story, but that seems independent from the quality of the interview; even if you do, it shouldn't be a bad thing to ask more questions of individuals involved.

Push him on it, sure. After the fifth time accusing him of stealing the data, it felt more like badgering him on the same point.

I think there are certainly some parallels, as the parent mentioned, between what counts as understanding a breach and the ethics of accessing a system.

> I'm not quite sure if I believe the Sanders campaign story

Well, if they didn't report, then we'd have more reason to doubt, no? Self-reporting deserves benefit of the doubt.

edit: Never mind my comment, now more clear on the situation.

That's interesting, thanks for adding that explanation for those of us not in the space.

What I'm surprised about is that the campaigns are willing to let this data be stored in the cloud on shared systems. I would have expected all proprietary data to be stored locally by each campaign on private in-house servers, probably with periodic data dumps of updates from the data provider.

Why?

Why put forth the expense of obtaining (purchase or rent) hardware and staff to maintain that hardware? Additionally, why put forth the time and expense to write or compose a CRM-like software solution that integrates with voter data, what sounds like a dialer/call center, and "big data" tools (Spark, Hadoop, Tableau, SSIS/SSRS) that probably needs a good 6 months lead time before the candidate even announces a run for office? Also, why would every potential candidate do this every 4 years?

Sounds like a perfect choice for a hosted solution that can be iterated on outside of the election cycle.

> "big data" tools (Spark, Hadoop, Tableau, SSIS/SSRS)

Whoa man... those technologies aren't even in play, lol. It's MSSQL and Oracle, with .NET web apps [usually] running on top.

Building on Hadoop/Spark is way outside their wheelhouse.

How many businesses use Google for email and document storage and run their entire system on AWS?

Private in-house servers are very expensive to set up and maintain. Nearly everyone stores vital personal information on someone else's servers.

> is that the campaigns are willing to let this data be stored in the cloud

Not the campaigns...the parties.

Difficulty level in replicating this dataset from secretary of state rolls?
I've been working on a project like this for some time now, and wrestling with whether I want to go with the community-based or the closed-source model.

The problems listed below are pretty exact: huge data sets, lots of cleaning and normalizing, and the snail mail/CD problem is real. Additionally, I'd note that ~40% of the states [somehow] charge for the data... it takes six digits to get a snapshot of all 50 states, and certain states (looking at you, FL) say that they do not store the historical data, meaning you have to connect with the local BoEs to aggregate the data.

A part of me [now] wants to open source this because of the DNC's actions.

> A part of me [now] wants to open source this because of the DNC's actions.

Was my thought exactly. Aggregate, then open.

Would love to hear more and see if we can't collaborate on this. My email is seth AT amicushq DOT com.
I'll fire an email your way shortly. But yeah, a conversation would be great.
Not to pollute the thread, but I'm also really interested in this, I've worked at both the Clinton campaign and NGP VAN and it seems like a very worthwhile pursuit. If you're adding people, my email is in my profile.
I've also wanted to get involved in an open voter file project like yours for a very long time. I'd love to connect as well. My email's my username at gmail, if you're interested.
If you open it up with addresses/phone/email-addresses, beware that the main users may be commercial marketers (i.e. junk mail senders), not campaigns. Also note that many states license the data with a restriction that it only be used for election purposes.
If I define open as "your campaign would have to register and be verified"... then it abides by the state/fed rules for these datasets. I can't just throw the data on GitHub.
Can you comment on this a bit?

It's not obvious to me how open access to this data would be intrinsically bad, but I suppose that relies on the assumption that it's equally available to all parties (which may not (definitely not?) be the case).

Any pointers to the statutory/regulatory guidelines would be appreciated.

The first question I ask people when talking about this project is, "Do you know your voter information is public?" About 85% are shocked and horrified that this information is available. Outside of a campaign, it is hard to say if the public would support it.

I've chatted with two different lawyers over here in Ohio... and both have advised strict security and election/campaign use only.

Not technically difficult but incredibly tedious. First, you have to go out and collect it from all 50 Secretaries of State, and in some cases county officials. Some states send you the data on a CD (no joke). You then have to clean the data, which is often not in great shape, and then normalize it.
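
As a rough sketch of the clean-and-normalize step described above: each state's file gets mapped onto one common schema. The column names in `COLUMN_MAPS` are hypothetical placeholders, not the real headers of any state's file, and real files would need far more cleanup than this.

```python
import csv
import io

# Hypothetical per-state column mappings (assumed for illustration only).
COLUMN_MAPS = {
    "OH": {
        "LAST_NAME": "last_name",
        "FIRST_NAME": "first_name",
        "RESIDENTIAL_ZIP": "zip",
    },
    "FL": {
        "Name Last": "last_name",
        "Name First": "first_name",
        "Residence Zipcode": "zip",
    },
}


def normalize_rows(state, raw_text, delimiter=","):
    """Yield records in a common schema from one state's delimited file."""
    mapping = COLUMN_MAPS[state]
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    for row in reader:
        record = {"state": state}
        for source_col, target_col in mapping.items():
            # Missing columns become empty strings; stray whitespace is trimmed.
            record[target_col] = (row.get(source_col) or "").strip()
        # Simple cleanup: consistent name casing and 5-digit ZIP codes.
        record["last_name"] = record["last_name"].title()
        record["first_name"] = record["first_name"].title()
        record["zip"] = record["zip"][:5]
        yield record
```

For example, `list(normalize_rows("FL", text, delimiter="\t"))` would read a tab-delimited file under the assumed Florida headers. The tedious part in practice is maintaining a mapping like this for every state and every format change.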

Even then, you only have a snapshot, because the states typically don't keep historical data. What this means is that your dataset won't be as good as that of someone who's been collecting this data for years, and who thereby knows things you won't, like where someone used to live, how often they voted there, who recently dropped off the registered voter rolls, etc.

In this case, even this data wouldn't be enough, because the Sanders team had made likely hundreds of thousands of contacts with voters, and recorded what issues they cared about and who they planned to vote for. This data, which they personally collected, is now inaccessible to them.

edit: expounded

Except it wasn't just a list of voters. It also included "client scores". That means the Sanders campaign had access to modelling information regarding the Clinton campaign's list. It is pretty valuable knowing how the other campaign values and/or targets specific voters and that is something that obviously can't be found from public info.
> That means the Sanders campaign had access to modelling information regarding the Clinton campaign's list.

And it also means that Clinton's campaign had access to the Sanders campaign's data.

Aristotle and NGP VAN "scoring" isn't a game changer. Losing access to their existing work is what matters.
My understanding is that the DNC contracts with VAN to manage the voter files for all fifty states. It's a shared database, with candidates able to build up their own data on top. All the campaigns can see the underlying voter data, but the additions they make are private to the individual campaign. The Sanders campaign staffers realized they were able to see Clinton campaign data they should not have access to. That's all that happened. The problem was fixed within a few hours. The Sanders people didn't abuse the bug in any significant way, in fact they reported it. But then the DNC cut off Sanders campaign access completely -- the nation-wide voter file, all their additions, inaccessible. At this point, the DNC response seems more punitive than security-related.
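
The shared-file-plus-private-additions arrangement described above could look roughly like this. It is a sketch under the commenter's description only; `SharedVoterFile` and its methods are hypothetical and say nothing about VAN's real design or where the actual bug was.

```python
class SharedVoterFile:
    """Shared base records plus per-campaign private overlays (hypothetical)."""

    def __init__(self, base):
        # voter_id -> shared record, visible to every campaign
        self.base = dict(base)
        # campaign -> {voter_id -> private fields added by that campaign}
        self.overlays = {}

    def add_private(self, campaign, voter_id, **fields):
        # Additions are stored under the campaign that made them.
        campaign_data = self.overlays.setdefault(campaign, {})
        campaign_data.setdefault(voter_id, {}).update(fields)

    def view(self, requesting_campaign, voter_id):
        # Start from the shared record, then merge in ONLY the requesting
        # campaign's overlay; other campaigns' fields are never consulted.
        record = dict(self.base[voter_id])
        own = self.overlays.get(requesting_campaign, {})
        record.update(own.get(voter_id, {}))
        return record
```

In this model, a bug of the kind discussed in the thread would amount to some code path that reads another campaign's overlay without checking who is asking.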
"The Sanders people didn't abuse the bug in any significant way"

It isn't clear if this is true.

"in fact they reported it"

VAN has not stated the issue was reported by the Sanders campaign. The claims that the Sanders campaign had reported an earlier issue are refuted in the OP, which states they had reported issues with another vendor's software.

It is possible the bug was abused:

"The database logs created by NGP VAN show that four accounts associated with the Sanders team took advantage of the Wednesday morning breach. Staffers conducted searches that would be especially advantageous to the campaign, including lists of its likeliest supporters in 10 early voting states, including Iowa and New Hampshire. Campaigns rent access to a master file of DNC voter information from the party, and update the files with their own data culled from field work and other investments. After one Sanders account gained access to the Clinton data, the audits show, that user began sharing permissions with other Sanders users. The staffers who secured access to the Clinton data included Uretsky and his deputy, Russell Drapkin. The two other usernames that viewed Clinton information were “talani" and "csmith_bernie," created by Uretsky's account after the breach began. The logs show that the Vermont senator’s team created at least 24 lists during the 40-minute breach, which started at 10:40 a.m., and saved those lists to their personal folders. The Sanders searches included New Hampshire lists related to likely voters, "HFA Turnout 60-100" and "HFA Support 50-100," that were conducted and saved by Uretsky. Drapkin's account searched for and saved lists including less likely Clinton voters, "HFA Support <30" in Iowa, and "HFA Turnout 30-70"' in New Hampshire. Despite audit logs, Weaver said at the news conference that NGP VAN has told the campaign that no Clinton data was printed or downloaded."

http://www.bloomberg.com/politics/articles/2015-12-18/sander...

A fresh account, commenting on a new, extremely controversial issue should probably disclose affiliations before getting too embroiled in arguing interpretations and facts.
...and if they don't have any affiliations?
It seems unlikely that someone so disconnected from society as to be completely uninvolved and unopinionated on the topic at hand would have created an account for the specific purpose of commenting on the story.

We are social creatures; we have a deep need to align ourselves with groups of others. And the evidence points to our having a deep need to argue as well.

Hmm, that's much more specific, and seems to go beyond simply establishing that there's a permissions problem and they can see data they shouldn't. I still think cutting off the Sanders campaign from all their data, even after the bug was fixed, is over the top. Perhaps the staffers did act inappropriately or aggressively, but deal with them, and let the campaign continue with its daily business.
To clarify:

The Sanders campaign had in October reported an unrelated software issue in a non-VAN system.

They did not report THIS issue to the DNC or NGP VAN, but claim they were gathering information about the breach for the purposes of reporting. Based on my reading of the OP, the breach was discovered by NGP VAN employees.

>The Sanders people didn't abuse the bug in any significant way, in fact they reported it.

That is pure conjecture and I'm not even sure the Sanders campaign would agree with you considering they have fired a staffer over this.

Given that every search, export, and page access is logged, and that the Sanders people knew everything they did would be traceable, do you think they really would have tried to hoover up a bunch of Clinton data? The VAN CEO is emphatic that none of the exposed data was ever exportable.
My read on this is that very few people involved in the conversation have a very good understanding of technology and gp's speculation matches my own understanding.
See the MSNBC interview with the fired staffer. They were trying to understand the extent of the disclosure in preparation for reporting it.
Very. Expensive, and many states have legal restrictions on what you can use it for and who can get it.

NationBuilder, a sorta-competitor to NGP, has put together a national voter file and it's reasonably priced. https://elections.nationbuilder.com/about/faq

The DNC/NGP voter file, however, is significantly enhanced - for one, it's got a lot of phone numbers, which most states don't include in their lists. There's a lot of other survey and consumer data associated.

For a national voter file that's good enough to use for, say, a City Council race anywhere in the country, where all you want to know is "who is likely to vote in this non-presidential election", it's doable but very expensive. For anything serious, it's pretty much outside the capabilities of anyone but the parties and some of the very big SuperPACs/very big orgs.

As I understand it, the data set contains proprietary information from the campaigns using it, so it is impossible to reconstruct it from any public source, or really any source unless the campaign has retained separate copies of all the data (which is probably impractical.)
Bad faith seems like a pretty nefarious claim. For all we know, Hillary's campaign was accessing Sanders's data this whole time. The breach went both ways.
Well the press release states: "Our team removed access to the affected data, and determined that only one campaign took actions that could possibly have led to it retaining data to which it should not have had access."
The Sanders campaign did not report the recent issue to the DNC or NGP VAN.

The Sanders campaign had reported a different issue with a different vendor's software in the past.