Introducing unverified breaches to Have I been pwned

Data breaches can be shady business. There's obviously the issue of sites being hacked in the first place which is not just shady, but downright illegal. Then there's the way this information is redistributed, the anonymous identities that deal with it and the various motives people have for bringing this data into the public eye.

One of the constant challenges with the spread of data breaches is establishing what is indeed data hacked out of an organisation versus data from another source. We've seen many recent cases where representations of a data breach have been made and the claim subsequently well and truly disproved. For example, it was recently claimed that 272 million accounts had been stolen from Hotmail, Yahoo, Gmail and Mail.ru. The mail providers subsequently confirmed that no, this was not the case. Same again for recent claims that there were 32 million Twitter accounts on the loose. Twitter quickly debunked this, and speculation that the accounts were obtained via malware has never been substantiated.

The first thing I try and do when I see a new data breach is establish if it's legitimate and I've written before about how I do this. Under no circumstances do I want to end up in a situation where I'm making a claim about an organisation being hacked which is then proven to be false, not only because of the potential reputation damage to the company, but because of the unnecessary angst it causes for those involved in the incident. Plus, any claims of this nature are being made by me as an identifiable individual; I'm not hiding behind the veil of anonymity and shirking any responsibility associated with getting my facts wrong. Integrity is essential, particularly in an area of security so frequently lacking it.

But here's the problem and the catalyst for writing this post: sometimes there are breaches where I just can't be certain of the authenticity, yet there are many indicators which point to an actual breach. The incident sits in that grey area between "very unlikely to be legitimate" and "almost certainly legitimate". For example, the Badoo breach. They've denied the data came from them so that in itself is an important factor to consider. That doesn't necessarily mean they're right, but it's a factor involved in my confidence level, particularly when the likes of LinkedIn and MySpace openly acknowledged the legitimacy of their recent breaches. The Badoo data itself is... eclectic. Here's the first row of the breach file:

INSERT INTO 'User66' VALUES ('11917635', '62', '0', '8', '0', 'None', '67', '7636', '265791', '0', 'W', 'No', '..::_\|/_::..', '', '', 'Default', 'Y3B0ZmluZHVzQHN1cGVyZXZhLml0', 'Yes', '0000-00-00 00:00:00', '', 'No', '0000-00-00 00:00:00', 'No', 'On', 'On', 'On', 'Default', 'Default', 'Default', 'On', 'On', 'On', 'Default', '11917635.onirc.cptfindus', '0e19a8bac63f97a513063dcb9a64442b', 'Default', 'UbLHyDFVtm', '1979-10-07', '29', 'M', 'No', '29', '568', '45661', '29', '0', '0', '0', '0', '0', '0', '', '0', '22555', 'enAg2oQmyS', '0', 'Yes', 'Yes', 'Yes', 'Email', '', '2013-03-14 15:03:11', '2006-12-02 00:10:37', 'No', 'Active', 'Deleted', '2009-06-05 09:38:16', '2006-12-02 00:16:14', '1990-01-01 00:00:00', '0000-00-00 00:00:00', '0000-00-00 00:00:00', 'No', 'No', '2006-12-02 00:15:17', '0000-00-00 00:00:00', '0000-00-00 00:00:00', 'Yes', 'Yes', '0', 'No', 'New', '2007-07-13 13:31:39', 'No', 'None', '0000-00-00 00:00:00', '0000-00-00 00:00:00', '0000-00-00 00:00:00', 'Changed', 'No', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'NotActive', 'NotActive', '', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', '16777216', '0', '0', '0', '', '0', 'Default', 'No', 'Default', '0', '0', '0000-00-00 00:00:00', '0', '0000-00-00 00:00:00', 'On', 'Default', 'Yes', 'Web', 'Commercial', '0000-00-00 00:00:00', 'Yes', 'Yes', '', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'Default', 'No', '0000-00-00 00:00:00', 'F', '24', '39', '10001', null, 'NOT_SET', 'No', 'Default', null, '0', 'No', '0', '0', null, null);

This implies the presence of many interesting fields, yet every subsequent row is inconsistent with this insert statement and contains significantly less information. For example, here's a sample Mailinator account (these are often disposable addresses used by individuals who are not creating genuine accounts they actually intend to use):

177459377:[redacted]@mailinator.com:0177459377:32ce6f311197613d6e77d31a66af52c0:Bah:Bah:Doo:1969-07-21:43:M:9:874:132959

Here we have an ID, email (the alias also contains the word "badoo"), username, password, first and last name (clearly fabricated above, but seemingly legitimate on many other records), what's possibly a username or alias, birth date, gender and what are likely foreign keys to other tables. The Badoo website confirms the existence of the email address via the password reset feature and that MD5 password hash has a plain text value of... "badoo".
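As a rough illustration of the checks described above, the sketch below splits a colon-delimited row into named fields and tests whether a guessed password hashes to the stored value. The field names are illustrative labels inferred from the sample row, not column names confirmed by the dump, and the comparison assumes the stored value is an unsalted MD5 of the plain text password.

```python
import hashlib

# Hypothetical labels for the 13 colon-separated values in the sample row.
# These are guesses based on the values, not names taken from the dump.
FIELDS = [
    "id", "email", "username", "password_hash", "name_1", "name_2",
    "name_3", "birth_date", "age", "gender", "key_1", "key_2", "key_3",
]


def parse_row(line):
    """Split one colon-delimited row into a dict keyed by the labels above."""
    values = line.strip().split(":")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def matches_md5(candidate, stored_hash):
    """Return True if the unsalted MD5 of candidate equals stored_hash."""
    digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()
    return digest == stored_hash.lower()


row = ("177459377:[redacted]@mailinator.com:0177459377:"
       "32ce6f311197613d6e77d31a66af52c0:Bah:Bah:Doo:1969-07-21:43:M:9:874:132959")
record = parse_row(row)

# Test the guess "badoo" against the hash stored in the row.
is_badoo = matches_md5("badoo", record["password_hash"])
```

A row with a different number of fields raises an error rather than being silently mislabelled, which is useful when the rows in a file don't all share one layout.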

But then when looking at the Badoo site, there are inconsistencies with the data. For example, you can't create an account with an email address that uses Mailinator:

Can't use Mailinator on Badoo

However, they'll happily allow one of Mailinator's hundreds of alternative domain names (such as spamhereplease.com). Now this doesn't mean that the account above didn't come from Badoo, it may simply mean that at some time after it was originally created they changed their policy on addresses and disallowed that host name. I thought I might see similar behaviour when creating a password but no, Badoo will still happily allow a password of "badoo". There are 49,941 "badoo" passwords in the dump...
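A count like the one for "badoo" passwords can be reproduced by tallying the hash column across every row. This is only a sketch: it assumes the hash sits in the fourth colon-separated position, as in the sample row, and the three rows below are made up for the example rather than taken from the dump.

```python
import hashlib
from collections import Counter

HASH_INDEX = 3  # assumed position of the password hash within each row


def md5_hex(text):
    """Unsalted MD5 of text as a lowercase hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_counts(rows):
    """Count how often each password hash appears, skipping short rows."""
    counts = Counter()
    for row in rows:
        parts = row.strip().split(":")
        if len(parts) > HASH_INDEX:
            counts[parts[HASH_INDEX].lower()] += 1
    return counts


# Fabricated rows purely for illustration.
sample_rows = [
    f"1:a@example.com:u1:{md5_hex('badoo')}:A:B:C",
    f"2:b@example.com:u2:{md5_hex('badoo')}:D:E:F",
    f"3:c@example.com:u3:{md5_hex('something else')}:G:H:I",
]

counts = hash_counts(sample_rows)
badoo_total = counts[md5_hex("badoo")]
```

Against a real file, the rows would be read line by line instead of from a list, and `counts.most_common(10)` would surface the most frequently used hashes.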

This exercise gives me some degree of confidence in the legitimacy of the breach, but the same process with other records was much less conclusive. Particularly for an incident of this size, I didn't want lingering doubts so I needed to reach out further.

Over recent weeks, I've been in contact with dozens of Have I been pwned (HIBP) subscribers who are in the alleged breach. I've been using them to help sanity check the data and the results have been... mixed. With only a limited set of data available to verify whether it actually came from Badoo, I provided snippets to the alleged owner and asked them not just if the data itself was correct, but if they'd ever created an account on Badoo. Often I'd get a simple confirmation - "Yes, I had an account there and yes, the data is correct". Other times they were adamant they'd never created an account but their personal attributes were accurate. Then in some cases, none of the data was accurate.

Now negative responses don't necessarily mean that someone didn't have an account; they could have forgotten, they could have created one with another service that Badoo since acquired or someone could have simply signed them up without their knowledge. All of these are possible. Problem is, in some cases, people would respond like this:

I get a message saying incorrect email address

This was after I suggested that one of my HIBP subscribers issue a password reset for what was allegedly their account. It's possible their email was genuinely in the system and it was simply "soft deleted" (the record was still there but merely flagged as inactive), but it's also entirely possible that Badoo has never seen this individual before. Similar story here:

I did try to ask them for password recovery since I receive your email and never had an email sent to me

But then there were confirmations from others:

Once I signed on to that site, but I did delete it

Or this slightly cryptic one:

Yes, the account has been set up but not by me. There were my details. I sent a request to the Badoo team, to removed this account. I've got a reply they have done already.

So you see the challenge in terms of verification when there are both positive and negative indicators of legitimacy. It would be irresponsible to make an outright claim that "Badoo was hacked", yet by the same token there is a very high likelihood that at least some of the data has come from them. Ultimately, the only conclusion I can emphatically reach is that the data is "unverified", which brings me to the concept of unverified data breaches in HIBP.

This is actually not a new thing; in fact, I created a UserVoice idea for it back in 2014. Actually, I did more than that: I integrated the concept of an unverified breach into the underlying data model of the system and whilst I didn't publish anything in the API documentation, there's been an "IsVerified" flag returned with the breach model for some time now. Until today, it's always returned "true", but Badoo marks the first unverified breach I've loaded.
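For anyone consuming the breach model, handling the flag might look something like the sketch below. Only the "IsVerified" name comes from this post; the "Name" field, the sample payload and the choice to treat a missing flag as verified are assumptions made for the example.

```python
import json

# Made-up payload standing in for a list of breach models.
# Only "IsVerified" is named in the post; "Name" is assumed here.
sample_response = """
[
  {"Name": "ExampleVerifiedSite", "IsVerified": true},
  {"Name": "ExampleUnverifiedSite", "IsVerified": false}
]
"""


def split_by_verification(breaches):
    """Separate breach models into verified and unverified lists."""
    verified, unverified = [], []
    for breach in breaches:
        # Assumption: a missing flag is treated as verified, matching the
        # flag's behaviour before any unverified breach was loaded.
        if breach.get("IsVerified", True):
            verified.append(breach)
        else:
            unverified.append(breach)
    return verified, unverified


breaches = json.loads(sample_response)
verified, unverified = split_by_verification(breaches)

# Label unverified breaches clearly wherever they are displayed.
labels = [f'{b["Name"]} (unverified)' for b in unverified]
```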

Because it's unverified, it's important I indicate that whenever the breach is described in the system. The first place you'll see that is on the homepage as it's within the top 10 breaches loaded into the system in terms of size:

Badoo flagged as an unverified breach

The next is that if you search for an email address and it appears in an unverified breach, there'll be an indicator in the description. Now this isn't possible with Badoo because being a dating site it's also flagged as "sensitive" which means you can't search for it publicly. However, those who've subscribed to HIBP's free notification service can still view everything they've been pwned in by following the link in the email they receive when signing up (you can come back and do this even if you're already subscribed):

Badoo breach description

Because it's there in the description of the incident, anyone who appears in the data breach and receives an email notification will see a clear explanation of the unverified nature of the data with a link through to this blog post. The point is that I want to ensure at every possible opportunity, the unverified status of the data is made perfectly clear.

I put a lot of thought into how to handle this incident and that, combined with reaching out to so many HIBP subscribers in the data set, has meant loading it the month after it originally appeared. One of the key factors driving this approach is that even if not all the data is accurate and some of it doesn't align with what Badoo holds in their system, this is people's personal data floating around the web and they want to know about it. There is certainly a threshold beneath which I won't load a "breach" regardless of how many people are in there - I still need to have a sufficient degree of confidence in it - but that's a judgement decision I'll have to make on a case by case basis.

I harp on about this but it bears repeating: dealing with data breaches responsibly at every turn is really, really important. Misrepresenting a data breach without doing sufficient research to establish legitimacy would be reckless and would make an already bad situation worse. This process isn't always easy, but it's the right thing to do and whilst I doubt my position on this will have much influence over the data breach handling industry in general (for want of a better term), hopefully it demonstrates that there are ways of handling these incidents that can act in the best interests of all involved.
