June 18, 2013

Data Leakage In A Google World

Gg3In the past, information leakage conjured images of securing data from physical theft (remember the alleged FBI laptop?) but thanks to the web, organizations need to secure information from growing “search giants”. In short, data leakage has taken on a whole new meaning in the Google age.

The causes of data leakage range from simple misconfiguration or improper classification of data, which makes it possible for Web servers to publish private and/or sensitive information, to users unwittingly (or not) storing sensitive data where they shouldn’t, Even best business practices such as site scraping -- a known method for gathering competitive intelligence in which company A automatically reads company B’s website for available data like price tables, and uses it to cut its own prices and remain on top – can lead to leakage of information.

Search Engines

By their very nature, search engines are the Internet’s biggest and most public Indexers. Search engines analyze websites, indexing them for the benefit of everyone who has ever done an Internet search. One urban legend even states that Google has a complete copy of the entire public Internet on file for data mining and analysis purposes.

As consumers and users of the Internet, we like to believe that most organizations do their best to remove sensitive information from their websites, FTP sites and other front facing business applications, As it turns out, however, this is not always the case.

Google Tables Search

Google has always been a pioneer of search algorithms, search visibility and advanced indexing, remaining one step ahead of other search engines, and introducing news ways to tag images by context, and even FTP sites for content. Their recently added Table Search capabilities, however, really brings to surface the idea of impact of data leakage in an indexed world.

This is not to say that Google is causing data leakage, but abilities like “Indexed FTP”,”Sesrch by image”, and now “Table Search” offer new ways to discover and extract data, which would otherwise have remained undiscovered.

Here’s one scary example (you can think of other interesting tables: PII, salaries,CC,…)

We used: http://research.google.com/tables?hl=en&q=username+password


Which is a structured representation of -


What is the Security takeaway?

The takeaway here is that Web security is more important now than ever before. Obviously, it doesn’t make sense to block Google from indexing your site (a business driver), but you should be aware of what content you are allowing access to, and who is accessing it.

Companies should:

  1. Implement web application security to mitigate hacker risk.
  2. Validate the content that is accessible via your web servers on a regular basis and/or implement policies to check for outgoing data.
  3. Implement policies to mitigate Bots that may scrape your website for available content.

The bottom line is that no one really wants to block Google from indexing your website, however controlling the content that your website serves is important. Organizations should note that once content is up there, the “search giants” will index it and with ever-evolving mechanisms it will become easier to get around leaked information.

Authors & Topics:

Share on LinkedIn


Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Your comment could not be posted. Error type:
Your comment has been saved. Comments are moderated and will not appear until approved by the author. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.


Post a comment

Comments are moderated, and will not appear until the author has approved them.