June 18, 2007
With new web services appearing every day, our information is increasingly spread across a myriad of different platforms. As a result, active web users are forced to log into dozens of different sites to keep up with their online life. Many people are currently working to solve this problem with content aggregation. The idea is that you should be able to log in to one website and see all your relevant information in one place. It’s a great idea, but there are some big challenges that are worth considering.
Let me start by sharing a personal story. When the Google Maps API first came out, I built a mash-up that allowed people to move around a map and see the names, addresses and phone numbers of everyone who lived in that area. I used a reverse geocoder to convert GPS coordinates to street addresses and then an address lookup service to get the information about the people who lived on each street. Since I didn’t have access to these databases myself, I extracted the data from several other websites. Everything worked great until someone found my site and posted a link on Digg. About 10 minutes after my website made the front page of Digg, two of the services I was using blocked my IP address. My 10 minutes of fame were over. My site was as dead as a doornail.
I learned an important lesson from this incident. When you pull content from another website you are ALWAYS at their mercy.
A good number of websites now provide APIs that allow third parties to interact with their data. Theoretically this makes the task of content aggregation more reliable, since you have defined methods by which to access information from that provider. Unfortunately, even with APIs you are still defenseless against the actions of the API provider. For example, last week Facebook made several code updates that broke the majority of applications on their platform. I’ve experienced the same issues while using the APIs for PayPal Pro and Google Maps. In each case, they made an “update” to their code which put my website out of business for a day. When you use an API, you must understand that it can (and will) change at any time. There will also be times when it will be broken or unavailable.
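One way to soften an outage like that is to remember the last good response from each provider and serve it, marked as stale, when a call fails. The Python sketch below is only an illustration of that idea under stated assumptions: `FallbackCache`, its `fetch` method and the provider label are hypothetical names, the cache lives in memory, and nothing here helps if a provider blocks you for good.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class FallbackCache:
    """Hypothetical helper: keeps the last good response per provider."""

    def __init__(self) -> None:
        # provider label -> (time of last successful call, data returned)
        self._last_good: Dict[str, Tuple[float, Any]] = {}

    def fetch(self, provider: str, call: Callable[[], Any]) -> Tuple[Optional[Any], bool]:
        """Return (data, is_fresh).

        data is None when the call fails and the provider has never succeeded.
        """
        try:
            data = call()
        except Exception:
            # The provider is down, has blocked us, or changed its responses
            # in a way our calling code can no longer handle.
            cached = self._last_good.get(provider)
            if cached is None:
                return None, False
            return cached[1], False
        self._last_good[provider] = (time.time(), data)
        return data, True


def working_call() -> dict:
    # Stand-in for a real API request (assumption for the example).
    return {"status": "ok"}


def broken_call() -> dict:
    # Stand-in for the same request after the provider breaks.
    raise ConnectionError("provider unavailable")


cache = FallbackCache()
fresh = cache.fetch("maps", working_call)   # fresh data is stored
stale = cache.fetch("maps", broken_call)    # falls back to the stored copy
```

The visitor sees slightly old data instead of a dead page, which buys time but does not change who is in control.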
The bigger concern is when you need to aggregate content from websites that don’t have an API. The reality today is that websites offering an API make up a tiny percentage of all the websites on the internet. To provide a truly valuable aggregator you need to scrape data from all relevant websites, not just those that offer API access. The problem is that you are completely vulnerable when you are scraping data. Any time they change their code, it has the potential to break everything you are doing. If they ever decide they don’t like you anymore, blocking you is as simple as adding your IP address to a restriction list. No matter what, you are at their mercy.
Keep in mind that few websites have an economic incentive to let you scrape their data. This is especially true for websites that make their money from advertising. When you scrape their content you are depriving them of their main source of revenue. If they ever decide they don’t like you, there’s not much you can do about it.
For those of you who are currently working on building content aggregators, I am curious to hear how you plan to address these issues. Is there another side to this that I am missing?