On Designing and Deploying Internet-Scale Services

Very interesting paper by James Hamilton. The paper gives a set of best practices for designing and developing operations-friendly services.

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

Zoie is another real-time search and indexing system built on Apache Lucene.

From their website:

Zoie is a mature open source project and has been deployed in a real-time large-scale consumer website: LinkedIn.com handling millions of searches as well as hundreds of thousands of updates daily.

All Zoie releases have gone through extensive functional and performance testing by LinkedIn before being made public. All major versions are released after a trial period on the production environment.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.
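To make the "available as soon as it is added" idea concrete, here is a minimal sketch of a toy in-memory inverted index in Python. It is not Zoie's or Lucene's API: the `RealTimeIndex` class and its `add`, `remove`, and `search` methods are hypothetical names chosen for illustration, and a real system also has to handle persistence, concurrency, and ranking, all of which this ignores.

```python
from collections import defaultdict


class RealTimeIndex:
    """Toy inverted index (hypothetical, not Zoie's API).

    A document becomes searchable as soon as add() returns;
    there is no separate commit or batch rebuild step.
    """

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of documents containing it
        self._docs = {}                    # doc id -> original text

    def add(self, doc_id, text):
        # Re-adding an existing id replaces the previous version.
        if doc_id in self._docs:
            self.remove(doc_id)
        self._docs[doc_id] = text
        for term in set(text.lower().split()):
            self._postings[term].add(doc_id)

    def remove(self, doc_id):
        text = self._docs.pop(doc_id, None)
        if text is None:
            return
        for term in set(text.lower().split()):
            self._postings[term].discard(doc_id)

    def search(self, query):
        # Return ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


if __name__ == "__main__":
    index = RealTimeIndex()
    index.add(1, "Senior search engineer job opening")
    # The document is visible to the very next query.
    print(index.search("job opening"))
```

The point of the sketch is only the visibility guarantee: the write path updates the same structure the read path queries, so a query issued right after `add` already sees the new document.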

Top 10 Internet Startup Scalability Killers

The list, taken from The Art of Scalability:

1. Thinking Scalability Is Just About Technology;

2. Overuse of Synchronous Calls;

3. Failure to Weed or Seed Soon Enough;

4. Inappropriate Use of Databases;

5. Cesspools Instead of Swim Lanes;

6. Reliance on Vertical Scale;

7. Failure to Learn from History;

8. Changing Development Methodologies to Fix Problems;

9. Too Little Caching, Too Late;

10. Overreliance on Third Parties to Scale.