vikasgoel · 9 years
Link
You don't need acres of land to grow your own veggies. If you have a sunny spot in your home, you can have a vegetable garden! Learn simple vegetable garden tips for every size garden!
vikasgoel · 10 years
Text
How should a 24-year-old invest time?
http://www.quora.com/Life-Advice/How-should-a-22-year-old-invest-time/answer/RyzVan-Rashid
By RyzWan Rashid:
When I was 22 years old, I spent time going out with my friends, talking to my girlfriend all the time, spending time smoking-up, and generally being an a$$. 
10 years later I discovered a secret that helped me achieve and learn more in 1 year than I had in all the previous 10 years. Here's that secret, so you can apply it in your life and get results and success you've never even dreamed possible.

Invest your time in the following activities and you will gain more power, money, friends, love and affection. Your friends will look up to you for your courage to do the things they've only dreamed about, and your family will be proud of your many achievements. Your colleagues will become jealous of your stratospheric success and look to you for advice on how to become successful like you.

If you don't do these things, 10 years down the line you will wonder where all your time went and what you did wrong to become such a failure. Your friends will forget you as a has-been, and your family will treat you like that 'bum-sibling' they have.

Here are 5 things you should invest your time in every day to become a success. But a WORD OF WARNING: if you try doing all of these things at the same time, you will fail. So pick one of the things mentioned below. Learn how to do it. Figure it out. Then spend 1 month learning how to do it better. Practice it every day for 1 month before taking on anything else.

After that month is over, it will have become a habit and you will be able to do it unconsciously. Then adding another thing from the list below won't seem very hard. Within the year you'll be doing each of these things unconsciously and will have developed an arsenal of habits that will support you for the rest of your life.

The Power Of Creating Good Habits

The biggest secret man has discovered over the last few decades is the power of habit. Once a habit is established, it lasts for a lifetime. Developing a habit seems hard in the beginning, but once you master something and include it in your habits, the effort is forgotten and you'll just do it easily.
Think about tying your shoelaces. How do you do it? Do you loop each lace and then knot them, or do you loop one and then knot the other lace through the loop? Each of us ties our shoelaces differently. But once we get it, we tie them the same way each time, without even thinking about it. This is the power of habit, and it is the first thing you should invest your time in.

A habit is usually formed when you repeat something in the exact same manner every day for at least 21 days. Some habits take longer, some shorter. So to average it out, invest 30 days to establish each habit.

The 5 Most Powerful Habits That Anyone Can Learn

These 5 habits will support you throughout your life - whether you decide to become an academic and do multiple Ph.D.s or are an athlete looking to turn professional, whether you are a mother looking to give your kids a better life or a high-powered businesswoman seeking venture capital investment. The habits are universal.

1. Take Care Of Your Body

No matter what you do in your life, you will do it in your body. You cannot replace it, get a new one, or trade it in. This is your body and you will live in it. It might not be the perfect body that you want, but it is the one you have. If you take care of it now, it will take care of you when you are 60, 70 or 80 years old. The way to take care of your body is simple: eat less and exercise more.

Spend 30 - 60 Minutes Each Day Exercising

This does not mean that you join a gym and start pumping weights. It means that you work every muscle in your body. Spend conscious time moving the muscles in your arms, back and legs. Take the time to join a gym, work out and use your muscles. Do this because most of us now spend more time sitting on couches than moving around. The human body was made to move around and do things. That's the first thing to take care of, so learn this habit first: spend 30 - 60 minutes each day exercising.
Eat Food That Is Fresh & Healthy

The next step in taking care of your body is to eat well. Treat your body like a home. This is the home where you live. If you bring good things into this home, it will become nicer and you'll enjoy living in it. If you bring rotten stuff into it, it will decay and you'll hate living there. So eat good stuff. Eat as much fresh fruit and vegetables as you can every day. This will give you the energy that you need.

Play A Sport That You Like Every Day

This is something that will help you relax your mind, get you moving physically, and get you out of your head and present in the moment. Pick something you do physically, like tennis or basketball. The benefit of these (or other) sports is that you get out of your head and into the moment. The rush of the moment will help your mind relax and focus on the right now, instead of being stuck analyzing the past or estimating the future.

Additionally, the competition will rejuvenate you. Competition in life is how we grow: the more competition we have, the sharper, faster and better we become. Unfortunately, in the current job environment we can't get immediate feedback from competition. Most of the time at work we are working in teams where we need the team to succeed, yet our personal success is often based on the failure of other members of the team. For example, you can't become the manager of your team if everyone else in your team also becomes the manager. So taking up a sport will give you the immediate rush of competition and the instant feedback of success or failure, and you will grow.

Avoid Junk Food

Junk food, as the name suggests, is junk. It's useless, worthless for the body. It doesn't provide the nutrition that your body needs.
If you feed your body with junk food, you will become lethargic, you will lose energy, and over time you will gain weight and lose health. Eat food that is free of preservatives, sugar and corn syrup, and not overly loaded with salt and carbs. Sugar and carbs have an addictive effect: they will cause you to crave them more and more, making your body dependent on these foods. Good foods will make your body strong and healthy. Addictive foods will make your body lethargic and rotten.

2. Take Care Of Your Mind

We are living in a time when we do most of our work using our minds. We sit at desks and computers, creating for people using our minds. Even if you think this isn't true for you - if you are a construction worker who lays bricks, or a day-care assistant who helps kids - it is. Your mind is the one thing that controls your thoughts every day, and the thoughts that you think create the reality that you see around you. So take care of your mind.

But how do you take care of your mind? The mind, like a muscle, follows one rule: use it or lose it. As long as you are using it, it will remain fit and healthy. The minute you stop using it, it will decay and rust. So keep this muscle active by doing the following.

Read Every Day

There is no difference between the person who does not read and the one who cannot read. Spend 30 minutes each morning and night reading. Read motivational books. Read books about philosophy, economics, politics and literature. Read fiction. Read self-help books. Read about parenting, read about health. Read about science and technology. But read a book, not a blog. A book has a very permanent nature. It is written with a lot of thought and research; it is the gist of an author's life experiences. So read a book every day. This will keep your mind stimulated and open to ideas. You will get a number of ideas from each author that you can implement in your life.
You will also get opinions from across the globe - ideas that would never have reached you if you only spoke to the people you met every day. So read a book. Spend 30 minutes morning and night reading one.

Write Every Day

The only way to bring your thoughts to reality is to write them down. If you don't write them down, they will be lost to the electric impulses inside your brain. So write your thoughts down, write ideas that come to you, write your philosophy about life. Write every day. This will help you clear your thoughts and formulate complete ideas. If you have a problem, write it down. You will come up with a solution more easily once you've written down the problem. If you have a crush on someone, write it down. Write down the things that you like about that person, how they make you feel, what you would do for them. All these things, when written down, will help clarify your thoughts about love and life, about right and wrong. Writing will help identify truths about your thoughts and define how you think. As you progress in your writing, read about writing better. Then write better. The better you write, the better you will think.

Develop Your Mind In Other Ways

The more neural connections your mind has, the better it is: the faster it can do things, and the better it can fight the diseases of old age, like Alzheimer's. You can create new neural connections by doing new things. The more 'new' things you try, the more developed your brain will become. Even if you suck at something, the experience of doing it - learning the rules, trying it out - will develop your mind. Listen to and watch things that will develop your mind. Instead of watching TV, watch TED Talks. Instead of listening to 'distressing' music, listen to sweet, kind music. Listen to Mozart, listen to Beethoven. Plan to learn a new skill every year. Pick up playing an instrument one year.
Spend time learning this instrument as if it will be something you play all your life. The next year, learn a new sport - something you've never tried before. This will help improve both your muscles and your mind.

Avoid Junk In Your Mind

Just as your body needs good food to run, your mind needs good fuel. If you feed your mind with junk 'input' like mindless television, excessive drama, or constant news coverage - whether TV or newspapers - your mind will become lethargic and fatigued. You will lose the will to do things, since your mind will associate doing things with depressing, sad news. So avoid all news, whether TV or newspaper. Avoid junk TV and excessive emotional drama. This will give you space in your head to do things that build up your mind. So avoid junk food for your mind and give it the resources to build itself.

3. Take Care Of Your Relationships

How do you get a friend? By being a friend. In life you are born with very few people: your parents, your siblings, your grandparents if they are still alive, your cousins if you are close to them. Every other relationship in your life you have to go out and create. You make friends along the way - some of them good, some of them not so good. These friends that you make in your journey through life will become your support system, along with the people you meet every day, from the grocer to the kid working at Best Buy. Sure, at 24 you're probably thinking, 'who gives a f*&K about these people...' but down the line it is these people, these relationships, that will matter most in your life. The way to take care of your relationships is as follows.

Remember Birthdays & Anniversaries

Even if your friends tell you they don't celebrate birthdays, even if your family becomes sullen when you call them on anniversaries, remember them. Even though people say they don't care, everyone cares about their own special days.
When you remember their birthdays & anniversaries, they will remember your kindness. But it doesn't end there. Remember the special moments in their lives. If they had a kid, remember the date and call them, or write them a card for the occasion. Yes, even in the day of email and text, a phone call or a card has a far bigger impact than a Facebook message or text. Over the years your kindness to others, in the form of remembering their special days, will snowball and you will become a powerhouse. It's good to know that someone cares about you - both for them and for you - when you are in trouble, or they are. Remember when they have hard times in their lives. If they were close to their grandparents who recently passed away, remember them on that day. Help your friends get through these moments. When you need them - when your loved ones leave you - your friends will be there to catch you before you fall.

Forgive Them Before They Ask For Forgiveness

In the long scheme of life, small things don't matter. It doesn't matter if your friend forgot to tell you first about their new job, or didn't tell you about the girl they were proposing to. Be a gentleman and forgive them in your heart even before they ask for forgiveness. Then let the incident go. They will realize you are a big-hearted person and treat you like one. But when you do this, don't resent them afterwards. Really forget the incident and forgive them. This is more for you than for them. If you keep holding onto every single hurt that anyone has done to you, your baggage will become so heavy you won't be able to get through the door. You'll be stuck inside your own head, and no one will want to be around you. No one will want to trip on your baggage. But if you forgive them and forget the incidents, you will be free. Your carefree nature will be reflected in everything you do, and everyone will want to be around you.
Avoid Emotional Vampires

No matter how good you are to people, occasionally there will be some who are vampires. They suck all the time and energy out of you. Sometimes you will encounter them at your workplace. Some of them will be your childhood friends, or even members of your family. No matter what you do, you can't change them, you can't help them improve, you cannot guide them. So the best thing to do with people like this is to avoid them. It might hurt in the beginning, but the best thing for both you and them is to stay away. You can be kind and make an excuse for not meeting them, but that will only last so long. So take the bigger step and let them know that they are an emotional strain on you, and that you'd much rather spend time with more positive people - people who support your goals, your dreams, your aspirations; people who share your ambitions and values. These are the people who will really help you grow. The vampires will get hurt, but there is no better way to deal with them, and the sooner you do it the better. Really dig deep, find the people who bother you in your life like that, and then stop meeting them and hanging out with them.

CAUTION: If it turns out that everyone in your life seems to be a vampire, you either need to change them all, or look inside yourself and change yourself. Most likely, the conclusion will be to change yourself and your attitude towards them.

4. Take Care Of Your Finances

No matter how you grew up, in abundance or poverty, it is your duty to take care of your own finances. Even if your parents have taken care of them for you, even if you have a trust fund, even if you have an empty bank account - you are responsible for them. It is your responsibility to take care of your finances. If you take care of your finances starting today, they will take care of you when you most need them: when you are old, or sick, or sending your kids to school, or helping a parent through sickness.
Your finances will help you. If you don't take care of your finances, you will end up in debt. Your shoulders will droop and your mind will be gripped by thoughts of money. You will end up living the life of an indebted servant, where you have to work to pay off your debts. But how do you take care of your finances?

Get Positive Cashflow

To begin taking care of your finances, you need to have more income than you spend - more money coming into your bank account than going out. Most people don't learn this until after they get their first job, or after they've maxed out their first credit card. As long as you have a positive cashflow, everything else in your life gets a lot easier. If you don't have positive cashflow, spend the next year or two getting it. Spend the next 15 minutes figuring out how much money you spend. Include rent for the house you live in (or your contribution if you're living with parents), utilities, groceries, car, internet, other monthly expenses, insurance, and monthly spending on shopping & entertainment. Once you have your expenses, subtract them from your income. If you have no income, then you have a negative cashflow. Do whatever you can - teach other people, pick up a 2nd job, mow lawns - whatever you have to do to get to positive cashflow.

What do you do after you have positive cashflow?

Pay Yourself First

Every cent you earn will be spent by other people for you. The government will want its cut in the form of taxes. The bank will ask for the mortgage payment, insurance, car payments... and so on, until you don't have anything left in your account. So before this happens, pay yourself first. Get into a retirement scheme where they take 5 - 10% from your salary and put it into a gratuity and provident fund. Then when you get your salary, take another 5 - 10% and put it into another account. This account is your retirement account - also known as F*&K You money.
This is the money that will give you the balls to say F*&K you to anyone you want to. Because you have this money sitting in your account, you won't become a slave to anyone; no one will control you. But a word of warning... you can't just save a small amount and then stop. This is a lifetime practice. You have to keep adding to this account until it becomes big enough that you can invest it in assets that make money for you. This is money that you only spend on assets that make money - like a house you put up for rent, government bonds, dividend funds, or a profitable business.

Financial Sinkholes To Avoid

If anything sounds too good to be true, it probably is. A few things to avoid when you're taking care of your finances:

The Stock Market - Most people will tell you to invest in it, but don't. At most, invest in an S&P index fund if you really want to. Even if a friend tells you they have a big 'tip' on a stock that could make you millions, don't invest.

Business Opportunities - If someone tells you that you can make a million dollars in 3 years by investing a small amount of money, or pitches any other such scheme, don't listen. Shut the door, hang up the phone, kick them out. If it's a friend, stop meeting him.

Lottery Schemes - If you ever hear that you've just won a cruise, or are tempted to buy a lottery ticket, or are promised X number of points: run away. Don't walk - run.

Credit Cards - Yes, even the lovely credit cards. These are probably the worst things invented since the dawn of time. Credit cards don't increase your spending power; they just make it seem that your spending power has increased. You still have to pay for what you bought, plus interest. A better alternative is to save for what you want; when you have the cash, go buy the thing you wanted.
This has 3 benefits:

1) The thing will in all likelihood be cheaper by the time you've saved for it.
2) There will be a newer, shinier model that you can now buy with the money you saved.
3) You may realize that you didn't want it in the first place and were buying it on an impulse.

The sinkholes mentioned above are all just ways to get a single $1 out of you. Don't give them the dollar. It represents a part of you, of your life. You might wonder what the big deal is about a dollar, but a dollar properly invested can become the greatest fortune in the world. Read the story about how the Native Americans sold Manhattan for a dollar, and how much that dollar would be worth today if invested properly. Hint: it is worth more than the value of all the buildings, land, and businesses on Manhattan put together.

5. Take Care Of Your Communication

The biggest problems in the world arise because of miscommunication. People misunderstand each other. Spouses fight because they don't understand what was being said. Employees get fired because of a communication error. Friends fight because of something that was misunderstood. Communication errors cause major problems in relationships between friends, employees, board members and even countries. So take care of your communication. Become a communication master - someone who can communicate clearly and effectively, not just in your speaking but in your writing and in your thoughts.

But how do you improve your communication?

Communicate At The 6th Grade Level

Yes, at the 6th grade level. This is one of the most important things you can do for your communication. If you can explain things to a 10-year-old, you can explain them to anyone. You might think that most 'educated' people will be turned off by this. But the truth is that even most educated people think at the 6th grade level.
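The arithmetic behind that hint is ordinary compound growth. A quick sketch in Python - the 7% annual return and the 1626 purchase date are assumptions chosen for illustration, and the purchase price in the traditional telling is often quoted as $24 in goods rather than $1:

```python
# Hedged illustration: the rate, the dates, and the $1 price are assumptions.
def compound(principal, rate, years):
    """Future value of `principal` after `years` of annual compounding."""
    return principal * (1 + rate) ** years

# $1 compounded at 7% a year from 1626 to 2025 (399 years):
value = compound(1.0, 0.07, 2025 - 1626)
print(f"${value:,.0f}")
```

At that rate a single dollar grows into the hundreds of billions, which is the spirit of the hint; the exact figure is extremely sensitive to the assumed rate, so treat the number as a thought experiment, not a valuation.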
It's only when they are reading research papers or grading a Ph.D. thesis that they switch into the 'educated' mind. By communicating at this level, your communication will be understood every time. Your kids will understand you, your parents will listen to you, your employers will 'get' you. In fact, by communicating at this level, everyone around you will think you are wise for being able to explain complex ideas in the simplest of manners.

Learn The Vocabulary Of Whatever You Are Doing

By learning the vocabulary of what you are doing, you will learn faster. You will be understood quicker. Your responses will be on point. Every profession, sport, online forum and clique has a different vocabulary, and the faster you learn that vocabulary and use it in your conversations, the quicker you will rise. If you play tennis, learn everything that the pros are saying. Learn what it means; then using that vocabulary when you talk to your friends at tennis will enhance your game. The same applies to your profession: the sooner you learn its vocabulary, the faster you will progress. But don't treat this as just a technique; use it to enhance your overall vocabulary.

Putting It All To Work For You

These habits, when put into action, will enhance your life profoundly. You won't feel it when you turn 23, or even when you are 24. But as you spend more and more time on them, your results will multiply and compound. Each day that you spend doing these activities, your results will increase tenfold. By the time you turn 30, you will have more friends who love you, more employers who want to hire you, and more energy than you can imagine possible. But on top of that, because your life is built around a number of activities, not just your job, you will be more fulfilled and happier in life. I know that even starting at 27 years old and implementing these in my life, by the time I turned 33 the results I was getting were more than I'd ever imagined.
It still amazes me the way my life keeps changing every 6 months to a year. Every year the results speak for themselves when you apply these.

P.S. Originally written for the question "How should a 22-year-old invest time?" but it applies even more so here.
vikasgoel · 10 years
Text
How To Stop Being Lazy And Get More Done
Cal's five big tips:

1. To-do lists are evil. Schedule everything.
2. Assume you're going home at 5:30, then plan your day backwards.
3. Make a plan for the entire week.
4. Do very few things, but be awesome at them.
5. Less shallow work; focus on the deep stuff.
vikasgoel · 10 years
Video
youtube
We have heard this many times. To have faith, confidence and belief in our abilities. Good talk to reinforce the same.
vikasgoel · 10 years
Video
7 Cliches To Never Use In A Job Interview http://www.businessinsider.com/heres-what-recruiters-look-at-during-the-6-seconds-they-spend-on-your-resume-2012-4#ooid=c5b25kbDp9Tt9EwhxiWkLF0i10dcpRp6
vikasgoel · 10 years
Link
10 Tips For Being More Productive, From The CEO Of Red Hat

Read more: http://www.inc.com/jeff-haden/10-simple-steps-to-exceptional-daily-productivity.html#ixzz2tjrGdMpq

No matter what your job, in one way our days are basically the same: we all have the same amount of time at our disposal. That's why how you use your time makes all the difference... whether you're bootstrapping a startup or running a billion-dollar company like Jim Whitehurst, the president and CEO of Red Hat, one of the largest and most successful providers of open source software. Here are Jim's tips for maximizing your time and improving your personal productivity:

1. Every Sunday night, map out your week.

Sunday evenings, I sit down with my list of important objectives for the year and for each month. Those goals inform every week and help keep me on track. While long-range goals may not be urgent, they are definitely important. If you aren't careful, it's easy for "important" to get pushed aside by "urgent."

Then I look at my calendar for the week. I know what times are blocked out by meetings, etc. Then I look at what I want to accomplish and slot those tasks onto my to-do list. The key is to create structure and discipline for your week — otherwise you'll just let things come to you... and urgent will push aside important.

2. Actively block out task time.

Everyone schedules meetings and appointments. Go a step further and block out time to complete specific tasks. Slot periods for "Write new proposal," "Craft presentation," or "Review and approve marketing materials." If you don't proactively block out that time, those tasks will slip. Or get interrupted. Or you'll lose focus. And what is important for you to get done won't actually get done.

3. Follow a realistic to-do list.

I used to create to-do lists, but I didn't assign times to each task. What happened? I always had more items on my to-do list than I could accomplish, and that turned it into a wish list, not a to-do list.
If you have six hours of meetings scheduled today and eight hours' worth of tasks, those tasks won't get done. Assigning realistic times forces you to prioritize. (I like Toodledo, but there are plenty of tools you can use.) Assigning realistic times also helps you stay focused; when you know a task should only take 30 minutes, you'll be more aggressive in weeding out or ignoring distractions.

4. Default to 30-minute meetings.

Whoever invented the one-hour default in calendar software wasted millions of people-hours. Most subjects can be handled in 30 minutes. Many can be handled in 15 minutes — especially if everyone who attends knows the meeting is only going to last 15 minutes. Don't be a slave to calendar tool defaults. Only schedule an hour if you absolutely know you need it.

5. Stop multitasking.

During a meeting — especially an hour-long meeting — it's tempting to take care of a few mindless tasks. (Who hasn't cleaned up their inbox during a meeting?) The problem is that makes those meetings less productive. Even if you're only doing mindless stuff, you're still distracted. And that makes you less productive. Multitasking is a personal productivity killer. Don't try to do two things partly well. Do one thing really well.

6. Obsess over leveraging edge time.

My biggest down times during the workday are when I drive to work, when I drive home, and when I'm in airports. So I focus really hard on how to use that time. I almost always schedule calls for my drive to work. It's easy: I take the kids to school and drop them off at a specific time; then I can do an 8:00 to 8:30 call. I typically don't schedule calls for the drive home so I can return calls, especially to people on the west coast. At the airport I use Pocket, a browser plugin that downloads articles. Loading up ten articles ahead of time ensures I have plenty to read — plenty I want to read — while I'm waiting in the security line. Look at your day. Identify the down times.
Then schedule things you can do during that time. Call it edge time — because it really can build a productive edge.

7. Track your time.

Once you start tracking your time (I use Toggl) you'll be amazed by how much time you spend doing stuff that isn't productive. You don't have to get hyper-specific. The info you log can be directional, not precise. Tracking my time is something I just started doing recently. It's been an eye-opening experience — and one that has really helped me focus.

8. Be thoughtful about lunch.

Your lunch can take an hour. Or 30 minutes. Or 10 minutes. Whatever time it takes, be thoughtful about what you do. If you like to eat at your desk and keep chugging, fine. But if you benefit from using the break to recharge, lunch is one time when multitasking is great: you can network, socialize, and help build your company's culture — but not if you go out to lunch with the same people every day. Two days a week, go out with people you don't know well. Or take a walk. Or do something personally productive. Say you take an hour for lunch each day; that's 5 hours a week. Be thoughtful about how you spend that time. You don't have to work, but you should make it work for you.

9. Obsess over protecting family time.

Like you, I'm a bit of a workaholic. So I'm very thoughtful about my evenings. When I get home from work it's family time: we have dinner as a family, and we help our kids with their homework. I completely shut down. No phone, no email. Generally speaking, we have two hours before the kids have to get ready for bed. During that time, I'm there. Then I can switch back on. I'm comfortable leaving work at 5 or 5:30 p.m. because at 8 or 9 o'clock I know I will be able to re-engage with work. Every family has peak times when they can best interact. If you don't proactively free up that time, you'll slip back into work stuff. Either be working or be home with your family. That means no phones at the table, no texts.
Don't just be there — be with your family.

10. Start every day right.

I exercise first thing in the morning because exercise is energizing. (Research also shows that moderate aerobic exercise can improve your mood for up to 12 hours.) I get up early and run. Then I cool off while I read the newspaper, and I'm downstairs before my kids so I can eat breakfast with them. Not only will you get an energy boost; efficiency in the morning sets the stage for the rest of your day. Start your day productively and your entire day will be more productive, too.
vikasgoel · 10 years
Text
Best Practices for Designing a Pragmatic RESTful API by @veesahni
Your data model has started to stabilize and you're in a position to create a public API for your web app. You realize it's hard to make significant changes to your API once it's released, and you want to get as much right as possible up front. Now, the internet has no shortage of opinions on API design. But since there's no one widely adopted standard that works in all cases, you're left with a bunch of choices: What formats should you accept? How should you authenticate? Should your API be versioned?
In designing an API for SupportFu (Customer support software for SaaS & eCommerce), I've tried to come up with pragmatic answers to these questions. My goal is for the SupportFu API to be easy to use, easy to adopt and flexible enough to dogfood for our own user interfaces.
An API is a user interface for a developer - so put some effort into making it pleasant
Use RESTful URLs and actions
Use SSL everywhere, no exceptions
An API is only as good as its documentation - so have great documentation
Version via the URL, not via headers
Use query parameters for advanced filtering, sorting & searching
Provide a way to limit which fields are returned from the API
Return something useful from POST, PATCH & PUT requests
HATEOAS isn't practical just yet
Use JSON where possible, XML only if you have to
You should use camelCase with JSON, but snake_case is 20% easier to read
Pretty print by default & ensure gzip is supported
Don't use response envelopes by default
Consider using JSON for POST, PUT and PATCH request bodies
Paginate using Link headers
Provide a way to autoload related resource representations
Provide a way to override the HTTP method
Provide useful response headers for rate limiting
Use token based authentication, transported over OAuth2 where delegation is needed
Include response headers that facilitate caching
Define a consumable error payload
Effectively use HTTP Status codes
Key requirements for the API
Many of the API design opinions found on the web are academic discussions revolving around subjective interpretations of fuzzy standards as opposed to what makes sense in the real world. My goal with this post is to describe best practices for a pragmatic API designed for today's web applications. I make no attempt to satisfy a standard if it doesn't feel right. To help guide the decision making process, I've written down some requirements that the API must strive for:
It should use web standards where they make sense
It should be friendly to the developer and be explorable via a browser address bar
It should be simple, intuitive and consistent to make adoption not only easy but pleasant
It should provide enough flexibility to power the majority of the SupportFu UI
It should be efficient, while maintaining balance with the other requirements
An API is a developer's UI - just like any UI, it's important to ensure the user's experience is thought out carefully!
Use RESTful URLs and actions
If there's one thing that has gained wide adoption, it's RESTful principles. These were first introduced by Roy Fielding in Chapter 5 of his dissertation on network-based software architectures.
The key principles of REST involve separating your API into logical resources. These resources are manipulated using HTTP requests where the method (GET, POST, PUT, PATCH, DELETE) has specific meaning.
But what can I make a resource? Well, these should be nouns (not verbs!) that make sense from the perspective of the API consumer. Although your internal models may map neatly to resources, it isn't necessarily a one-to-one mapping. The key here is to not leak irrelevant implementation details out to your API! Some of SupportFu's nouns would be ticket, user and group.
Once you have your resources defined, you need to identify what actions apply to them and how those would map to your API. RESTful principles provide strategies to handle CRUD actions using HTTP methods mapped as follows:
GET /tickets - Retrieves a list of tickets
GET /tickets/12 - Retrieves a specific ticket
POST /tickets - Creates a new ticket
PUT /tickets/12 - Updates ticket #12
PATCH /tickets/12 - Partially updates ticket #12
DELETE /tickets/12 - Deletes ticket #12
The great thing about REST is that you're leveraging existing HTTP methods to implement significant functionality on just a single /tickets endpoint. There are no method naming conventions to follow and the URL structure is clean & clear. REST FTW!
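As a rough sketch of this dispatch logic (the function name crudAction and the action labels below are my own, not from the post), the method-to-action mapping can be expressed as a small pure function that a router would call before handing off to a controller:

```go
package main

import "fmt"

// crudAction maps an HTTP method plus the presence of a resource id
// to the CRUD action it represents under the RESTful conventions above.
// An empty string means the combination should yield 405 Method Not Allowed.
func crudAction(method string, hasID bool) string {
	switch {
	case method == "GET" && !hasID:
		return "list" // GET /tickets
	case method == "GET":
		return "retrieve" // GET /tickets/12
	case method == "POST" && !hasID:
		return "create" // POST /tickets
	case method == "PUT" && hasID:
		return "replace" // PUT /tickets/12
	case method == "PATCH" && hasID:
		return "partial-update" // PATCH /tickets/12
	case method == "DELETE" && hasID:
		return "delete" // DELETE /tickets/12
	}
	return ""
}

func main() {
	fmt.Println(crudAction("GET", false))  // list
	fmt.Println(crudAction("PATCH", true)) // partial-update
}
```

Note that POST to a specific id and PUT/PATCH/DELETE to the collection fall through to the 405 case, which matches the status-code guidance later in the post.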
Should the endpoint name be singular or plural? The keep-it-simple rule applies here. Although your inner-grammatician will tell you it's wrong to describe a single instance of a resource using a plural, the pragmatic answer is to keep the URL format consistent and always use a plural. Not having to deal with odd pluralization (person/people, goose/geese) makes the life of the API consumer better and is easier for the API provider to implement (as most modern frameworks will natively handle /tickets and /tickets/12 under a common controller).
But how do you deal with relations? If a relation can only exist within another resource, RESTful principles provide useful guidance. Let's look at this with an example. A ticket in SupportFu consists of a number of messages. These messages can be logically mapped to the /tickets endpoint as follows:
GET /tickets/12/messages - Retrieves list of messages for ticket #12
GET /tickets/12/messages/5 - Retrieves message #5 for ticket #12
POST /tickets/12/messages - Creates a new message in ticket #12
PUT /tickets/12/messages/5 - Updates message #5 for ticket #12
PATCH /tickets/12/messages/5 - Partially updates message #5 for ticket #12
DELETE /tickets/12/messages/5 - Deletes message #5 for ticket #12
Alternatively, if a relation can exist independently of the resource, it makes sense to just include an identifier for it within the output representation of the resource. The API consumer would then have to hit the relation's endpoint. However, if the relation is commonly requested alongside the resource, the API could offer functionality to automatically embed the relation's representation and avoid the second hit to the API.
What about actions that don't fit into the world of CRUD operations?
This is where things can get fuzzy. There are a number of approaches:
Restructure the action to appear like a field of a resource. This works if the action doesn't take parameters. For example an activate action could be mapped to a boolean activated field and updated via a PATCH to the resource.
Treat it like a sub-resource with RESTful principles. For example, GitHub's API lets you star a gist with PUT /gists/:id/star and unstar with DELETE /gists/:id/star.
Sometimes you really have no way to map the action to a sensible RESTful structure. For example, a multi-resource search doesn't really make sense to be applied to a specific resource's endpoint. In this case, /search would make the most sense even though it isn't a noun. This is OK - just do what's right from the perspective of the API consumer and make sure it's documented clearly to avoid confusion.
SSL everywhere - all the time
Always use SSL. No exceptions. Today, your web APIs can get accessed from anywhere there is internet (like libraries, coffee shops, airports among others). Not all of these are secure. Many don't encrypt communications at all, allowing for easy eavesdropping or impersonation if authentication credentials are hijacked.
Another advantage of always using SSL is that guaranteed encrypted communications simplifies authentication efforts - you can get away with simple access tokens instead of having to sign each API request.
One thing to watch out for is non-SSL access to API URLs. Do not redirect these to their SSL counterparts. Throw a hard error instead! The last thing you want is for poorly configured clients to send requests to an unencrypted endpoint, just to be silently redirected to the actual encrypted endpoint.
Documentation
An API is only as good as its documentation. The docs should be easy to find and publicly accessible. Most developers will check out the docs before attempting any integration effort. When the docs are hidden inside a PDF file or require signing in, they're not only difficult to find but also not easy to search.
The docs should show examples of complete request/response cycles. Preferably, the requests should be pastable examples - either links that can be pasted into a browser or curl examples that can be pasted into a terminal. GitHub and Stripe do a great job with this.
Once you release a public API, you've committed to not breaking things without notice. The documentation must include any deprecation schedules and details surrounding externally visible API updates. Updates should be delivered via a blog (i.e. a changelog) or a mailing list (preferably both!).
Versioning
Always version your API. Versioning helps you iterate faster and prevents invalid requests from hitting updated endpoints. It also helps smooth over any major API version transitions as you can continue to offer old API versions for a period of time.
There are mixed opinions around whether an API version should be included in the URL or in a header. Academically speaking, it should probably be in a header. However, the version needs to be in the URL to ensure browser explorability of the resources across versions (remember the API requirements specified at the top of this post?).
I'm a big fan of the approach that Stripe has taken to API versioning - the URL has a major version number (v1), but the API has date based sub-versions which can be chosen using a custom HTTP request header. In this case, the major version provides structural stability of the API as a whole while the sub-versions account for smaller changes (field deprecations, endpoint changes, etc).
An API is never going to be completely stable. Change is inevitable. What's important is how that change is managed. Well documented and announced multi-month deprecation schedules can be an acceptable practice for many APIs. It comes down to what is reasonable given the industry and possible consumers of the API.
Result filtering, sorting & searching
It's best to keep the base resource URLs as lean as possible. Complex result filters, sorting requirements and advanced searching (when restricted to a single type of resource) can all be easily implemented as query parameters on top of the base URL. Let's look at these in more detail:
Filtering: Use a unique query parameter for each field that implements filtering. For example, when requesting a list of tickets from the /tickets endpoint, you may want to limit these to only those in the open state. This could be accomplished with a request like GET /tickets?state=open. Here, state is a query parameter that implements a filter.
Sorting: Similar to filtering, a generic parameter sort can be used to describe sorting rules. Accommodate complex sorting requirements by letting the sort parameter take in list of comma separated fields, each with a possible unary negative to imply descending sort order. Let's look at some examples:
GET /tickets?sort=-priority - Retrieves a list of tickets in descending order of priority
GET /tickets?sort=-priority,created_at - Retrieves a list of tickets in descending order of priority. Within a specific priority, older tickets are ordered first
Searching: Sometimes basic filters aren't enough and you need the power of full text search. Perhaps you're already using ElasticSearch or another Lucene based search technology. When full text search is used as a mechanism of retrieving resource instances for a specific type of resource, it can be exposed on the API as a query parameter on the resource's endpoint. Let's say q. Search queries should be passed straight to the search engine and API output should be in the same format as a normal list result.
Combining these together, we can build queries like:
GET /tickets?sort=-updated_at - Retrieve recently updated tickets
GET /tickets?state=closed&sort=-updated_at - Retrieve recently closed tickets
GET /tickets?q=return&state=open&sort=-priority,created_at - Retrieve the highest priority open tickets mentioning the word 'return'
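A sketch of how the sort parameter could be parsed server-side (the function name sortClause and the ORDER BY output format are my own assumptions for illustration; a real implementation must also whitelist field names to avoid injection):

```go
package main

import (
	"fmt"
	"strings"
)

// sortClause turns a sort query parameter like "-priority,created_at"
// into an SQL-ish ORDER BY fragment: a leading "-" means descending,
// otherwise ascending, in the order the fields were given.
func sortClause(sort string) string {
	var parts []string
	for _, f := range strings.Split(sort, ",") {
		if strings.HasPrefix(f, "-") {
			parts = append(parts, strings.TrimPrefix(f, "-")+" DESC")
		} else {
			parts = append(parts, f+" ASC")
		}
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(sortClause("-priority,created_at")) // priority DESC, created_at ASC
}
```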
Aliases for common queries
To make the API experience more pleasant for the average consumer, consider packaging up sets of conditions into easily accessible RESTful paths. For example, the recently closed tickets query above could be packaged up as GET /tickets/recently_closed
Limiting which fields are returned by the API
The API consumer doesn't always need the full representation of a resource. The ability to select and choose returned fields goes a long way in letting the API consumer minimize network traffic and speed up their own usage of the API.
Use a fields query parameter that takes a comma separated list of fields to include. For example, the following request would retrieve just enough information to display a sorted listing of open tickets:
GET /tickets?fields=id,subject,customer_name,updated_at&state=open&sort=-updated_at
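Applying the fields parameter to an output representation can be as simple as filtering a map before serialization; this is a minimal sketch under that assumption (the function name selectFields is mine, not from the post):

```go
package main

import (
	"fmt"
	"strings"
)

// selectFields trims a resource representation down to the fields the
// consumer asked for via the fields query parameter. An empty value
// returns the full representation; unknown field names are ignored.
func selectFields(resource map[string]interface{}, fields string) map[string]interface{} {
	if fields == "" {
		return resource
	}
	out := make(map[string]interface{})
	for _, f := range strings.Split(fields, ",") {
		if v, ok := resource[f]; ok {
			out[f] = v
		}
	}
	return out
}

func main() {
	ticket := map[string]interface{}{
		"id": 12, "subject": "I have a question!", "summary": "Hi, ....",
		"updated_at": "2014-02-27T10:00:00Z",
	}
	fmt.Println(selectFields(ticket, "id,subject"))
}
```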
Updates & creation should return a resource representation
A PUT, POST or PATCH call may make modifications to fields of the underlying resource that weren't part of the provided parameters (for example: created_at or updated_at timestamps). To prevent an API consumer from having to hit the API again for an updated representation, have the API return the updated (or created) representation as part of the response.
In case of a POST that resulted in a creation, use an HTTP 201 status code and include a Location header that points to the URL of the new resource.
Should you HATEOAS?
There are a lot of mixed opinions as to whether the API consumer should create links or whether links should be provided by the API. RESTful design principles specify HATEOAS, which roughly states that interaction with an endpoint should be defined within metadata that comes with the output representation and not based on out-of-band information.
Although the web generally works on HATEOAS type principles (where we go to a website's front page and follow links based on what we see on the page), I don't think we're ready for HATEOAS on APIs just yet. When browsing a website, decisions on what links will be clicked are made at run time. However, with an API, decisions as to what requests will be sent are made when the API integration code is written, not at run time. Could the decisions be deferred to run time? Sure, however, there isn't much to gain going down that route as code would still not be able to handle significant API changes without breaking. That said, I think HATEOAS is promising but not ready for prime time just yet. Some more effort has to be put in to define standards and tooling around these principles for its potential to be fully realized.
For now, it's best to assume the user has access to the documentation & include resource identifiers in the output representation which the API consumer will use when crafting links. There are a couple of advantages of sticking to identifiers - data flowing over the network is minimized and the data stored by API consumers is also minimized (as they are storing small identifiers as opposed to URLs that contain identifiers).
Also, given this post advocates version numbers in the URL, it makes more sense in the long term for the API consumer to store resource identifiers as opposed to URLs. After all, the identifier is stable across versions but the URL representing it is not!
JSON only responses
It's time to leave XML behind in APIs. It's verbose, it's hard to parse, it's hard to read, its data model isn't compatible with how most programming languages model data and its extensibility advantages are irrelevant when your output representation's primary needs are serialization from an internal representation.
I'm not going to put much effort into explaining the above mouthful as it looks like others (YouTube, Twitter & Box) have already started the XML exodus.
I'll just leave you the following Google Trends chart (XML API vs JSON API) as food for thought:
However, if your customer base consists of a large number of enterprise customers, you may find yourself having to support XML anyway. If you must do this, you'll find yourself with a new question:
Should the media type change based on Accept headers or based on the URL? To ensure browser explorability, it should be in the URL. The most sensible option here would be to append a .json or .xml extension to the endpoint URL.
snake_case vs camelCase for field names
If you're using JSON (JavaScript Object Notation) as your primary representation format, the "right" thing to do is to follow JavaScript naming conventions - and that means camelCase for field names! If you then go the route of building client libraries in various languages, it's best to use idiomatic naming conventions in them - camelCase for C# & Java, snake_case for python & ruby.
Food for thought: I've always felt that snake_case is easier to read than JavaScript's convention of camelCase. I just didn't have any evidence to back up my gut feelings, until now. Based on an eye tracking study on camelCase and snake_case (PDF) from 2010, snake_case is 20% easier to read than camelCase! That impact on readability would affect API explorability and examples in documentation.
Many popular JSON APIs use snake_case. I suspect this is due to serialization libraries following naming conventions of the underlying language they are using. Perhaps we need to have JSON serialization libraries handle naming convention transformations.
Pretty print by default & ensure gzip is supported
An API that provides white-space compressed output isn't very fun to look at from a browser. Although some sort of query parameter (like ?pretty=true) could be provided to enable pretty printing, an API that pretty prints by default is much more approachable. The cost of the extra data transfer is negligible, especially when you compare to the cost of not implementing gzip.
Consider some use cases: What if an API consumer is debugging and has their code print out data it received from the API - It will be readable by default. Or if the consumer grabbed the URL their code was generating and hit it directly from the browser - it will be readable by default. These are small things. Small things that make an API pleasant to use!
But what about all the extra data transfer?
Let's look at this with a real world example. I've pulled some data from GitHub's API, which uses pretty print by default. I'll also be doing some gzip comparisons:
$ curl https://api.github.com/users/veesahni > with-whitespace.txt
$ ruby -r json -e 'puts JSON JSON.parse(STDIN.read)' < with-whitespace.txt > without-whitespace.txt
$ gzip -c with-whitespace.txt > with-whitespace.txt.gz
$ gzip -c without-whitespace.txt > without-whitespace.txt.gz
The output files have the following sizes:
without-whitespace.txt - 1252 bytes
with-whitespace.txt - 1369 bytes
without-whitespace.txt.gz - 496 bytes
with-whitespace.txt.gz - 509 bytes
In this example, the whitespace increased the output size by 8.5% when gzip is not in play and 2.6% when gzip is in play. On the other hand, the act of gzipping in itself provided over 60% in bandwidth savings. Since the cost of pretty printing is relatively small, it's best to pretty print by default and ensure gzip compression is supported!
To further hammer in this point, Twitter found that there was an 80% savings (in some cases) when enabling gzip compression on their Streaming API. Stack Exchange went as far as to never return a response that's not compressed!
Don't use an envelope by default, but make it possible when needed
Many APIs wrap their responses in envelopes like this:
{
  "data" : {
    "id" : 123,
    "name" : "John"
  }
}
There are a couple of justifications for doing this - it makes it easy to include additional metadata or pagination information, some REST clients don't allow easy access to HTTP headers & JSONP requests have no access to HTTP headers. However, with standards that are being rapidly adopted like CORS and the Link header from RFC 5988, enveloping is starting to become unnecessary.
We can future proof the API by staying envelope free by default and enveloping only in exceptional cases.
How should an envelope be used in the exceptional cases?
There are 2 situations where an envelope is really needed - if the API needs to support cross domain requests over JSONP or if the client is incapable of working with HTTP headers.
JSONP requests come with an additional query parameter (usually named callback or jsonp) representing the name of the callback function. If this parameter is present, the API should switch to a full envelope mode where it always responds with a 200 HTTP status code and passes the real status code in the JSON payload. Any additional HTTP headers that would have been passed alongside the response should be mapped to JSON fields, like so:
callback_function({
  status_code: 200,
  next_page: "https://..",
  response: {
    ... actual JSON response body ...
  }
})
Similarly, to support limited HTTP clients, allow for a special query parameter ?envelope=true that would trigger full enveloping (without the JSONP callback function).
JSON encoded POST, PUT & PATCH bodies
If you're following the approach in this post, then you've embraced JSON for all API output. Let's consider JSON for API input.
Many APIs use URL encoding in their API request bodies. URL encoding is exactly what it sounds like - request bodies where key value pairs are encoded using the same conventions as one would use to encode data in URL query parameters. This is simple, widely supported and gets the job done.
However, URL encoding has a few issues that make it problematic. It has no concept of data types. This forces the API to parse integers and booleans out of strings. Furthermore, it has no real concept of hierarchical structure. Although there are some conventions that can build some structure out of key value pairs (like appending [ ] to a key to represent an array), this is no comparison to the native hierarchical structure of JSON.
If the API is simple, URL encoding may suffice. However, complex APIs should stick to JSON for their API input. Either way, pick one and be consistent throughout the API.
An API that accepts JSON encoded POST, PUT & PATCH requests should also require the Content-Type header be set to application/json or throw a 415 Unsupported Media Type HTTP status code.
Pagination
Envelope loving APIs typically include pagination data in the envelope itself. And I don't blame them - until recently, there weren't many better options. The right way to include pagination details today is using the Link header introduced by RFC 5988.
An API that uses the Link header can return a set of ready-made links so the API consumer doesn't have to construct links themselves. This is especially important when pagination is cursor based. Here is an example of a Link header used properly, grabbed from GitHub's documentation:
Link: <https://api.github.com/user/repos?page=3&per_page=100>; rel="next", <https://api.github.com/user/repos?page=50&per_page=100>; rel="last"
But this isn't a complete solution as many APIs do like to return the additional pagination information, like a count of the total number of available results. An API that requires sending a count can use a custom HTTP header like X-Total-Count.
Auto loading related resource representations
There are many cases where an API consumer needs to load data related to (or referenced from) the resource being requested. Rather than requiring the consumer to hit the API repeatedly for this information, there would be a significant efficiency gain from allowing related data to be returned and loaded alongside the original resource on demand.
However, as this does go against some RESTful principles, we can minimize our deviation by only doing so based on an embed (or expand) query parameter.
In this case, embed would be a comma separated list of fields to be embedded. Dot-notation could be used to refer to sub-fields. For example:
GET /tickets/12?embed=customer.name,assigned_user
This would return a ticket with additional details embedded, like:
{
  "id" : 12,
  "subject" : "I have a question!",
  "summary" : "Hi, ....",
  "customer" : {
    "name" : "Bob"
  },
  "assigned_user" : {
    "id" : 42,
    "name" : "Jim"
  }
}
Of course, ability to implement something like this really depends on internal complexity. This kind of embedding can easily result in an N+1 select issue.
Overriding the HTTP method
Some HTTP clients can only work with simple GET and POST requests. To increase accessibility to these limited clients, the API needs a way to override the HTTP method. Although there aren't any hard standards here, the popular convention is to accept a request header X-HTTP-Method-Override with a string value containing one of PUT, PATCH or DELETE.
Note that the override header should only be accepted on POST requests. GET requests should never change data on the server!
Rate limiting
To prevent abuse, it is standard practice to add some sort of rate limiting to an API. RFC 6585 introduced an HTTP status code 429 Too Many Requests to accommodate this.
However, it can be very useful to notify the consumer of their limits before they actually hit it. This is an area that currently lacks standards but has a number of popular conventions using HTTP response headers.
At a minimum, include the following headers (using Twitter's naming conventions as headers typically don't have mid-word capitalization):
X-Rate-Limit-Limit - The number of allowed requests in the current period
X-Rate-Limit-Remaining - The number of remaining requests in the current period
X-Rate-Limit-Reset - The number of seconds left in the current period
Why is number of seconds left being used instead of a time stamp for X-Rate-Limit-Reset?
A timestamp contains all sorts of useful but unnecessary information like the date and possibly the time-zone. An API consumer really just wants to know when they can send the request again, and the number of seconds answers this question with minimal additional processing on their end. It also avoids issues related to clock skew.
Some APIs use a UNIX timestamp (seconds since epoch) for X-Rate-Limit-Reset. Don't do this!
Why is it bad practice to use a UNIX timestamp for X-Rate-Limit-Reset?
The HTTP spec already specifies using RFC 1123 date formats (currently being used in Date, If-Modified-Since & Last-Modified HTTP headers). If we were to specify a new HTTP header that takes a timestamp of some sort, it should follow RFC 1123 conventions instead of using UNIX timestamps.
Authentication
A RESTful API should be stateless. This means that request authentication should not depend on cookies or sessions. Instead, each request should come with some sort of authentication credentials.
By always using SSL, the authentication credentials can be simplified to a randomly generated access token that is delivered in the user name field of HTTP Basic Auth. The great thing about this is that it's completely browser explorable - the browser will just popup a prompt asking for credentials if it receives a 401 Unauthorized status code from the server.
However, this token-over-basic-auth method of authentication is only acceptable in cases where it's practical to have the user copy a token from an administration interface to the API consumer environment. In cases where this isn't possible, OAuth 2 should be used to provide secure token transfer to a third party. OAuth 2 uses Bearer tokens & also depends on SSL for its underlying transport encryption.
An API that needs to support JSONP will need a third method of authentication, as JSONP requests cannot send HTTP Basic Auth credentials or Bearer tokens. In this case, a special query parameter access_token can be used. Note: there is an inherent security issue in using a query parameter for the token as most web servers store query parameters in server logs.
For what it's worth, all three methods above are just ways to transport the token across the API boundary. The actual underlying token itself could be identical.
Caching
HTTP provides a built-in caching framework! All you have to do is include some additional outbound response headers and do a little validation when you receive some inbound request headers.
There are 2 approaches: ETag and Last-Modified
ETag: When generating a response, include an HTTP header ETag containing a hash or checksum of the representation. This value should change whenever the output representation changes. Now, if an inbound HTTP request contains an If-None-Match header with a matching ETag value, the API should return a 304 Not Modified status code instead of the output representation of the resource.
Last-Modified: This basically works like ETag, except that it uses timestamps. The response header Last-Modified contains a timestamp in RFC 1123 format which is validated against If-Modified-Since. Note that the HTTP spec has had 3 different acceptable date formats and the server should be prepared to accept any one of them.
Errors
Just like an HTML error page shows a useful error message to a visitor, an API should provide a useful error message in a known consumable format. The representation of an error should be no different than the representation of any resource, just with its own set of fields.
The API should always return sensible HTTP status codes. API errors typically break down into 2 types: 400 series status codes for client issues & 500 series status codes for server issues. At a minimum, the API should standardize that all 400 series errors come with consumable JSON error representation. If possible (i.e. if load balancers & reverse proxies can create custom error bodies), this should extend to 500 series status codes.
A JSON error body should provide a few things for the developer - a useful error message, a unique error code (that can be looked up for more details in the docs) and possibly a detailed description. JSON output representation for something like this would look like:
{
  "code" : 1234,
  "message" : "Something bad happened :(",
  "description" : "More details about the error here"
}
Validation errors for PUT, PATCH and POST requests will need a field breakdown. This is best modeled by using a fixed top-level error code for validation failures and providing the detailed errors in an additional errors field, like so:
{
  "code" : 1024,
  "message" : "Validation Failed",
  "errors" : [
    {
      "code" : 5432,
      "field" : "first_name",
      "message" : "First name cannot have fancy characters"
    },
    {
      "code" : 5622,
      "field" : "password",
      "message" : "Password cannot be blank"
    }
  ]
}
HTTP status codes
HTTP defines a bunch of meaningful status codes that can be returned from your API. These can be leveraged to help the API consumers route their responses accordingly. I've curated a short list of the ones that you definitely should be using:
200 OK - Response to a successful GET, PUT, PATCH or DELETE. Can also be used for a POST that doesn't result in a creation.
201 Created - Response to a POST that results in a creation. Should be combined with a Location header pointing to the location of the new resource
204 No Content - Response to a successful request that won't be returning a body (like a DELETE request)
304 Not Modified - Used when HTTP caching headers are in play
400 Bad Request - The request is malformed, such as if the body does not parse
401 Unauthorized - When no or invalid authentication details are provided. Also useful to trigger an auth popup if the API is used from a browser
403 Forbidden - When authentication succeeded but authenticated user doesn't have access to the resource
404 Not Found - When a non-existent resource is requested
405 Method Not Allowed - When an HTTP method is being requested that isn't allowed for the authenticated user
410 Gone - Indicates that the resource at this end point is no longer available. Useful as a blanket response for old API versions
415 Unsupported Media Type - If incorrect content type was provided as part of the request
422 Unprocessable Entity - Used for validation errors
429 Too Many Requests - When a request is rejected due to rate limiting
In Summary
An API is a user interface for developers. Put the effort in to ensure it's not just functional but pleasant to use.
A good article on building a RESTful API by @PuerkitoBio :
Build a RESTful API with Martini
I’ve been looking for an excuse to try Martini ever since it was announced on the golang-nuts mailing list. Martini is a Go package for web server development that skyrocketed to close to 2000 stars on GitHub in just a few weeks (the first public commit is a month old). So I decided to build an example application that implements a (pragmatic) RESTful API, based on mostly-agreed-upon best practices. The companion code for this article can be found on GitHub.
Why Martini?
Martini has many things going for it, chief among them is the very elegant API using only a thin layer of abstraction over the stdlib’s excellent net/http package, and the fact that it understands the ubiquitous http.Handler interface.
Another key point is that if Martini ever feels magical to you (me no like magic), you absolutely should peek at the code. It is a very slim and manageable ~400 lines of code (it was this morning anyway), with a single external dependency, the inject package, another skinny ~100 lines of code.
Please note that it is currently under active development, and some examples may be broken due to recent changes. I’ll try to keep my source code repository up-to-date.
Use cases
The example application will expose a single resource, music albums, under the /albums URI. It will support:
GET /albums : list all available albums, with optional filtering based on band, title or year using query string parameters
GET /albums/:id : fetch a specific album
POST /albums : create an album
PUT /albums/:id : update an album
DELETE /albums/:id : delete an album.
To make things interesting, responses can be requested in JSON, XML or plain text. Response format will be determined based on the endpoint's extension (.json, .xml or .text, defaulting to JSON).
Because the data store is not the goal of the app, a (read-write) mutex-controlled map is used as in-memory “database”, and it implements an interface that defines the required actions:
type DB interface {
    Get(id int) *Album
    GetAll() []*Album
    Find(band, title string, year int) []*Album
    Add(a *Album) (int, error)
    Update(a *Album) error
    Delete(id int)
}
The album data structure is defined like this:
type Album struct {
    XMLName xml.Name `json:"-" xml:"album"`
    Id      int      `json:"id" xml:"id,attr"`
    Band    string   `json:"band" xml:"band"`
    Title   string   `json:"title" xml:"title"`
    Year    int      `json:"year" xml:"year"`
}

func (a *Album) String() string {
    return fmt.Sprintf("%s - %s (%d)", a.Band, a.Title, a.Year)
}
The field tags control the marshaling of the structure to JSON and XML. The XMLNamefield gives the structure its element name in XML, and is ignored in JSON. The Idfield is set as an attribute in XML. The other tags simply give a lower-cased name to the serialized fields. The plain text format will use the fmt.Stringer interface - that is, the func String() string function.
With this out of the way, let’s see how Martini is actually used.
DRY Martini
At the heart of the martini package is the martini.Martini type, which implements the http.Handler interface, so that it can be passed to http.ListenAndServe() like any stdlib handler. Another important notion is that martini uses a middleware-based approach - meaning that you can configure a list of functions to be called in a specific order before the actual route handler is executed. This is very useful to setup things like logging, authentication, session management, etc., and it helps keep things DRY.
The package provides the martini.Classic() function that creates an instance with sane defaults - common middleware like panic recovery, logging and static file support. This is great for a web site, but for an API, we don’t care much about serving static pages, so we won’t use the Classic Martini.
Thankfully this is just a convenience function, it is always possible to create a bare Martini instance and configure it manually as needed. Our version looks like this:
var m *martini.Martini

func init() {
    m = martini.New()
    // Setup middleware
    m.Use(martini.Recovery())
    m.Use(martini.Logger())
    m.Use(auth.Basic(AuthToken, ""))
    m.Use(MapEncoder)
    // Setup routes
    r := martini.NewRouter()
    r.Get(`/albums`, GetAlbums)
    r.Get(`/albums/:id`, GetAlbum)
    r.Post(`/albums`, AddAlbum)
    r.Put(`/albums/:id`, UpdateAlbum)
    r.Delete(`/albums/:id`, DeleteAlbum)
    // Inject database
    m.MapTo(db, (*DB)(nil))
    // Add the router action
    m.Action(r.Handle)
}
The panic recovery and logger middleware are fairly obvious. auth.Basic() is a handler provided by the userland add-ons repository martini-contrib, and it is a little bit too naive for an API, since we can only feed it a single username-password, and all requests are checked against this tuple. In a more realistic scenario, we would need to support any number of valid access tokens, so this basic auth handler wouldn’t fit.
Let’s skip over the MapEncoder middleware for now, we’ll come back to it in a minute. The next step is to setup the routes, and martini provides a nice clean way to do this. It supports placeholders for parameters, and you can even throw some regular expressions in there, that’s how the path will end up anyway. The second parameter to the Get, Post, Put and co. is the handler to call for this route. Many handlers can be passed on the same route definition (this is a variadic parameter), and they will be executed in order, until one of them writes a response.
Then we define a global dependency. This is a very neat feature of martini (wait, or is it icky ? :-), it supports global- and request-scoped dependencies, and when it encounters a handler (middleware function or route handler) that asks for a parameter of that type, the dependency injector will feed it the right value. In this case, m.MapTo() maps the db package variable (the instance of our in-memory database) to the DB interface that we defined earlier. This particular case doesn’t get much added value versus using the thread-safe, package-global db variable directly, but in other cases (like the encoder, see below) it can prove very useful.
The syntax for the second parameter may seem weird, it is just converting nil to the pointer-to-DB-interface type, because all the injector needs is the type to map the first parameter to.
The final step, m.Action(), adds the router’s configuration to the list of handlers that Martini will call.
The MapEncoder middleware
Back to the MapEncoder middleware function, it sets the Encoder interface dependency for the current request based on the requested encoding format:
// An Encoder implements an encoding format of values to be sent as response to
// requests on the API endpoints.
type Encoder interface {
    Encode(v ...interface{}) (string, error)
}

// The regex to check for the requested format (allows an optional trailing
// slash).
var rxExt = regexp.MustCompile(`(\.(?:xml|text|json))\/?$`)

// MapEncoder intercepts the request's URL, detects the requested format,
// and injects the correct encoder dependency for this request. It rewrites
// the URL to remove the format extension, so that routes can be defined
// without it.
func MapEncoder(c martini.Context, w http.ResponseWriter, r *http.Request) {
    // Get the format extension
    matches := rxExt.FindStringSubmatch(r.URL.Path)
    ft := ".json"
    if len(matches) > 1 {
        // Rewrite the URL without the format extension
        l := len(r.URL.Path) - len(matches[1])
        if strings.HasSuffix(r.URL.Path, "/") {
            l--
        }
        r.URL.Path = r.URL.Path[:l]
        ft = matches[1]
    }
    // Inject the requested encoder
    switch ft {
    case ".xml":
        c.MapTo(xmlEncoder{}, (*Encoder)(nil))
        w.Header().Set("Content-Type", "application/xml")
    case ".text":
        c.MapTo(textEncoder{}, (*Encoder)(nil))
        w.Header().Set("Content-Type", "text/plain; charset=utf-8")
    default:
        c.MapTo(jsonEncoder{}, (*Encoder)(nil))
        w.Header().Set("Content-Type", "application/json")
    }
}
Here, the MapTo() injector function is called on a martini.Context, meaning that the dependency is scoped to this particular request. The code also sets the correct “Content-Type” header, since even errors will be returned using the requested format.
The route handlers
I won’t go into details of all route handlers (they are defined in the file api.go in the example repository), but I’ll show one to talk about how martini handles the return values. The single-album GET handler looks like this:
func GetAlbum(enc Encoder, db DB, parms martini.Params) (int, string) {
    id, err := strconv.Atoi(parms["id"])
    al := db.Get(id)
    if err != nil || al == nil {
        return http.StatusNotFound, Must(enc.Encode(
            NewError(ErrCodeNotExist,
                fmt.Sprintf("the album with id %s does not exist", parms["id"]))))
    }
    return http.StatusOK, Must(enc.Encode(al))
}
First, we can see that martini.Params can be used as parameter to get a map of named parameters defined on the route pattern. If the id is not an integer, or if it doesn’t exist in the database (I know for a fact that there’s no id=0 in the database, which is why I use this dual-purpose if), a 404 status code is returned with a correctly-encoded error message. Note the use of the Must() function, since we have a Recovery() middleware that will trap panics and return a 500, we can safely panic in case of server-side errors. More serious projects would probably want to return a custom message along with the 500, though.
Finally, if all goes well, a code 200 is returned, along with the encoded album. If a route handler returns two values, and the first is an int, Martini will use this first value as the status code, and will write the second value as a string to the http.ResponseWriter. If the first value is not an int, or if there is only one return value, it will write the first value to the http.ResponseWriter.
curl calls
Let’s see how it goes with some actual calls to the API.
$ curl -i -k -u token: "https://localhost:8001/albums"
HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 27 Nov 2013 02:31:46 GMT
Content-Length: 201

[{"id":1,"band":"Slayer","title":"Reign In Blood","year":1986},{"id":2,"band":"Slayer","title":"Seasons In The Abyss","year":1990},{"id":3,"band":"Bruce Springsteen","title":"Born To Run","year":1975}]
The -k option is required if you use a self-signed certificate. The -u option specifies the user:password, which in our case is simply token: (empty password). The -i option prints the whole response, including headers. The response includes the full list of albums (the database is initialized with those 3 albums).
$ curl -i -k -u token: "https://localhost:8001/albums.text?band=Slayer"
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Date: Wed, 27 Nov 2013 02:36:46 GMT
Content-Length: 68

Slayer - Reign In Blood (1986)
Slayer - Seasons In The Abyss (1990)
In this case, the text format is requested, and a filter is used on the band Slayer. Let’s try a POST:
$ curl -i -k -u token: -X POST --data "band=Carcass&title=Heartwork&year=1994" "https://localhost:8001/albums"
HTTP/1.1 201 Created
Content-Type: application/json
Location: /albums/4
Date: Wed, 27 Nov 2013 02:38:55 GMT
Content-Length: 57

{"id":4,"band":"Carcass","title":"Heartwork","year":1994}
The status code is 201 - Created, the “Location” header is specified with the URL to the newly created resource (note that the URL should be absolute, I was lazy there), and the resource is returned in the default format (JSON). If we try to create the same resource again, in XML for a change:
$ curl -i -k -u token: -X POST --data "band=Carcass&title=Heartwork&year=1994" "https://localhost:8001/albums.xml"
HTTP/1.1 409 Conflict
Content-Type: application/xml
Date: Wed, 27 Nov 2013 02:41:36 GMT
Content-Length: 171

<?xml version="1.0" encoding="UTF-8"?>
<albums><error code="2"><message>the album &#39;Heartwork&#39; from &#39;Carcass&#39; already exists</message></error></albums>
The error is returned in the correct format, with a status code 409. Updates (PUT) work fine too (I mean, everybody knows Heartwork was released in 1993, right?):
$ curl -i -k -u token: -X PUT --data "band=Carcass&title=Heartwork&year=1993" "https://localhost:8001/albums/4"
HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 27 Nov 2013 02:45:29 GMT
Content-Length: 57

{"id":4,"band":"Carcass","title":"Heartwork","year":1993}
And finally, the delete operation:
$ curl -i -k -u token: -X DELETE "https://localhost:8001/albums/1"
HTTP/1.1 204 No Content
Content-Type: application/json
Date: Wed, 27 Nov 2013 02:46:59 GMT
Content-Length: 0
https required
You don’t want to expose a basic-authenticated API (or any but the most basic public API) over http, and it is recommended to return an error in case of a clear-text call instead of silently redirecting to https, so that the API consumer can take notice. There are many ways to do this; if you have a reverse proxy in front of your API server, that may be a good place to do it. In the example app, I start two listeners, one on http and another on https, and the http server always returns an error:
func main() {
    go func() {
        // Listen on http: to raise an error and indicate that https: is required.
        if err := http.ListenAndServe(":8000", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            http.Error(w, "https scheme is required", http.StatusBadRequest)
        })); err != nil {
            log.Fatal(err)
        }
    }()
    // Listen on https: with the preconfigured martini instance. The certificate files
    // can be created using this command in this repository's root directory:
    //
    //   go run /path/to/goroot/src/pkg/crypto/tls/generate_cert.go --host="localhost"
    //
    if err := http.ListenAndServeTLS(":8001", "cert.pem", "key.pem", m); err != nil {
        log.Fatal(err)
    }
}
What’s missing
This is a simple API example application, but it still handles many common API tasks. Martini makes this easy and elegant, thanks to its routing and dependency injection mechanisms. In the short time I took to write this article, it already evolved quite a bit, and some handlers got added to the martini-contrib repository too.
If you intend to build a production-quality API, be aware that there are a few important things missing from this small example application, though:
A more powerful authentication mechanism, obviously (a single access token won’t do!)
Support for JSON- (and/or XML-) encoded request bodies for POST and PUT verbs (and PATCH)
Support for 405 - Method not allowed response code (currently, the API will return 404 when an unsupported method is used on a supported route)
Support for GZIP compression of responses (I see that there is now a Gzip middleware on the martini-contrib repository)
Probably more depending on your requirements!
However, this article should’ve given you a taste of how it can be done with Martini.
vikasgoel · 10 years
Video
MongoDB and the Democratic Party – A Case Study by Pramod Sadalage
0 notes
vikasgoel · 10 years
Link
Nice talk on NoSQL and related design discussions by @martinfowler
vikasgoel · 10 years
Link
http://www.mongodb.com/presentations/schema-design-6
vikasgoel · 10 years
Video
Nice talk on HTTP cache. For anyone who works on web applications, this is a must watch presentation.
(via http://www.youtube.com/attribution_link?u=/watch?v=HKNZ-tQQnSY&feature=share&a=fKgo_DDwJnWkhgtAV9U5PQ)
vikasgoel · 10 years
Link
Company | Funding | Focus
MongoDB (formerly 10gen) | $231 million | Document-oriented database
Mu Sigma | $208 million | Data-Science-as-a-Service
Cloudera | $141 million | Hadoop-based software, services and training
Opera Solutions | $114 million | Data-Science-as-a-Service
Hortonworks | $98 million | Hadoop-based software, services and training
Guavus | $87 million [updated] | Big data analytics solution
DataStax | $83.7 million | Cassandra-based big data platform
GoodData | $75.5 million | Cloud-based platform and big data apps
Talend | $61.6 million | Application and business process integration platform
Couchbase | $56 million | Document-oriented database
vikasgoel · 11 years
Text
What are the early symptoms that a startup is going to fail?
William Pietri:
No demonstrated user need. For example, consider 3D movies and TV. If you ask people why they sometimes prefer stage to screen, nobody ever says, "Oh, movies are only 2D." 3D tech has novelty value, but even a little research would show its pushers that most people are perfectly happy to go back to 2D movies after experiencing 3D, and that many  actively avoid 3D. That wasn't the case when sound or color or fancier special effects were added.
Fear of testing hypotheses. As a founder, I can say it: most startups launch in a cloud of hype and bullshit. That hype is really useful: Startups are difficult and painful; you have to be really excited to do it. But the hype is also dangerous: It lets people assume they just can't fail. If a startup team doesn't seek contact with reality early and often, they will have a bunch of surprises in store. It hurts to find out your ideas are dumb, so you have to really want to know the truth more than you want to feel comfortable.
No love for the audience. If you are going to spent years studying and serving people, I think you have to love them. YouTube's first designer, who happens to be an old friend, would grab a video camera, hop on his motorcycle, and go to users' houses to see them in their natural element. He's a natural democratizer of technology, and wants people to get really involved in what he's making.
No love for the domain. I would never work at a sports startup, because I don't care much about sports, and never will. If I'm going to do my best work, I really need to love what I'm going to spend all day thinking about. 
No love for the team. Hostility between roles? Hostility between founders? Management that doesn't really care for the employees? That company is probably doomed.
A desire for perfection. Perfection kills. The things that your early adopters care about? Those should be awesome. Everything else? Fuck it. A team that has a hard time being pragmatic will spend a lot of their time and money on shit that doesn't matter. And that will keep them from getting the product out early enough to get useful feedback.
Not thinking about revenue. A lot of people want to make a product, not a business. What's the difference? The latter makes enough money to pay the bills. I get it: products are exciting; commerce is banal and a little grubby. But until it's a solid business, it's not sustainable. Building shit without thinking about money? Really fun. But startups like that are just playing dress-up at $1m a year.
Caring too much about what other people think. Some people are really worried about what the competition thinks. Or what their friends will think. Or what's cool in Silicon Valley. Or even what their investors think. When instead they should be caring about what their users think, and whether they're staying true to their own vision.
Being in it for the wrong reasons. Is the company being built to flip? Are the people in it to get rich? Is the fun part showboating for the press and the digerati? Are they doing it just to build something they think people should want? Are they high on a Big Idea? God help them.
A high Dunning-Kruger quotient. The heart of the Dunning–Kruger effect is that clueless people can't tell that they're clueless. Teams that know that they don't know much: generally awesome. Teams that think they know it all? Very dangerous.
Matt Rutkowski:
Lack of faith in what the company wants to realize. If you do not believe 100% in what you want to achieve, you will never be successful.
Quarrels between the founders at the early days of the company. If people with a common goal are unable to reach an agreement... it's usually the first sign that the business will fall apart.
If you have a team of people who look for problems rather than opportunities, you can forget about success. They won't pay attention to you or suggest solutions; everyone thinks they know best and stays in their comfort zone.
You can't sell your product, or it's difficult to sell. Many startups enter the market with a product without checking whether there is a need for it. If you have big problems selling it, it's only a matter of time before the startup gets into even more trouble.
The investor is against you. The fact that the investor gives money does not mean that he wants your company to fare well. For many reasons, mainly personal, the investor may want to get rid of the shares in your company.
vikasgoel · 11 years
Link
One of the important aspects of REST (or at least HTTP) is the concept that some operations (verbs) are idempotent. As Gregor Roth said several years ago:
The PUT method is idempotent. An idempotent method means that the result of a successful performed request is independent of the number of times it is executed.
Idempotency is also something discussed in the SOA Design Patterns. Over the years there has been a lot said about REST and its perceived benefits over other approaches, with idempotency just one aspect that is often assumed to be understood yet typically not investigated fully. However, a recent posting on the Service Oriented Architecture mailing list has brought a lot of discussion around a concept that many had appeared to assume was fairly straightforward. The inital post starts with:
Currently I am looking at idempotency of our services and coming up with a prescriptive guide for the enterprise on creating idempotent services. The question is: do all services need to be idempotent? Is it realistic?
It references another blog entry, where that author makes the following assertions:
Of course, implementing idempotency can be complex for the service provider. And complexity costs money. That's probably the real reason why idempotent services are so rare. [...] But once a service is made idempotent, it is foolproof, and can guarantee data consistency across integration boundaries. Services should be made idempotent. It should be an architectural and service design principle. Especially so for services which are used by many service consumers. You can't really push out your own complexity of keeping data consistent to your consumers, can you? The consumers are your customers, you should treat them as such. You gain more customers if you make it easy for them.
This initial posting has generated a lot of responses, including from Cap Gemini's Steve Jones, who has this to say:
Idempotent is one of the great rubbish phrases of IT; what it means is 'idempotent in memory'. If you have a comment service with an update capability then clearly, from the perspective of the invoker, the service has to maintain and manage state. What you mean is: can it be horizontally scaled in memory? If you update state, including within a database, then you cannot be idempotent. Idempotent is where you call the same function with the same value and the result is exactly the same; that is the mathematical definition. If you ever update state then you are not idempotent; the fact that it's 'a database update' doesn't change anything.
Mark Baker, who has posted a lot on REST in the past, responds to Steve:
There are non-idempotent state updates, like your comment example, and there are idempotent updates that set some datum to some specific value... since obviously if you do that multiple times with the same input, the result will be the same.
Steve agrees with Mark to a point: an operation is mathematically idempotent if it changes nothing given the same input value. However, if there is any update to state caused by the operation, e.g., it records the latest time of the request, then it is not idempotent.
I've been quite 'impressed' at how people misuse the term these days in IT, I've seen several occasions where people claimed something was 'idempotent' because it wasn't stateful in-memory, even though its consumer effect was recording transactions.
Mark agrees with Steve's strict definition, but believes it does not apply in the context of the original request: idempotency in distributed systems:
[...] the word can only meaningfully refer to the interface and not to the implementation. So if I define an operation, say "set", and define it to be idempotent, then that's all that matters to clients, even if a logentry is generated as part of an implementation of that interface.
Another respondant, Ashaf Galal, makes the assertion that all "reading services" are idempotent since the service is only returning data. However, as Steve points out, this is too simplistic a view and whilst an operation may appear idempotent its implementation may be anything but idempotent (again raising the question about whether this distinction is important):
[...] reading capabilities don't have to be idempotent.  'getNextIterator()' is a reading capability that isn't idempotent as it would increment the iterator.  A banking request for a balance wouldn't be idempotent as the request would create an audit log.  The returned result might be the same for two subsequent calls (if no changes have happened) but the log entry would be different.
Ashaf responds to Steve by stating that idempotence is a property of the service and not its interface.
The client needs to get a proper response from the service, and he doesn't care whether the service is idempotent or not, logging or not.
So what do others think? Is there still a misunderstanding with what idempotency means? Does it matter if an operation that is supposed to be idempotent actually makes changes to some state, such as updating a log as Steve mentions?
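The interface-level distinction Mark draws is easy to demonstrate in code. A toy sketch (not from the discussion itself): a "set" applied twice with the same input leaves the same state, so a lost response can be blindly retried, while an "increment" (or a getNextIterator-style call) cannot:

```go
package main

import "fmt"

type counter struct{ value int }

// Set is idempotent at the interface level: applying it N times with the
// same input yields the same state as applying it once.
func (c *counter) Set(v int) { c.value = v }

// Increment is not idempotent: every call produces a different result.
func (c *counter) Increment() { c.value++ }

func main() {
	a := &counter{}
	a.Set(5)
	a.Set(5) // retrying after a lost response is safe
	fmt.Println(a.value) // 5

	b := &counter{}
	b.Increment()
	b.Increment() // a blind retry double-counts
	fmt.Println(b.value) // 2
}
```

Note that, as the thread points out, this says nothing about the implementation: Set may still write an audit-log entry per call and remain idempotent from the invoker's perspective.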
vikasgoel · 11 years
Text
Great overview of Amazon's Dynamo technology overview by @werner
Amazon's Dynamo
By Werner Vogels on 02 October 2007 08:10 AM | Permalink | Comments (20)
In two weeks we’ll present a paper on the Dynamo technology at SOSP, the prestigious biannual Operating Systems conference. Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly-available key-value storage system. The technology is designed to give its users the ability to trade-off cost, consistency, durability and performance, while maintaining high-availability.
Let me emphasize the internal technology part before it gets misunderstood: Dynamo is not directly exposed externally as a web service; however, Dynamo and similar Amazon technologies are used to power parts of our Amazon Web Services, such as S3.
We submitted the technology for publication in SOSP because many of the techniques used in Dynamo originate in the operating systems and distributed systems research of the past years; DHTs, consistent hashing, versioning, vector clocks, quorum, anti-entropy based recovery, etc. As far as I know Dynamo is the first production system to use the synthesis of all these techniques, and there are quite a few lessons learned from doing so. The paper is mainly about these lessons.
We are extremely fortunate that the paper was selected for publication in SOSP; only a very few true production systems have made it into the conference and as such it is a recognition of the quality of the work that went into building a real incrementally scalable storage system in which the most important properties can be appropriately configured.
Dynamo is representative of a lot of the work that we are doing at Amazon; we continuously develop cutting edge technologies using recent research, and in many cases do the research ourselves. Much of the engineering work at Amazon, whether it is in infrastructure, distributed systems, workflow, rendering, search, digital, similarities, supply chain, shipping or any of the other systems, is equally highly advanced.
The official reference for the paper is:
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
A pdf version is available here. You can also read the full online version.
The text of the paper is copyright of the ACM and as such the following statement applies:
© ACM, 2007. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SOSP’07, October 14–17, 2007, Stevenson, Washington, USA, Copyright 2007 ACM 978-1-59593-591-5/07/0010
Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Amazon.com
Abstract
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.  To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management; D.4.5 [Operating Systems]: Reliability; D.4.2 [Operating Systems]: Performance;
General Terms Algorithms, Management, Measurement, Performance, Design, Reliability.
1. Introduction
Amazon runs a world-wide e-commerce platform that serves tens of millions customers at peak times using tens of thousands of servers located in many data centers around the world. There are strict operational requirements on Amazon’s platform in terms of performance, reliability and efficiency, and to support continuous growth the platform needs to be highly scalable. Reliability is one of the most important requirements because even the slightest outage has significant financial consequences and impacts customer trust. In addition, to support continuous growth, the platform needs to be highly scalable.
One of the lessons our organization has learned from operating Amazon’s platform is that the reliability and scalability of a system is dependent on how its application state is managed. Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers.
Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.
To meet the reliability and scaling needs, Amazon has developed a number of storage technologies, of which the Amazon Simple Storage Service (also available outside of Amazon and known as Amazon S3), is probably the best known. This paper presents the design and implementation of Dynamo, another highly available and scalable distributed data store built for Amazon’s platform. Dynamo is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance. Amazon’s platform has a very diverse set of applications with different storage requirements. A select set of applications requires a storage technology that is flexible enough to let application designers configure their data store appropriately based on these tradeoffs to achieve high availability and guaranteed performance in the most cost effective manner.
There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key only interface to meet the requirements of these applications.
Dynamo uses a synthesis of well known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing [10], and consistency is facilitated by object versioning [12]. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. Dynamo employs a gossip based distributed failure detection and membership protocol. Dynamo is a completely decentralized system with minimal need for manual administration. Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution.
In the past year, Dynamo has been the underlying storage technology for a number of the core services in Amazon’s e-commerce platform. It was able to scale to extreme peak loads efficiently without any downtime during the busy holiday shopping season. For example, the service that maintains shopping cart (Shopping Cart Service) served tens of millions requests that resulted in well over 3 million checkouts in a single day and the service that manages session state handled hundreds of thousands of concurrently active sessions.
The main contribution of this work for the research community is the evaluation of how different techniques can be combined to provide a single highly-available system. It demonstrates that an eventually-consistent storage system can be used in production with demanding applications. It also provides insight into the tuning of these techniques to meet the requirements of production systems with very strict performance demands.
The paper is structured as follows. Section 2 presents the background and Section 3 presents the related work. Section 4 presents the system design and Section 5 describes the implementation. Section 6 details the experiences and insights gained by running Dynamo in production and Section 7 concludes the paper. There are a number of places in this paper where additional information may have been appropriate but where protecting Amazon’s business interests require us to reduce some level of detail. For this reason, the intra- and inter-datacenter latencies in section 6, the absolute request rates in section 6.2 and outage lengths and workloads in section 6.3 are provided through aggregate measures instead of absolute details.
2. Background
Amazon’s e-commerce platform is composed of hundreds of services that work in concert to deliver functionality ranging from recommendations to order fulfillment to fraud detection. Each service is exposed through a well defined interface and is accessible over the network. These services are hosted in an infrastructure that consists of tens of thousands of servers located across many data centers world-wide. Some of these services are stateless (i.e., services which aggregate responses from other services) and some are stateful (i.e., a service that generates its response by executing business logic on its state stored in persistent store).
Traditionally production systems store their state in relational databases. For many of the more common usage patterns of state persistence, however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS. This excess functionality requires expensive hardware and highly skilled personnel for its operation, making it a very inefficient solution. In addition, the available replication technologies are limited and typically choose consistency over availability. Although many advances have been made in the recent years, it is still not easy to scale-out databases or use smart partitioning schemes for load balancing.
This paper describes Dynamo, a highly available data storage technology that addresses the needs of these important classes of services. Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window, is efficient in its resource usage, and has a simple scale out scheme to address growth in data set size or request rates. Each service that uses Dynamo runs its own Dynamo instances.
2.1 System Assumptions and Requirements
The storage system for this class of services has the following requirements:
Query Model: simple read and write operations to a data item that is uniquely identified by a key. State is stored as binary objects (i.e., blobs) identified by unique keys. No operations span multiple data items and there is no need for relational schema. This requirement is based on the observation that a significant portion of Amazon’s services can work with this simple query model and do not need any relational schema. Dynamo targets applications that need to store objects that are relatively small (usually less than 1 MB).
ACID Properties: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability. This has been widely acknowledged by both the industry and academia [5]. Dynamo targets applications that operate with weaker consistency (the “C” in ACID) if this results in high availability. Dynamo does not provide any isolation guarantees and permits only single key updates.
Efficiency: The system needs to function on a commodity hardware infrastructure. In Amazon’s platform, services have stringent latency requirements which are in general measured at the 99.9th percentile of the distribution. Given that state access plays a crucial role in service operation the storage system must be capable of meeting such stringent SLAs (see Section 2.2 below). Services must be able to configure Dynamo such that they consistently achieve their latency and throughput requirements. The tradeoffs are in performance, cost efficiency, availability, and durability guarantees.
Other Assumptions: Dynamo is used only by Amazon’s internal services. Its operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization. Moreover, since each service uses its distinct instance of Dynamo, its initial design targets a scale of up to hundreds of storage hosts. We will discuss the scalability limitations of Dynamo and possible scalability related extensions in later sections.
2.2 Service Level Agreements (SLA)
To guarantee that the application can deliver its functionality in a bounded time, each and every dependency in the platform needs to deliver its functionality with even tighter bounds. Clients and services engage in a Service Level Agreement (SLA), a formally negotiated contract where a client and a service agree on several system-related characteristics, which most prominently include the client’s expected request rate distribution for a particular API and the expected service latency under those conditions. An example of a simple SLA is a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
In Amazon’s decentralized service oriented infrastructure, SLAs play an important role. For example a page request to one of the e-commerce sites typically requires the rendering engine to construct its response by sending requests to over 150 services. These services often have multiple dependencies, which frequently are other services, and as such it is not uncommon for the call graph of an application to have more than one level. To ensure that the page rendering engine can maintain a clear bound on page delivery each service within the call chain must obey its performance contract.
Figure 1 shows an abstract view of the architecture of Amazon’s platform, where dynamic web content is generated by page rendering components which in turn query many other services. A service can use different data stores to manage its state and these data stores are only accessible within its service boundaries. Some services act as aggregators by using several other services to produce a composite response. Typically, the aggregator services are stateless, although they use extensive caching.
Figure 1: Service-oriented architecture of Amazon’s platform.
A common approach in the industry for forming a performance oriented SLA is to describe it using average, median and expected variance. At Amazon we have found that these metrics are not good enough if the goal is to build a system where all customers have a good experience, rather than just the majority. For example if extensive personalization techniques are used then customers with longer histories require more processing which impacts performance at the high-end of the distribution. An SLA stated in terms of mean or median response times will not address the performance of this important customer segment. To address this issue, at Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution. The choice for 99.9% over an even higher percentile has been made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much. Experiences with Amazon’s production systems have shown that this approach provides a better overall experience compared to those systems that meet SLAs defined based on the mean or median.
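To make this contrast concrete, the following Python sketch (with invented latency numbers, not Amazon data) shows how a latency distribution can look healthy on average while violating a 99.9th-percentile SLA such as the 300ms bound from the example above:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering fraction p of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical latency log (ms): most requests are fast, a small tail is slow.
latencies = [20] * 950 + [80] * 45 + [400] * 5

mean = sum(latencies) / len(latencies)   # 24.6 ms: looks healthy
p999 = percentile(latencies, 0.999)      # 400 ms: exposes the tail
```

Here the mean (24.6ms) is far below a 300ms target, yet the 99.9th percentile (400ms) violates it; an SLA stated in terms of the mean would miss exactly the customers with the worst experience.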
In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers’ relentless focus on performance from the perspective of the customers’ experience. Many papers report on averages, so these are included where it makes sense for comparison purposes. Nevertheless, Amazon’s engineering and optimization efforts are not focused on averages. Several techniques, such as the load balanced selection of write coordinators, are purely targeted at controlling performance at the 99.9th percentile.
Storage systems often play an important role in establishing a service’s SLA, especially if the business logic is relatively lightweight, as is the case for many Amazon services. State management then becomes the main component of a service’s SLA. One of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.
2.3 Design Considerations
Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface. To achieve this level of consistency, these algorithms are forced to tradeoff the availability of the data under certain failure scenarios. For instance, rather than dealing with the uncertainty of the correctness of an answer, the data is made unavailable until it is absolutely certain that it is correct. From the very early replicated database works, it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously [2, 11]. As such systems and applications need to be aware which properties can be achieved under which conditions.
For systems prone to server and network failures, availability can be increased by using optimistic replication techniques, where changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated. The challenge with this approach is that it can lead to conflicting changes which must be detected and resolved. This process of conflict resolution introduces two problems: when to resolve them and who resolves them. Dynamo is designed to be an eventually consistent data store; that is, all updates reach all replicas eventually.
An important design consideration is to decide when to perform the process of resolving update conflicts, i.e., whether conflicts should be resolved during reads or writes. Many traditional data stores execute conflict resolution during writes and keep the read complexity simple [7]. In such systems, writes may be rejected if the data store cannot reach all (or a majority of) the replicas at a given time. On the other hand, Dynamo targets the design space of an “always writeable” data store (i.e., a data store that is highly available for writes). For a number of Amazon services, rejecting customer updates could result in a poor customer experience. For instance, the shopping cart service must allow customers to add and remove items from their shopping cart even amidst network and server failures. This requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.
The next design choice is who performs the process of conflict resolution. This can be done by the data store or the application. If conflict resolution is done by the data store, its choices are rather limited. In such cases, the data store can only use simple policies, such as “last write wins” [22], to resolve conflicting updates. On the other hand, since the application is aware of the data schema it can decide on the conflict resolution method that is best suited for its client’s experience. For instance, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart. Despite this flexibility, some application developers may not want to write their own conflict resolution mechanisms and choose to push it down to the data store, which in turn chooses a simple policy such as “last write wins”.
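The difference between datastore-level and application-level resolution can be illustrated with a small Python sketch (the data and function names are hypothetical):

```python
def last_write_wins(versions):
    """Datastore-level policy: keep only the version with the latest
    timestamp. Concurrent updates on the losing replica are dropped."""
    return max(versions, key=lambda v: v["ts"])["cart"]

def merge_carts(versions):
    """Application-level policy for a shopping cart: union the item sets so
    that no "add to cart" is ever lost (deleted items may resurface)."""
    merged = set()
    for v in versions:
        merged |= v["cart"]
    return merged

# Two conflicting replicas produced during a network partition.
a = {"ts": 1, "cart": {"book", "pen"}}
b = {"ts": 2, "cart": {"book", "lamp"}}

lww = last_write_wins([a, b])   # {'book', 'lamp'}: the concurrent 'pen' add is lost
merged = merge_carts([a, b])    # {'book', 'pen', 'lamp'}: nothing is lost
```

Only the application knows that a set union is the right merge for a cart; the data store, lacking that schema knowledge, can only fall back to a generic policy such as last write wins.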
Other key principles embraced in the design are:
Incremental scalability: Dynamo should be able to scale out one storage host (henceforth, referred to as “node”) at a time, with minimal impact on both operators of the system and the system itself.
Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance.
Decentralization: An extension of symmetry, the design should favor decentralized peer-to-peer techniques over centralized control. In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it runs on; e.g., the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.
3. Related Work
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at the problem of data storage and distribution. The first generation of P2P systems, such as Freenet and Gnutella, were predominantly used as file sharing systems. These were examples of unstructured P2P networks where the overlay links between peers were established arbitrarily. In these networks, a search query is usually flooded through the network to find as many peers as possible that share the data. P2P systems evolved to the next generation into what is widely known as structured P2P networks. These networks employ a globally consistent protocol to ensure that any node can efficiently route a search query to some peer that has the desired data. Systems like Pastry [16] and Chord [20] use routing mechanisms to ensure that queries can be answered within a bounded number of hops. To reduce the additional latency introduced by multi-hop routing, some P2P systems (e.g., [14]) employ O(1) routing where each peer maintains enough routing information locally so that it can route requests (to access a data item) to the appropriate peer within a constant number of hops.
Various storage systems, such as Oceanstore [9] and PAST [17] were built on top of these routing overlays. Oceanstore provides a global, transactional, persistent storage service that supports serialized updates on widely replicated data. To allow for concurrent updates while avoiding many of the problems inherent with wide-area locking, it uses an update model based on conflict resolution. Conflict resolution was introduced in [21] to reduce the number of transaction aborts. Oceanstore resolves conflicts by processing a series of updates, choosing a total order among them, and then applying them atomically in that order. It is built for an environment where the data is replicated on an untrusted infrastructure. By comparison, PAST provides a simple abstraction layer on top of Pastry for persistent and immutable objects. It assumes that the application can build the necessary storage semantics (such as mutable files) on top of it.
3.2 Distributed File Systems and Databases
Distributing data for performance, availability and durability has been widely studied in the file system and database systems community. Compared to P2P storage systems that only support flat namespaces, distributed file systems typically support hierarchical namespaces. Systems like Ficus [15] and Coda [19] replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures. The Farsite system [1] is a distributed file system that does not use any centralized server like NFS. Farsite achieves high availability and scalability using replication. The Google File System [6] is another distributed file system built for hosting the state of Google’s internal applications. GFS uses a simple design with a single master server for hosting the entire metadata and where the data is split into chunks and stored in chunkservers. Bayou is a distributed relational database system that allows disconnected operations and provides eventual data consistency [21].
Among these systems, Bayou, Coda and Ficus allow disconnected operations and are resilient to issues such as network partitions and outages. These systems differ on their conflict resolution procedures. For instance, Coda and Ficus perform system level conflict resolution and Bayou allows application level resolution. All of them, however, guarantee eventual consistency. Similar to these systems, Dynamo allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms. Distributed block storage systems like FAB [18] split large objects into smaller blocks and store each block in a highly available manner. In comparison to these systems, a key-value store is more suitable in this case because: (a) it is intended to store relatively small objects (size < 1 MB) and (b) key-value stores are easier to configure on a per-application basis. Antiquity is a wide-area distributed storage system designed to handle multiple server failures [23]. It uses a secure log to preserve data integrity, replicates each log on multiple servers for durability, and uses Byzantine fault tolerance protocols to ensure data consistency. In contrast to Antiquity, Dynamo does not focus on the problem of data integrity and security and is built for a trusted environment. Bigtable is a distributed storage system for managing structured data. It maintains a sparse, multi-dimensional sorted map and allows applications to access their data using multiple attributes [2]. Compared to Bigtable, Dynamo targets applications that require only key/value access with a primary focus on high availability where updates are not rejected even in the wake of network partitions or server failures.
Traditional replicated relational database systems focus on the problem of guaranteeing strong consistency to replicated data. Although strong consistency provides the application writer a convenient programming model, these systems are limited in scalability and availability [7]. These systems are not capable of handling network partitions because they typically provide strong consistency guarantees.
3.3 Discussion
Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an “always writeable” data store where no updates are rejected due to failures or concurrent writes. This is a crucial requirement for many Amazon applications. Second, as noted earlier, Dynamo is built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Third, applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported by traditional databases). Fourth, Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes (which is the typical design adopted by several distributed hash table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby increasing the latency at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.
4. System Architecture
The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. Describing the details of each of the solutions is not possible, so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling. Table 1 presents a summary of the list of techniques Dynamo uses and their respective advantages.
Table 1: Summary of techniques used in Dynamo and their advantages.
Problem | Technique | Advantage
Partitioning | Consistent Hashing | Incremental Scalability
High Availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates.
Handling temporary failures | Sloppy Quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background.
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.
4.1 System Interface
Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context. The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.
Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
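As an illustration of this step, the key-to-identifier mapping can be sketched as follows (the function name is hypothetical):

```python
import hashlib

def key_position(key: bytes) -> int:
    """Map a caller-supplied key, treated as an opaque byte array, to a
    128-bit position in the identifier space via MD5."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

pos = key_position(b"cart:12345")
assert 0 <= pos < 2 ** 128                  # the identifier space is 128 bits
assert pos == key_position(b"cart:12345")   # placement is deterministic
```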
4.2 Partitioning Algorithm
One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. To address these issues, Dynamo uses a variant of consistent hashing (similar to the one used in [10, 20]): instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring. The process of fine-tuning Dynamo’s partitioning scheme is discussed in Section 6.
Using virtual nodes has the following advantages:
If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes.
When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
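A minimal Python sketch of a consistent-hash ring with virtual nodes follows (this illustrates the technique, not Dynamo's implementation; the class, node names, and token counts are invented):

```python
import bisect
import hashlib

def _hash(data: str) -> int:
    """Map a string to a position in the 128-bit ring space via MD5."""
    return int.from_bytes(hashlib.md5(data.encode()).digest(), "big")

class Ring:
    """Consistent-hash ring in which each physical node owns several tokens
    ("virtual nodes"); a node with more tokens serves a larger share of keys."""

    def __init__(self):
        self.positions = []   # sorted token positions on the ring
        self.owners = {}      # token position -> physical node

    def add_node(self, node: str, num_tokens: int) -> None:
        for i in range(num_tokens):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self.positions, pos)
            self.owners[pos] = node

    def coordinator(self, key: str) -> str:
        """Walk clockwise from the key's position to the first token,
        wrapping around past the largest hash value."""
        idx = bisect.bisect_right(self.positions, _hash(key))
        return self.owners[self.positions[idx % len(self.positions)]]

ring = Ring()
ring.add_node("A", 8)
ring.add_node("B", 8)
ring.add_node("C", 16)   # higher-capacity node: twice the tokens
```

Adding or removing a node reassigns only the key ranges adjacent to that node's tokens, and because those tokens are scattered around the ring, the transferred load is spread across many peers rather than falling on a single neighbor.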
4.3 Replication
To achieve high availability and durability, Dynamo replicates its data on multiple hosts. Each data item is replicated at N hosts, where N is a parameter configured “per-instance”. Each key, k, is assigned to a coordinator node (described in the previous section). The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor. In Figure 2, node B replicates the key k at nodes C and D in addition to storing it locally. Node D will store the keys that fall in the ranges (A, B], (B, C], and (C, D].
Figure 2: Partitioning and replication of keys in Dynamo ring.
The list of nodes that is responsible for storing a particular key is called the preference list. The system is designed, as will be explained in Section 4.8, so that every node in the system can determine which nodes should be in this list for any particular key. To account for node failures, the preference list contains more than N nodes. Note that with the use of virtual nodes, it is possible that the first N successor positions for a particular key may be owned by less than N distinct physical nodes (i.e. a node may hold more than one of the first N positions). To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
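The construction of a preference list that skips duplicate physical nodes can be sketched as follows (the token layout is invented for illustration):

```python
import bisect
import hashlib

def _hash(data: str) -> int:
    """Map a string to a position in the 128-bit ring space via MD5."""
    return int.from_bytes(hashlib.md5(data.encode()).digest(), "big")

def preference_list(tokens, key: str, n: int):
    """Walk the ring clockwise from the key's position, skipping tokens whose
    physical node is already on the list, until n distinct nodes are found.

    tokens: sorted list of (position, physical_node) pairs."""
    positions = [p for p, _ in tokens]
    idx = bisect.bisect_right(positions, _hash(key))
    nodes = []
    for step in range(len(tokens)):
        node = tokens[(idx + step) % len(tokens)][1]
        if node not in nodes:          # skip virtual nodes of a node already chosen
            nodes.append(node)
            if len(nodes) == n:
                break
    return nodes

# Hypothetical ring: three physical nodes with four tokens each.
tokens = sorted((_hash(f"{node}#{i}"), node)
                for node in ("A", "B", "C") for i in range(4))
replicas = preference_list(tokens, "cart:12345", n=3)
```

Without the duplicate check, two of the first three clockwise tokens could belong to the same physical host, which would silently reduce the number of independent replicas below N.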
4.4 Data Versioning
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. If there are no failures then there is a bound on the update propagation times. However, under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.
There is a category of applications in Amazon’s platform that can tolerate such inconsistencies and can be constructed to operate under these conditions. For example, the shopping cart application requires that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. Note that both “add to cart” and “delete item from cart” operations are translated into put requests to Dynamo. When a customer wants to add an item to (or remove from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.
In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time. Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). However, version branching may happen, in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation). A typical example of a collapse operation is “merging” different versions of a customer’s shopping cart. Using this reconciliation mechanism, an “add to cart” operation is never lost. However, deleted items can resurface.
It is important to understand that certain failure modes can potentially result in the system having not just two but several versions of the same data. Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories, which the system will need to reconcile in the future. This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).
Dynamo uses vector clocks [12] in order to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks. If the counters on the first object’s clock are less-than-or-equal to all of the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.
In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is done by passing the context it obtained from an earlier read operation, which contains the vector clock information. Upon processing a read request, if Dynamo has access to multiple branches that cannot be syntactically reconciled, it will return all the objects at the leaves, with the corresponding version information in the context. An update using this context is considered to have reconciled the divergent versions and the branches are collapsed into a single new version.
Figure 3: Version evolution of an object over time.
To illustrate the use of vector clocks, let us consider the example shown in Figure 3. A client writes a new object. The node (say Sx) that handles the write for this key increases its sequence number and uses it to create the data's vector clock. The system now has the object D1 and its associated clock [(Sx, 1)]. The client updates the object. Assume the same node handles this request as well. The system now also has object D2 and its associated clock [(Sx, 2)]. D2 descends from D1 and therefore over-writes D1, however there may be replicas of D1 lingering at nodes that have not yet seen D2. Let us assume that the same client updates the object again and a different server (say Sy) handles the request. The system now has data D3 and its associated clock [(Sx, 2), (Sy, 1)].
Next assume a different client reads D2 and then tries to update it, and another node (say Sz) does the write. The system now has D4 (descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A node that is aware of D1 or D2 could determine, upon receiving D4 and its clock, that D1 and D2 are overwritten by the new data and can be garbage collected. A node that is aware of D3 and receives D4 will find that there is no causal relation between them. In other words, there are changes in D3 and D4 that are not reflected in each other. Both versions of the data must be kept and presented to a client (upon a read) for semantic reconciliation.
Now assume some client reads both D3 and D4 (the context will reflect that both values were found by the read). The read's context is a summary of the clocks of D3 and D4, namely [(Sx, 2), (Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node Sx coordinates the write, Sx will update its sequence number in the clock. The new data D5 will have the following clock: [(Sx, 3), (Sy, 1), (Sz, 1)].
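The causal comparison and the version evolution above can be sketched in a few lines (Python; the dictionary representation, function names, and the replayed D1–D5 history are illustrative, not Dynamo's actual API):

```python
def descends(a, b):
    """True if clock `a` descends from (or equals) clock `b`, i.e. every
    counter in `b` is <= the corresponding counter in `a`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflict(a, b):
    """Two versions conflict when neither descends from the other."""
    return not descends(a, b) and not descends(b, a)

# Replaying the example of Figure 3:
D1 = {"Sx": 1}
D2 = {"Sx": 2}                       # overwrites D1
D3 = {"Sx": 2, "Sy": 1}              # descends from D2, written via Sy
D4 = {"Sx": 2, "Sz": 1}              # also descends from D2, written via Sz
D5 = {"Sx": 3, "Sy": 1, "Sz": 1}     # reconciled write coordinated by Sx

assert descends(D2, D1) and descends(D3, D2) and descends(D4, D2)
assert conflict(D3, D4)              # concurrent: both kept for reconciliation
assert descends(D5, D3) and descends(D5, D4)
```

Note that the read context returned to the client ([(Sx, 2), (Sy, 1), (Sz, 1)]) is exactly the pairwise maximum of the D3 and D4 clocks, which is why the subsequent write at Sx subsumes both branches.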
A possible issue with vector clocks is that their size may grow if many servers coordinate the writes to an object. In practice, this is not likely because writes are usually handled by one of the top N nodes in the preference list. During network partitions or multiple server failures, however, write requests may be handled by nodes that are not among the top N nodes in the preference list, causing the size of the vector clock to grow. In these scenarios, it is desirable to limit the size of the vector clock. To this end, Dynamo employs the following clock truncation scheme: along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies in reconciliation, as the descendant relationships cannot be derived accurately. However, this problem has not surfaced in production and therefore has not been thoroughly investigated.
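A minimal sketch of the truncation scheme (Python; the function name and threshold are illustrative, with the threshold taken from the paper's "say 10"):

```python
import time

TRUNCATION_THRESHOLD = 10  # illustrative; the paper suggests 10

def record_write(clock, timestamps, node, now=None):
    """Bump `node`'s counter and remember when it last updated the item.
    When the clock exceeds the threshold, drop the least recently
    updated (node, counter) pair."""
    now = time.time() if now is None else now
    clock[node] = clock.get(node, 0) + 1
    timestamps[node] = now
    if len(clock) > TRUNCATION_THRESHOLD:
        oldest = min(timestamps, key=timestamps.get)
        del clock[oldest]
        del timestamps[oldest]
    return clock

clock, stamps = {}, {}
for i in range(12):                       # 12 distinct coordinators
    record_write(clock, stamps, f"S{i}", now=i)
assert len(clock) == TRUNCATION_THRESHOLD # oldest pairs were truncated
assert "S0" not in clock and "S11" in clock
```

Dropping "S0" is precisely the lossy step: a later comparison against a clock that still carries an (S0, counter) pair can no longer prove descent, so the versions appear concurrent and fall back to reconciliation.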
4.5 Execution of get () and put () operations
Any storage node in Dynamo is eligible to receive client get and put operations for any key. In this section, for the sake of simplicity, we describe how these operations are performed in a failure-free environment, and in the subsequent section we describe how read and write operations are executed during failures.
Both get and put operations are invoked using Amazon’s infrastructure-specific request processing framework over HTTP. There are two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach is that the client does not have to link any code specific to Dynamo in its application, whereas the second strategy can achieve lower latency because it skips a potential forwarding step.
A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list. If the requests are received through a load balancer, requests to access a key may be routed to any random node in the ring. In this scenario, the node that receives the request will not coordinate it if the node is not in the top N of the requested key’s preference list. Instead, that node will forward the request to the first among the top N nodes in the preference list.
Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key’s preference list are accessed. When there are node failures or network partitions, nodes that are lower ranked in the preference list are accessed.
To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems. This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
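The overlap guarantee implied by R + W > N can be checked exhaustively for the common configuration (Python; illustrative):

```python
from itertools import combinations

N, R, W = 3, 2, 2          # the common Dynamo configuration (Section 6)
assert R + W > N           # the quorum-like condition

nodes = ["A", "B", "C"]    # the top N healthy nodes in a preference list

# Every possible W-node write set intersects every possible R-node read
# set, so a read always contacts at least one replica that holds the
# latest successfully written version.
for write_set in combinations(nodes, W):
    for read_set in combinations(nodes, R):
        assert set(write_set) & set(read_set)
```

This is the pigeonhole argument behind the quorum condition: W + R nodes drawn from a pool of N must share at least W + R - N members.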
Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes. If at least W-1 nodes respond then the write is considered successful.
Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client. If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
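A simplified coordinator sketch of these two paths (Python; the dictionary stand-ins for replicas and all names are hypothetical — real Dynamo replicas are remote nodes reached over the messaging substrate):

```python
def put(local, peers, key, value, clock, W):
    """Coordinator sketch: write locally, send the new version and clock
    to the highest-ranked reachable replicas, succeed on W-1 acks."""
    local[key] = (value, dict(clock))     # coordinator writes locally first
    acks = 0
    for peer in peers:
        try:
            peer[key] = (value, dict(clock))
            acks += 1
        except Exception:
            continue                      # dict stand-ins never fail; a real RPC might
    return acks >= W - 1                  # W-1 remote acks + the local write

def get(local, peers, key, R):
    """Coordinator sketch: gather versions, require R responses, and
    return every causally unrelated version for reconciliation."""
    responses = [s[key] for s in [local] + peers if key in s]
    if len(responses) < R:
        raise TimeoutError("fewer than R replicas responded")
    # Keep one copy of each syntactically identical version; a real
    # coordinator also drops versions subsumed by another version's clock.
    return list({repr(v): v for v in responses}.values())
```

For example, after `put(local, peers, "k", "v", {"Sx": 1}, W=2)` succeeds, `get(local, peers, "k", R=2)` returns a single version, since all replicas hold the same (value, clock) pair.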
4.6 Handling Failures: Hinted Handoff
If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To remedy this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.
Consider the example of Dynamo configuration given in Figure 2 with N=3. In this example, if node A is temporarily down or unreachable during a write operation then a replica that would normally have lived on A will now be sent to node D. This is done to maintain the desired availability and durability guarantees. The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A). Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A. Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
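A sketch of the mechanism (Python; the store layouts and function names are illustrative):

```python
def write_with_hint(store, hinted_store, key, value, intended, reachable):
    """Accept a replica on behalf of a down node, tagged with a hint
    naming the intended recipient (A in the example above)."""
    if intended in reachable:
        store[(intended, key)] = value
    else:
        hinted_store.setdefault(intended, {})[key] = value

def handoff_scan(store, hinted_store, reachable):
    """Periodic scan of the separate hinted database: once the intended
    node has recovered, deliver its replicas and drop the local copies."""
    for intended in list(hinted_store):
        if intended in reachable:
            for key, value in hinted_store.pop(intended).items():
                store[(intended, key)] = value
```

In the Figure 2 example, D would call `write_with_hint` while A is down and `handoff_scan` after A recovers, after which D may delete its hinted copy without reducing the replica count.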
Using hinted handoff, Dynamo ensures that read and write operations do not fail due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key to its local store. Thus, the write request is only rejected if all nodes in the system are unavailable. However, in practice, most Amazon services in production set a higher W to meet the desired level of durability. A more detailed discussion of configuring N, R and W follows in Section 6.
It is imperative that a highly available storage system be capable of handling the failure of an entire data center(s). Data center failures happen due to power outages, cooling failures, network failures, and natural disasters. Dynamo is configured such that each object is replicated across multiple data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without a data outage.
4.7 Handling permanent failures: Replica synchronization
Hinted handoff works best if the system membership churn is low and node failures are transient. There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.
To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees [13]. A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas. For instance, if the hash values of the roots of two trees are equal, then the values of the leaf nodes in the trees are equal and the nodes require no synchronization. If not, it implies that the values of some replicas are different. In such cases, the nodes may exchange the hash values of children and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”. Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.
Dynamo uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the tree traversal scheme described above the nodes determine if they have any differences and perform the appropriate synchronization action. The disadvantage with this scheme is that many key ranges change when a node joins or leaves the system thereby requiring the tree(s) to be recalculated. This issue is addressed, however, by the refined partitioning scheme described in Section 6.2.
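The following sketch builds a Merkle root over a key range and then compares leaf hashes directly, which is the endpoint of the tree traversal described above (Python; illustrative, not Dynamo's implementation):

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(items):
    """Build a Merkle tree over a key->value dict (keys sorted so both
    replicas agree on leaf order); leaves hash values, parents hash
    their concatenated children."""
    level = [h(v) for _, v in sorted(items.items())]
    if not level:
        return h("")
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else ""))
                 for i in range(0, len(level), 2)]
    return level[0]

# Two replicas of the same key range, one of them stale:
a = {"k1": "v1", "k2": "v2", "k3": "v3"}
b = {"k1": "v1", "k2": "STALE", "k3": "v3"}

# Differing roots mean some replica values differ; descending the trees
# (here collapsed to a direct leaf comparison) isolates the stale keys.
assert merkle_root(a) != merkle_root(b)
out_of_sync = [k for k in sorted(a) if h(a[k]) != h(b.get(k, ""))]
assert out_of_sync == ["k2"]
```

Only the divergent branch's hashes travel over the network; the values of k1 and k3 are never transferred.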
4.8 Membership and Failure Detection
4.8.1 Ring Membership
In Amazon’s environment node outages (due to failures and maintenance tasks) are often transient but may last for extended intervals. A node outage rarely signifies a permanent departure and therefore should not result in rebalancing of the partition assignment or repair of the unreachable replicas. Similarly, manual error could result in the unintentional startup of new Dynamo nodes. For these reasons, it was deemed appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring. An administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership change to join a node to a ring or remove a node from a ring. The node that serves the request writes the membership change and its time of issue to persistent store. The membership changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories.
When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets. The mapping is persisted on disk and initially contains only the local node and token set. The mappings stored at different Dynamo nodes are reconciled during the same communication exchange that reconciles the membership change histories. Therefore, partitioning and placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges handled by its peers. This allows each node to forward a key’s read/write operations to the right set of nodes directly.
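One gossip exchange between two peers can be sketched as follows (Python; the (time, node, change) history representation is illustrative — Dynamo persists richer membership records):

```python
def reconcile(history_a, history_b):
    """Merge two persisted membership-change histories in place; each
    entry is (time_of_issue, node, 'join' | 'leave').  Both peers
    converge on the union, and the latest change per node wins."""
    merged = sorted(set(history_a) | set(history_b))
    history_a[:] = merged
    history_b[:] = merged
    latest = {}
    for _, node, change in merged:     # chronological order
        latest[node] = change
    return {n for n, c in latest.items() if c == "join"}
```

For example, if peer A has seen only `(1, "A", "join")` while peer B has additionally seen B join and then leave, one exchange leaves both with identical histories and a current membership of {"A"}.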
4.8.2 External Discovery
The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For example, the administrator could contact node A to join A to the ring, then contact node B to join B to the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes. Because all nodes eventually reconcile their membership with a seed, logical partitions are highly unlikely. Seeds can be obtained either from static configuration or from a configuration service. Typically seeds are fully functional nodes in the Dynamo ring.
4.8.3 Failure Detection
Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A’s messages (even if B is responsive to node C's messages). In the presence of a steady rate of client requests generating inter-node communication in the Dynamo ring, a node A quickly discovers that a node B is unresponsive when B fails to respond to a message; Node A then uses alternate nodes to service requests that map to B's partitions; A periodically retries B to check for the latter's recovery. In the absence of client requests to drive traffic between two nodes, neither node really needs to know whether the other is reachable and responsive.
Decentralized failure detection protocols use a simple gossip-style protocol that enables each node in the system to learn about the arrival (or departure) of other nodes. For detailed information on decentralized failure detectors and the parameters affecting their accuracy, the interested reader is referred to [8]. Early designs of Dynamo used a decentralized failure detector to maintain a globally consistent view of failure state. Later it was determined that the explicit node join and leave methods obviate the need for a global view of failure state. This is because nodes are notified of permanent node additions and removals by the explicit node join and leave methods, and temporary node failures are detected by the individual nodes when they fail to communicate with others (while forwarding requests).
4.9 Adding/Removing Storage Nodes
When a new node (say X) is added into the system, it gets assigned a number of tokens that are randomly scattered on the ring. For every key range that is assigned to node X, there may be a number of nodes (less than or equal to N) that are currently in charge of handling keys that fall within its token range. Due to the allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and these nodes transfer those keys to X. Let us consider a simple bootstrapping scenario where node X is added to the ring shown in Figure 2 between A and B. When X is added to the system, it is in charge of storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer have to store the keys in these respective ranges. Therefore, nodes B, C, and D will offer to, and upon confirmation from X, transfer the appropriate set of keys. When a node is removed from the system, the reallocation of keys happens in a reverse process.
Operational experience has shown that this approach distributes the load of key distribution uniformly across the storage nodes, which is important to meet the latency requirements and to ensure fast bootstrapping. Finally, by adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive any duplicate transfers for a given key range.
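A sketch of how the ring and preference lists determine which nodes are affected when X joins (Python; the hash function, node names, and single-token-per-node simplification are illustrative):

```python
import bisect
import hashlib

def ring_position(name):
    """Map a node or key name onto the consistent-hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def preference_list(ring, key, n=3):
    """The first n distinct nodes encountered walking clockwise from
    the key's position on the ring."""
    positions = sorted(ring)
    idx = bisect.bisect_right(positions, ring_position(key))
    result = []
    while len(result) < min(n, len(set(ring.values()))):
        node = ring[positions[idx % len(positions)]]
        if node not in result:
            result.append(node)
        idx += 1
    return result

ring = {ring_position(n): n for n in ["A", "B", "C", "D", "E"]}
before = preference_list(ring, "some-key")
ring[ring_position("X")] = "X"            # node X joins the ring
after = preference_list(ring, "some-key")
# Any node responsible before but not after offers its copies of the
# affected keys to X and transfers them upon X's confirmation.
dropped = set(before) - set(after)
```

Recomputing `preference_list` per key range, rather than per key, is what lets nodes identify whole ranges to transfer during bootstrapping.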
5. Implementation
In Dynamo, each storage node has three main software components: request coordination, membership and failure detection, and a local persistence engine. All these components are implemented in Java.
Dynamo’s local persistence component allows for different storage engines to be plugged in. Engines that are in use are Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store. The main reason for designing a pluggable persistence component is to choose the storage engine best suited for an application’s access patterns. For instance, BDB can handle objects typically in the order of tens of kilobytes whereas MySQL can handle objects of larger sizes. Applications choose Dynamo’s local persistence engine based on their object size distribution. The majority of Dynamo’s production instances use BDB Transactional Data Store.
The request coordination component is built on top of an event-driven messaging substrate where the message processing pipeline is split into multiple stages similar to the SEDA architecture [24]. All communications are implemented using Java NIO channels. The coordinator executes the read and write requests on behalf of clients by collecting data from one or more nodes (in the case of reads) or storing data at one or more nodes (for writes). Each client request results in the creation of a state machine on the node that received the client request. The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client. Each state machine instance handles exactly one client request. For instance, a read operation implements the following state machine: (i) send read requests to the nodes, (ii) wait for minimum number of required responses, (iii) if too few replies were received within a given time bound, fail the request, (iv) otherwise gather all the data versions and determine the ones to be returned and (v) if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions. For the sake of brevity the failure handling and retry states are left out.
After the read response has been returned to the caller the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it.
As noted earlier, write requests are coordinated by one of the top N nodes in the preference list. Although it is desirable always to have the first node among the top N to coordinate the writes thereby serializing all writes at a single location, this approach has led to uneven load distribution resulting in SLA violations. This is because the request load is not uniformly distributed across objects. To counter this, any of the top N nodes in the preference list is allowed to coordinate the writes. In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency. It also reduces variability in the performance of the request handling which improves the performance at the 99.9 percentile.
6. Experiences & Lessons Learned
Dynamo is used by several services with different configurations. These instances differ by their version reconciliation logic, and read/write quorum characteristics. The following are the main patterns in which Dynamo is used:
Business logic specific reconciliation: This is a popular use case for Dynamo. Each data object is replicated across multiple nodes. In case of divergent versions, the client application performs its own reconciliation logic. The shopping cart service discussed earlier is a prime example of this category. Its business logic reconciles objects by merging different versions of a customer’s shopping cart.
Timestamp based reconciliation: This case differs from the previous one only in the reconciliation mechanism. In case of divergent versions, Dynamo performs simple timestamp based reconciliation logic of “last write wins”; i.e., the object with the largest physical timestamp value is chosen as the correct version. The service that maintains customer’s session information is a good example of a service that uses this mode.
High performance read engine: While Dynamo is built to be an “always writeable” data store, a few services are tuning its quorum characteristics and using it as a high performance read engine. Typically, these services have a high read request rate and only a small number of updates. In this configuration, typically R is set to be 1 and W to be N. For these services, Dynamo provides the ability to partition and replicate their data across multiple nodes thereby offering incremental scalability. Some of these instances function as the authoritative persistence cache for data stored in more heavy weight backing stores. Services that maintain product catalog and promotional items fit in this category.
The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability. For instance, the value of N determines the durability of each object. A typical value of N used by Dynamo’s users is 3.
The values of W and R impact object availability, durability and consistency. For instance, if W is set to 1, then the system will never reject a write request as long as there is at least one node in the system that can successfully process a write request. However, low values of W and R can increase the risk of inconsistency as write requests are deemed successful and returned to the clients even if they are not processed by a majority of the replicas. This also introduces a vulnerability window for durability when a write request is successfully returned to the client even though it has been persisted at only a small number of nodes.
Traditional wisdom holds that durability and availability go hand-in-hand. However, this is not necessarily true here. For instance, the vulnerability window for durability can be decreased by increasing W. This may increase the probability of rejecting requests (thereby decreasing availability) because more storage hosts need to be alive to process a write request.
The common (N,R,W) configuration used by several instances of Dynamo is (3,2,2). These values are chosen to meet the necessary levels of performance, durability, consistency, and availability SLAs.
All the measurements presented in this section were taken on a live system operating with a configuration of (3,2,2) and running a couple hundred nodes with homogenous hardware configurations. As mentioned earlier, each instance of Dynamo contains nodes that are located in multiple datacenters. These datacenters are typically connected through high speed network links. Recall that to generate a successful get (or put) response, R (or W) nodes need to respond to the coordinator. Clearly, the network latencies between datacenters affect the response time, and the nodes (and their datacenter locations) are chosen such that the applications’ target SLAs are met.
6.1 Balancing Performance and Durability
While Dynamo’s principle design goal is to build a highly available data store, performance is an equally important criterion in Amazon’s platform. As noted earlier, to provide a consistent customer experience, Amazon’s services set their performance targets at higher percentiles (such as the 99.9th or 99.99th percentiles). A typical SLA required of services that use Dynamo is that 99.9% of the read and write requests execute within 300ms.
Since Dynamo is run on standard commodity hardware components that have far less I/O throughput than high-end enterprise servers, providing consistently high performance for read and write operations is a non-trivial task. The involvement of multiple storage nodes in read and write operations makes it even more challenging, since the performance of these operations is limited by the slowest of the R or W replicas. Figure 4 shows the average and 99.9th percentile latencies of Dynamo’s read and write operations during a period of 30 days. As seen in the figure, the latencies exhibit a clear diurnal pattern which is a result of the diurnal pattern in the incoming request rate (i.e., there is a significant difference in request rate between the daytime and night). Moreover, the write latencies are higher than read latencies because write operations always result in disk access. Also, the 99.9th percentile latencies are around 200 ms and are an order of magnitude higher than the averages. This is because the 99.9th percentile latencies are affected by several factors such as variability in request load, object sizes, and locality patterns.
Figure 4: Average and 99.9 percentiles of latencies for read and write requests during our peak request season of December 2006. The intervals between consecutive ticks in the x-axis correspond to 12 hours. Latencies follow a diurnal pattern similar to the request rate and 99.9 percentile latencies are an order of magnitude higher than averages.
While this level of performance is acceptable for a number of services, a few customer-facing services required higher levels of performance. For these services, Dynamo provides the ability to trade-off durability guarantees for performance. In the optimization each storage node maintains an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically written to storage by a writer thread. In this scheme, read operations first check if the requested key is present in the buffer. If so, the object is read from the buffer instead of the storage engine.
This optimization has resulted in lowering the 99.9th percentile latency by a factor of 5 during peak traffic even for a very small buffer of a thousand objects (see Figure 5). Also, as seen in the figure, write buffering smoothes out higher percentile latencies. Obviously, this scheme trades durability for performance. In this scheme, a server crash can result in missing writes that were queued up in the buffer. To reduce the durability risk, the write operation is refined to have the coordinator choose one out of the N replicas to perform a “durable write”. Since the coordinator waits only for W responses, the performance of the write operation is not affected by the performance of the durable write operation performed by a single replica.
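A sketch of the buffered-write path (Python; the class and field names are illustrative, and the periodic writer thread is collapsed into an explicit flush):

```python
class BufferedStore:
    """Write-buffer sketch: puts land in an in-memory buffer and are
    flushed to the storage engine later; reads check the buffer first.
    One replica per write would still be asked for a durable write."""

    def __init__(self, capacity=1000):
        self.buffer = {}          # in-memory object buffer
        self.storage = {}         # stand-in for the persistent engine
        self.capacity = capacity

    def put(self, key, value, durable=False):
        if durable:
            self.storage[key] = value   # the chosen "durable write" replica
        else:
            self.buffer[key] = value
            if len(self.buffer) >= self.capacity:
                self.flush()

    def get(self, key):
        # Reads served from the buffer avoid the storage engine entirely
        return self.buffer.get(key, self.storage.get(key))

    def flush(self):              # done periodically by a writer thread
        self.storage.update(self.buffer)
        self.buffer.clear()
```

A crash before `flush` loses the buffered writes; designating one of the N replicas to take the `durable=True` path bounds that loss without putting the durable write on the latency path, since the coordinator waits only for W responses.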
Figure 5: Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours. The intervals between consecutive ticks in the x-axis correspond to one hour.
6.2 Ensuring Uniform Load distribution
Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution. A uniform key distribution can help us achieve uniform load distribution assuming the access distribution of keys is not highly skewed. In particular, Dynamo’s design assumes that even where there is a significant skew in the access distribution there are enough keys in the popular end of the distribution so that the load of handling popular keys can be spread across the nodes uniformly through partitioning. This section discusses the load imbalance seen in Dynamo and the impact of different partitioning strategies on load distribution.
To study the load imbalance and its correlation with request load, the total number of requests received by each node was measured for a period of 24 hours, broken down into intervals of 30 minutes. In a given time window, a node is considered to be “in-balance” if the node’s request load deviates from the average load by less than a certain threshold (here 15%). Otherwise the node was deemed “out-of-balance”. Figure 6 presents the fraction of nodes that are “out-of-balance” (henceforth, “imbalance ratio”) during this time period. For reference, the corresponding request load received by the entire system during this time period is also plotted. As seen in the figure, the imbalance ratio decreases with increasing load. For instance, during low loads the imbalance ratio is as high as 20% and during high loads it is close to 10%. Intuitively, this can be explained by the fact that under high loads, a large number of popular keys are accessed and due to uniform distribution of keys the load is evenly distributed. However, during low loads (where load is 1/8th of the measured peak load), fewer popular keys are accessed, resulting in a higher load imbalance.
Figure 6: Fraction of nodes that are out-of-balance (i.e., nodes whose request load is above a certain threshold from the average system load) and their corresponding request load. The interval between ticks in x-axis corresponds to a time period of 30 minutes.
This section discusses how Dynamo’s partitioning scheme has evolved over time and its implications on load distribution.
Strategy 1: T random tokens per node and partition by token value: This was the initial strategy deployed in production (and described in Section 4.2). In this scheme, each node is assigned T tokens (chosen uniformly at random from the hash space). The tokens of all nodes are ordered according to their values in the hash space. Every two consecutive tokens define a range. The last token and the first token form a range that "wraps" around from the highest value to the lowest value in the hash space. Because the tokens are chosen randomly, the ranges vary in size. As nodes join and leave the system, the token set changes and consequently the ranges change. Note that the space needed to maintain the membership at each node increases linearly with the number of nodes in the system.
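Strategy 1 can be sketched as follows (Python; T, the seed, and the node names are illustrative):

```python
import random

T = 4                               # tokens per node (illustrative)
HASH_SPACE = 2 ** 32

def assign_tokens(nodes, t=T, seed=42):
    """Each node draws t tokens uniformly at random from the hash space."""
    rng = random.Random(seed)
    return {n: [rng.randrange(HASH_SPACE) for _ in range(t)] for n in nodes}

def ranges(token_map):
    """Every two consecutive tokens define a range; the range ending at
    a token is owned by that token's node, and the first range wraps
    around from the last token."""
    tokens = sorted((tok, node) for node, toks in token_map.items()
                    for tok in toks)
    out = []
    for i, (tok, node) in enumerate(tokens):
        prev = tokens[i - 1][0]     # i == 0 wraps to the last token
        out.append(((prev, tok), node))
    return out

token_map = assign_tokens(["A", "B", "C"])
rs = ranges(token_map)
assert len(rs) == 3 * T             # one range per token
# Random token placement yields ranges of widely varying size:
sizes = [(hi - lo) % HASH_SPACE for (lo, hi), _ in rs]
assert min(sizes) != max(sizes)
```

Because the ranges are a direct function of the token set, any join or leave redraws them, which is exactly the coupling of partitioning and placement that the later strategies remove.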
While using this strategy, the following problems were encountered. First, when a new node joins the system, it needs to “steal” its key ranges from other nodes. However, the nodes handing the key ranges off to the new node have to scan their local persistence store to retrieve the appropriate set of data items. Note that performing such a scan operation on a production node is tricky, as scans are highly resource-intensive operations and they need to be executed in the background without affecting customer performance. This requires us to run the bootstrapping task at the lowest priority. However, this significantly slows the bootstrapping process, and during the busy shopping season, when the nodes are handling millions of requests a day, bootstrapping has taken almost a day to complete. Second, when a node joins/leaves the system, the key ranges handled by many nodes change and the Merkle trees for the new ranges need to be recalculated, which is a non-trivial operation to perform on a production system. Finally, there was no easy way to take a snapshot of the entire key space due to the randomness in key ranges, and this made the process of archival complicated. In this scheme, archiving the entire key space requires us to retrieve the keys from each node separately, which is highly inefficient.
The fundamental issue with this strategy is that the schemes for data partitioning and data placement are intertwined. For instance, in some cases, it is preferred to add more nodes to the system in order to handle an increase in request load. However, in this scenario, it is not possible to add nodes without affecting data partitioning. Ideally, it is desirable to use independent schemes for partitioning and placement. To this end, the following strategies were evaluated:
Strategy 2: T random tokens per node and equal sized partitions: In this strategy, the hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens. Q is usually set such that Q >> N and Q >> S*T, where S is the number of nodes in the system. In this strategy, the tokens are only used to build the function that maps values in the hash space to the ordered lists of nodes and not to decide the partitioning. A partition is placed on the first N unique nodes that are encountered while walking the consistent hashing ring clockwise from the end of the partition. Figure 7 illustrates this strategy for N=3. In this example, nodes A, B, C are encountered while walking the ring from the end of the partition that contains key k1. The primary advantages of this strategy are: (i) decoupling of partitioning and partition placement, and (ii) enabling the possibility of changing the placement scheme at runtime.
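The clockwise walk that builds a partition's preference list can be sketched as follows. The node and token values are made up, and `preference_list` is an illustrative helper rather than Dynamo's actual API; it collects the first N distinct nodes encountered after the end of the partition.

```python
import bisect

def preference_list(partition_end, ring, n_replicas=3):
    """Placement sketch for strategies 2 and 3: walk the consistent
    hashing ring clockwise from the end of a partition and collect
    the first N unique nodes. `ring` is a sorted list of
    (token, node) pairs; all names here are illustrative."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, partition_end)
    seen, out = set(), []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in seen:       # skip repeated tokens of the same node
            seen.add(node)
            out.append(node)
        if len(out) == n_replicas:
            break
    return out

ring = [(10, "A"), (25, "B"), (40, "C"), (55, "A"), (70, "B"), (85, "C")]
print(preference_list(30, ring))  # ['C', 'A', 'B']
print(preference_list(90, ring))  # ['A', 'B', 'C'] (wraps past the end)
```

Note that the tokens here decide only placement, not partition boundaries, which is the decoupling the strategy is designed to achieve.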
Figure 7: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A, B, and C form the preference list. Dark arrows indicate the token locations for various nodes.
Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to strategy 2, this strategy divides the hash space into Q equally sized partitions, and the placement of partitions is decoupled from the partitioning scheme. Moreover, each node is assigned Q/S tokens, where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved. Similarly, when a node joins the system it "steals" tokens from nodes in the system in a way that preserves these properties.
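One way to preserve the Q/S-tokens-per-node invariant on membership change can be sketched as below: when a node leaves, hand each of its partitions to whichever remaining node currently holds the fewest. The data structure and function name are illustrative; the paper does not prescribe this exact mechanism.

```python
import random

def rebalance_on_leave(assignment, leaving, rng=None):
    """Strategy 3 sketch: redistribute a departing node's partitions
    to the remaining nodes so every node keeps roughly Q/S tokens.
    `assignment` maps node -> list of partition ids (illustrative)."""
    rng = rng or random.Random(0)
    freed = assignment.pop(leaving)
    rng.shuffle(freed)  # random distribution, as described in the text
    for p in freed:
        # Always give the next partition to the least-loaded node.
        target = min(assignment, key=lambda n: len(assignment[n]))
        assignment[target].append(p)
    return assignment

a = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}
rebalance_on_leave(a, "C")
print(sorted(len(v) for v in a.values()))  # [3, 3]: equal shares preserved
```

A node join would run the same logic in reverse, "stealing" one partition at a time from the currently most-loaded nodes.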
The efficiency of these three strategies is evaluated for a system with S=30 and N=3. However, comparing these different strategies in a fair manner is hard as different strategies have different configurations to tune their efficiency. For instance, the load distribution property of strategy 1 depends on the number of tokens (i.e., T) while strategy 3 depends on the number of partitions (i.e., Q). One fair way to compare these strategies is to evaluate the skew in their load distribution while all strategies use the same amount of space to maintain their membership information. For instance, in strategy 1 each node needs to maintain the token positions of all the nodes in the ring and in strategy 3 each node needs to maintain the information regarding the partitions assigned to each node.
In our next experiment, these strategies were evaluated by varying the relevant parameters (T and Q). The load balancing efficiency of each strategy was measured for different sizes of membership information that needs to be maintained at each node, where load balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.
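The efficiency metric defined above is straightforward to compute from per-node request counts; a minimal sketch (function name is illustrative):

```python
def load_balancing_efficiency(requests_per_node):
    """Ratio of the mean request load to the hottest node's load,
    as defined in the text. 1.0 means perfectly balanced; lower
    values mean one node is absorbing a disproportionate share."""
    mean = sum(requests_per_node) / len(requests_per_node)
    return mean / max(requests_per_node)

print(load_balancing_efficiency([100, 100, 100]))  # 1.0
print(load_balancing_efficiency([50, 100, 150]))   # 100/150, about 0.667
```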
The results are given in Figure 8. As seen in the figure, strategy 3 achieves the best load balancing efficiency and strategy 2 has the worst. For a brief time, strategy 2 served as an interim setup during the process of migrating Dynamo instances from strategy 1 to strategy 3. Compared to strategy 1, strategy 3 achieves better efficiency and reduces the size of membership information maintained at each node by three orders of magnitude. While storage is not a major issue, the nodes gossip the membership information periodically, and as such it is desirable to keep this information as compact as possible. In addition, strategy 3 is advantageous and simpler to deploy for the following reasons: (i) Faster bootstrapping/recovery: Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding the random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery. (ii) Ease of archival: Periodic archiving of the dataset is a mandatory requirement for most Amazon storage services. Archiving the entire dataset stored by Dynamo is simpler in strategy 3 because the partition files can be archived separately. By contrast, in strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow. The disadvantage of strategy 3 is that changing the node membership requires coordination in order to preserve the properties required of the assignment.
Figure 8: Comparison of the load distribution efficiency of different strategies for system with 30 nodes and N=3 with equal amount of metadata maintained at each node. The values of the system size and number of replicas are based on the typical configuration deployed for majority of our services.
6.3 Divergent Versions: When and How Many?
As noted earlier, Dynamo is designed to trade off consistency for availability. To understand the precise impact of different failures on consistency, detailed data is required on multiple factors: outage length, type of failure, component reliability, workload, etc. Presenting these numbers in detail is outside the scope of this paper. However, this section discusses a good summary metric: the number of divergent versions seen by the application in a live production environment.
Divergent versions of a data item arise in two scenarios. The first is when the system is facing failure scenarios such as node failures, data center failures, and network partitions. The second is when the system is handling a large number of concurrent writers to a single data item and multiple nodes end up coordinating the updates concurrently. From both a usability and efficiency perspective, it is preferred to keep the number of divergent versions at any given time as low as possible. If the versions cannot be syntactically reconciled based on vector clocks alone, they have to be passed to the business logic for semantic reconciliation. Semantic reconciliation introduces additional load on services, so it is desirable to minimize the need for it.
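The syntactic-reconciliation step mentioned here can be sketched with vector clocks: a version is discarded only if some other version's clock dominates it, and whatever survives is a set of genuinely divergent versions that must be passed to the application. The clock encoding and helper names below are illustrative, not Dynamo's internal representation.

```python
def descends(a, b):
    """True if vector clock `a` dominates `b`, i.e. `a` causally
    follows (or equals) `b` on every node counter."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def syntactic_reconcile(versions):
    """Drop any version whose clock is strictly dominated by another;
    the survivors are the divergent versions requiring semantic
    reconciliation by the application. (Sketch only.)"""
    return [v for v, vc in versions
            if not any(descends(oc, vc) and oc != vc
                       for _, oc in versions)]

versions = [("v1", {"sx": 2}),            # subsumed by v2's clock
            ("v2", {"sx": 2, "sy": 1}),   # concurrent with v3
            ("v3", {"sx": 2, "sz": 1})]
print(syntactic_reconcile(versions))  # ['v2', 'v3']: divergent pair remains
```

In the common case only one version survives, which matches the 99.94% single-version figure reported below.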
In our next experiment, the number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.
Experience shows that increases in the number of divergent versions are caused not by failures but by increases in the number of concurrent writers. The increase in the number of concurrent writes is usually triggered by busy robots (automated client programs) and rarely by humans. This issue is not discussed in detail due to the sensitive nature of the story.
6.4 Client-driven or Server-driven Coordination
As mentioned in Section 5, Dynamo has a request coordination component that uses a state machine to handle incoming requests. Client requests are uniformly assigned to nodes in the ring by a load balancer. Any Dynamo node can act as a coordinator for a read request. Write requests on the other hand will be coordinated by a node in the key’s current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo’s versioning scheme is based on physical timestamps, any node can coordinate a write request.
An alternative approach to request coordination is to move the state machine to the client nodes. In this scheme client applications use a library to perform request coordination locally. A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key. Read requests can be coordinated at the client node, thereby avoiding the extra network hop that is incurred if the request were assigned to a random Dynamo node by the load balancer. Writes will either be forwarded to a node in the key’s preference list or can be coordinated locally if Dynamo is using timestamp-based versioning.
An important advantage of the client-driven coordination approach is that a load balancer is no longer required to uniformly distribute client load. Fair load distribution is implicitly guaranteed by the near uniform assignment of keys to the storage nodes. Obviously, the efficiency of this scheme depends on how fresh the membership information is at the client. Currently clients poll a random Dynamo node every 10 seconds for membership updates. A pull-based approach was chosen over a push-based one, as the former scales better with a large number of clients and requires very little state to be maintained at servers regarding clients. However, in the worst case the client can be exposed to stale membership for a duration of up to 10 seconds. If the client detects that its membership table is stale (for instance, when some members are unreachable), it immediately refreshes its membership information.
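The client-side caching and 10-second refresh described above can be sketched as follows. The class, its method names, and the toy ring-position function are all illustrative stand-ins; the real client library and hash-ring lookup are not specified in the text.

```python
import time

class ClientCoordinator:
    """Client-driven coordination sketch: cache a membership view,
    re-pull it from a random node every `refresh_s` seconds, and
    compute a key's preference list locally. All names here are
    illustrative, not Dynamo's actual client-library API."""
    def __init__(self, fetch_membership, refresh_s=10.0, clock=time.monotonic):
        self._fetch = fetch_membership
        self._refresh_s = refresh_s
        self._clock = clock
        self._view = fetch_membership()
        self._fetched_at = clock()

    def preference_list(self, key, n=3):
        # Refresh once the cached view is older than the poll interval.
        if self._clock() - self._fetched_at >= self._refresh_s:
            self._view = self._fetch()
            self._fetched_at = self._clock()
        start = sum(key.encode()) % len(self._view)  # toy ring position
        return [self._view[(start + i) % len(self._view)]
                for i in range(min(n, len(self._view)))]

# A simulated clock shows the staleness window in action:
t = [0.0]
views = [["A", "B", "C"], ["A", "B", "D"]]
c = ClientCoordinator(lambda: views.pop(0), clock=lambda: t[0])
c.preference_list("cart:1")         # served from the cached view
t[0] = 11.0                         # more than 10 s have passed
print(c.preference_list("cart:1"))  # refreshed view now includes D
```

The request then goes straight to a preference-list node, skipping both the load balancer and the extra forwarding hop, which is the source of the latency gains in Table 2.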
Table 2 shows the latency improvements at the 99.9th percentile and averages that were observed for a period of 24 hours using client-driven coordination compared to the server-driven approach. As seen in the table, the client-driven coordination approach reduces the latencies by at least 30 milliseconds for 99.9th percentile latencies and decreases the average by 3 to 4 milliseconds. The latency improvement is because the client-driven approach eliminates the overhead of the load balancer and the extra network hop that may be incurred when a request is assigned to a random node. As seen in the table, average latencies tend to be significantly lower than latencies at the 99.9th percentile. This is because Dynamo’s storage engine caches and write buffer have good hit ratios. Moreover, since the load balancers and network introduce additional variability to the response time, the gain in response time is higher for the 99.9th percentile than the average.
Table 2: Performance of client-driven and server-driven coordination approaches.

                 99.9th pct read (ms)   99.9th pct write (ms)   Avg read (ms)   Avg write (ms)
Server-driven    68.9                   68.5                    3.9             4.02
Client-driven    30.4                   30.4                    1.55            1.9
6.5 Balancing background vs. foreground tasks
Each node performs different kinds of background tasks for replica synchronization and data handoff (either due to hinting or adding/removing nodes) in addition to its normal foreground put/get operations. In early production settings, these background tasks triggered the problem of resource contention and affected the performance of the regular put and get operations. Hence, it became necessary to ensure that background tasks run only when the regular critical operations are not significantly affected. To this end, the background tasks were integrated with an admission control mechanism. Each of the background tasks uses this controller to reserve runtime slices of the resource (e.g. database), shared across all background tasks. A feedback mechanism based on the monitored performance of the foreground tasks is employed to change the number of slices that are available to the background tasks.
The admission controller constantly monitors the behavior of resource accesses while executing a "foreground" put/get operation. Monitored aspects include latencies for disk operations, failed database accesses due to lock-contention and transaction timeouts, and request queue wait times. This information is used to check whether the percentiles of latencies (or failures) in a given trailing time window are close to a desired threshold. For example, the background controller checks to see how close the 99th percentile database read latency (over the last 60 seconds) is to a preset threshold (say 50ms). The controller uses such comparisons to assess the resource availability for the foreground operations. Subsequently, it decides on how many time slices will be available to background tasks, thereby using the feedback loop to limit the intrusiveness of the background activities. Note that a similar problem of managing background tasks has been studied in [4].
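The feedback loop described here can be reduced to a small decision rule: shrink the background task's slice budget when foreground p99 latency approaches the threshold, and grow it when there is headroom. The 50 ms threshold matches the example in the text; the slice counts, hysteresis band, and function name are illustrative assumptions.

```python
def adjust_slices(p99_latency_ms, slices, threshold_ms=50.0,
                  min_slices=0, max_slices=10):
    """Admission-controller feedback sketch (illustrative values):
    compare the trailing p99 foreground latency to a preset
    threshold and adjust the background task's resource slices."""
    if p99_latency_ms >= threshold_ms:
        return max(min_slices, slices - 1)   # back off background work
    if p99_latency_ms < 0.8 * threshold_ms:
        return min(max_slices, slices + 1)   # spare capacity: allow more
    return slices                            # near the edge: hold steady

s = 5
s = adjust_slices(62.0, s)  # over threshold, shrink to 4
s = adjust_slices(20.0, s)  # well under, grow back to 5
print(s)  # 5
```

Running this rule once per monitoring window (e.g. every 60 seconds over the trailing latency percentiles) limits how intrusive the background activity can become.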
6.6 Discussion
This section summarizes some of the experiences gained during the process of implementing and maintaining Dynamo. Many Amazon internal services have used Dynamo for the past two years and it has provided significant levels of availability to its applications. In particular, applications have received successful responses (without timing out) for 99.9995% of their requests, and no data loss event has occurred to date.
Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, in the form of the three parameters (N, R, W), that applications can use to tune their instance to their needs. Unlike popular commercial data stores, Dynamo exposes data consistency and reconciliation logic issues to the developers. At the outset, one may expect the application logic to become more complex. However, historically, Amazon’s platform is built for high availability and many applications are designed to handle different failure modes and inconsistencies that may arise. Hence, porting such applications to use Dynamo was a relatively simple task. For new applications that want to use Dynamo, some analysis is required during the initial stages of development to pick the right conflict resolution mechanisms that meet the business case appropriately. Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains a couple of hundred nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead of maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems (e.g., [14]).
7. Conclusions
This paper described Dynamo, a highly available and scalable data store, used for storing state of a number of core services of Amazon.com’s e-commerce platform. Dynamo has provided the desired levels of availability and performance and has been successful in handling server failures, data center failures and network partitions. Dynamo is incrementally scalable and allows service owners to scale up and down based on their current request load. Dynamo allows service owners to customize their storage system to meet their desired performance, durability and consistency SLAs by allowing them to tune the parameters N, R, and W.
The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system. Its success in one of the most challenging application environments shows that an eventually-consistent storage system can be a building block for highly-available applications.
ACKNOWLEDGEMENTS
The authors would like to thank Pat Helland for his contribution to the initial design of Dynamo. We would also like to thank Marvin Theimer and Robert van Renesse for their comments. Finally, we would like to thank our shepherd, Jeff Mogul, for his detailed comments and inputs while preparing the camera ready version that vastly improved the quality of the paper.
References
[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken, R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and Wattenhofer, R. P. 2002. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.
[2] Bernstein, P.A., and Goodman, N. An algorithm for concurrency control and recovery in replicated distributed databases. ACM Trans. on Database Systems, 9(4):596-615, December 1984
[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (Seattle, WA, November 06 - 08, 2006). USENIX Association, Berkeley, CA, 15-15.
[4] Douceur, J. R. and Bolosky, W. J. 2000. Process-based regulation of low-importance processes.SIGOPS Oper. Syst. Rev. 34, 2 (Apr. 2000), 26-27.
[5] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. 1997. Cluster-based scalable network services. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (Saint Malo, France, October 05 - 08, 1997). W. M. Waite, Ed. SOSP '97. ACM Press, New York, NY, 78-91.
[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM Press, New York, NY, 29-43.
[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY, 173-182.
[8] Gupta, I., Chandra, T. D., and Goldszmidt, G. S. 2001. On scalable and efficient distributed failure detectors. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing (Newport, Rhode Island, United States). PODC '01. ACM Press, New York, NY, 170-179.
[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage. SIGARCH Comput. Archit. News 28, 5 (Dec. 2000), 190-201.
[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on theory of Computing (El Paso, Texas, United States, May 04 - 06, 1997). STOC '97. ACM Press, New York, NY, 654-663.
[11] Lindsay, B.G., et. al., “Notes on Distributed Databases”, Research Report RJ2571(33471), IBM Research, July 1979
[12] Lamport, L. Time, clocks, and the ordering of events in a distributed system. ACM Communications, 21(7), pp. 558-565, 1978.
[13] Merkle, R. A digital signature based on a conventional encryption function. Proceedings of CRYPTO, pages 369–378. Springer-Verlag, 1988.
[14] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, San Francisco, CA, March 29 - 31, 2004.
[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and Popek, G. 1994. Resolving file conflicts in the Ficus file system. In Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1 (Boston, Massachusetts, June 06 - 10, 1994). USENIX Association, Berkeley, CA, 12-12.
[16] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems.Proceedings of Middleware, pages 329-350, November, 2001.
[17] Rowstron, A., and Druschel, P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. Proceedings of Symposium on Operating Systems Principles, October 2001.
[18] Saito, Y., Frølund, S., Veitch, A., Merchant, A., and Spence, S. 2004. FAB: building distributed enterprise disk arrays from commodity components. SIGOPS Oper. Syst. Rev. 38, 5 (Dec. 2004), 48-58.
[19] Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A Resilient Distributed File System. IEEE Workshop on Workstation Operating Systems, Nov. 1987.
[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. 2001. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols For Computer Communications (San Diego, California, United States). SIGCOMM '01. ACM Press, New York, NY, 149-160.
[21] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (Copper Mountain, Colorado, United States, December 03 - 06, 1995). M. B. Jones, Ed. SOSP '95. ACM Press, New York, NY, 172-182.
[22] Thomas, R. H. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems 4 (2): 180-209, 1979.
[23] Weatherspoon, H., Eaton, P., Chun, B., and Kubiatowicz, J. 2007. Antiquity: exploiting a secure log for wide-area distributed storage. SIGOPS Oper. Syst. Rev. 41, 3 (Jun. 2007), 371-384.
[24] Welsh, M., Culler, D., and Brewer, E. 2001. SEDA: an architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (Banff, Alberta, Canada, October 21 - 24, 2001). SOSP '01. ACM Press, New York, NY, 230-243.
Nice article by @werner explaining the data consistency models to keep in mind when working with Big Data.
Eventually Consistent - Revisited
By Werner Vogels on 22 December 2008
I wrote a first version of this posting on consistency models about a year ago, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article. This is that new version.
Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.
Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.
One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog in December 2007 and was greatly improved with the help of its readers.
Historical Perspective
In an ideal world there would be only one consistency model: when an update is made all observers would see that update. The first time this surfaced as difficult to achieve was in the database systems of the late '70s. The best "period piece" on this topic is "Notes on Distributed Databases" by Bruce Lindsay et al. [5] It lays out the fundamental principles for database replication and discusses a number of techniques that deal with achieving consistency. Many of these techniques try to achieve distribution transparency—that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency. [2]
In the mid-'90s, with the rise of larger Internet systems, these practices were revisited. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address to the PODC (Principles of Distributed Computing) conference in 2000. [1] He presented the CAP theorem, which states that of three properties of shared-data systems—data consistency, system availability, and tolerance to network partition—only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch. [4]
A system that is not tolerant to network partitions can achieve data consistency and availability, and often does so by using transaction protocols. To make this work, client and storage systems must be part of the same environment; they fail as a whole under certain scenarios, and as such, clients cannot observe partitions. An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means that there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.
Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If this write fails because of system unavailability, then the developer will have to deal with what to do with the data to be written. If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolute latest update all the time. There is a range of applications that can handle slightly stale data, and they are served well under this model.
In principle the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency relates to the guarantee that when a transaction is finished the database is in a consistent state; for example, when transferring money from one account to another the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is often the responsibility of the developer writing the transaction but can be assisted by the database managing integrity constraints.
Consistency—Client and Server
There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second way is from the server side: how updates flow through the system and what guarantees systems can give with respect to updates.
Client-side Consistency
The client side has these components:
A storage system. For the moment we'll treat it as a black box, but one should assume that under the covers it is something of large scale and highly distributed, and that it is built to guarantee durability and availability.
Process A. This is a process that writes to and reads from the storage system.
Processes B and C. These two processes are independent of process A and write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.

Client-side consistency has to do with how and when observers (in this case the processes A, B, or C) see updates made to a data object in the storage systems. In the following examples illustrating the different types of consistency, process A has made an update to a data object:
Strong consistency. After the update completes, any subsequent access (by A, B, or C) will return the updated value.
Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
The eventual consistency model has a number of variations that are important to consider:
Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.
Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.
Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.
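As a sketch of how a replica might enforce the last of these variations, monotonic write consistency, the following illustrative Python class (the names are invented, not from any real system) applies each process's writes strictly in per-process sequence order, buffering any write that arrives ahead of its predecessors:

```python
class MonotonicWriteReplica:
    """Sketch: serialize the writes of each individual process by
    applying them in per-process sequence order, holding back any
    write that arrives before its predecessors."""

    def __init__(self):
        self.applied_seq = {}  # process id -> last applied sequence number
        self.pending = {}      # (pid, seq) -> buffered value
        self.value = None      # current value of the (single) data item

    def receive(self, pid, seq, value):
        self.pending[(pid, seq)] = value
        # Apply as many in-order writes from this process as possible.
        next_seq = self.applied_seq.get(pid, 0) + 1
        while (pid, next_seq) in self.pending:
            self.value = self.pending.pop((pid, next_seq))
            self.applied_seq[pid] = next_seq
            next_seq += 1

r = MonotonicWriteReplica()
r.receive("A", 2, "second")  # arrives early: buffered, not applied
r.receive("A", 1, "first")   # unblocks both; applied in order 1, then 2
```

After both messages arrive, the replica has applied process A's writes in order and holds "second", even though the network delivered the writes reversed.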
A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventual consistency system, but not always required. These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability.
As you can see from these variations, quite a few different scenarios are possible. It depends on the particular applications whether or not one can deal with the consequences.
Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction. In asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support more scalable read performance, RDBMSs have started to provide the ability to read from the backup, which is a classical case of providing eventual consistency guarantees in which the inconsistency window depends on the periodicity of the log shipping.
Server-side Consistency
On the server side we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Let's establish a few definitions before getting started:
N = the number of nodes that store replicas of the data
W = the number of replicas that need to acknowledge the receipt of the update before the update completes
R = the number of replicas that are contacted when a data object is accessed through a read operation
If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N=2, W=2, and R=1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N=2, W=1, and R=1. In this case R+W=N, and consistency cannot be guaranteed.
The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only two nodes available, the system will have to fail the write.
In distributed-storage systems that need to provide high performance and high availability, the number of replicas is in general higher than two. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 such that a single read will return a result. Systems that are concerned with consistency are set to W=N for updates, which may decrease the probability of the write succeeding. A common configuration for these systems that are concerned about fault tolerance but not consistency is to run with W=1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W, and R depends on what the common case is and which performance path needs to be optimized. With R=1 and W=N we optimize for the read case, and with W=1 and R=N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2, there is the possibility of conflicting writes when the write sets do not overlap.
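The quorum arithmetic above can be written down directly. The helper names below are illustrative, not from any real system:

```python
def read_write_sets_overlap(n, w, r):
    """Strong consistency requires W + R > N: every read quorum then
    intersects every write quorum in at least one up-to-date replica."""
    return w + r > n

# Synchronous primary-backup RDBMS: N=2, W=2, R=1 -> either replica is safe to read.
assert read_write_sets_overlap(2, 2, 1)
# Asynchronous replication with reads from the backup: N=2, W=1, R=1 -> W+R=N, not strong.
assert not read_write_sets_overlap(2, 1, 1)
# Read-optimized (R=1, W=N) and write-optimized (W=1, R=N) both keep the overlap.
assert read_write_sets_overlap(10, 10, 1)
assert read_write_sets_overlap(10, 1, 10)

def conflicting_writes_possible(n, w):
    """With W < (N+1)/2, two write quorums can fail to overlap,
    so concurrent writes may conflict."""
    return w < (n + 1) / 2

assert conflicting_writes_possible(3, 1)       # W=1 of N=3: conflicts possible
assert not conflicting_writes_possible(3, 2)   # W=2 of N=3: majority, no conflicts
```

The fault-tolerance-oriented N=3, W=2, R=2 configuration mentioned above passes the overlap check (2+2 > 3) while still tolerating the loss of one node on either path.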
Weak/eventual consistency arises when W+R <= N, meaning that there is a possibility that the read and write set will not overlap. If this is a deliberate configuration and not based on a failure case, then it hardly makes sense to set R to anything but 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most of these systems where the write set is smaller than the replica set, a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica's set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N, then the system is vulnerable to reading from nodes that have not yet received the updates.
Whether or not read-your-writes, session, and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.
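A minimal sketch of that client-side technique, assuming the store returns (version, value) pairs (the class, parameter, and error names here are hypothetical):

```python
class VersionAwareClient:
    """Sketch of client-side read-your-writes / monotonic reads: the
    client remembers the highest version it has written or observed
    and discards any read that returns an older version."""

    def __init__(self, store_read, store_write):
        self._read = store_read    # callable: () -> (version, value)
        self._write = store_write  # callable: (version, value) -> None
        self.last_seen = 0         # highest version written or read so far

    def write(self, value):
        self.last_seen += 1
        self._write(self.last_seen, value)

    def read(self):
        version, value = self._read()
        if version < self.last_seen:
            # Stale replica inside the inconsistency window: discard
            # the result rather than violate monotonic reads.
            raise RuntimeError("stale read; retry against another replica")
        self.last_seen = version
        return value

# Usage against a single in-memory "replica":
state = {"latest": (0, None)}
client = VersionAwareClient(
    store_read=lambda: state["latest"],
    store_write=lambda ver, val: state.update(latest=(ver, val)),
)
client.write("profile-v1")
value = client.read()  # sees its own write
```

Note that this only filters stale reads; the client still has to decide how to retry (another replica, or the same one after a delay), which is why server-side stickiness is the simpler solution when it is available.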
Partitions happen when some nodes in the system cannot reach other nodes, but both sets are reachable by groups of clients. If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen frequently, but they do occur between data centers, as well as inside data centers.
In some applications the unavailability of any of the partitions is unacceptable, and it is important that the clients that can reach that partition make progress. In that case both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. For example, within Amazon the shopping cart uses such a write-always system; in the case of partition, a customer can continue to put items in the cart even if the original cart lives on the other partitions. The cart application assists the storage system with merging the carts once the partition has healed.
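A toy merge along those lines (Amazon's actual cart-merge logic is not described here, so this union-style merge is purely illustrative):

```python
def merge_carts(cart_a, cart_b):
    """Illustrative merge run when a partition heals: take the union
    of items from both sides and, for items present on both, keep the
    larger quantity, so no customer addition is lost."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Each side of the partition kept accepting writes for the same customer:
side_a = {"book": 1, "lamp": 1}
side_b = {"book": 2, "mug": 1}
healed = merge_carts(side_a, side_b)  # {"book": 2, "lamp": 1, "mug": 1}
```

The key design property is that the merge is biased toward preserving additions: it may occasionally resurrect a deleted item, but it never silently drops something the customer put in the cart, which is the trade-off a write-always shopping cart wants.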
Amazon's Dynamo
A system that has brought all of these properties under explicit control of the application architecture is Amazon's Dynamo, a key-value storage system that is used internally in many services that make up the Amazon e-commerce platform, as well as Amazon's Web Services. One of the design goals of Dynamo is to allow the application service owner who creates an instance of the Dynamo storage system—which commonly spans multiple data centers—to make the trade-offs between consistency, durability, availability, and performance at a certain cost point [3].
Summary
Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer needs to be aware that consistency guarantees are provided by the storage systems and need to be taken into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session-level consistency and monotonic reads, which provide better tools for the developer. Many times the application is capable of handling the eventual consistency guarantees of the storage system without any problem. A specific popular case is a Web site in which we can have the notion of user-perceived consistency. In this scenario the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows for updates to propagate through the system before the next read is expected.
The goal of this article is to raise awareness about the complexity of engineering systems that need to operate at a global scale and that require careful tuning to ensure that they can deliver the durability, availability, and performance that their applications require. One of the tools the system designer has is the length of the consistency window, during which the clients of the systems are possibly exposed to the realities of large-scale systems engineering.
References
Brewer, E. A. 2000. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16-19, Portland, Oregon): 7
A Conversation with Bruce Lindsay. 2004. ACM Queue 2(8): 22-33.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (Stevenson, Washington, October).
Gilbert, S., Lynch, N. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News 33(2).
Lindsay, B. G., Selinger, P. G., et al. 1980. Notes on distributed databases. In Distributed Data Bases, ed. I. W. Draffan and F. Poole, 247-284. Cambridge: Cambridge University Press. Also available as IBM Research Report RJ2517, San Jose, California (July 1979).