programming is terrible
lessons learned from a life wasted
programmingisterrible · 3 years ago
Questions I have been asked about photography
... and some answers I have given which may or may not help you.
I haven't been coding much, instead I've been enjoying nature photography, and sharing it on twitter. It's why I haven't been blogging here as much as I used to.
As a result, I find myself being asked about photography far more than I am about coding, and so much like with coding, I decided to write up my answers into a longer essay.
In some ways, it's a return to form. I didn't start this blog to talk about computers, I started this blog to capture the discussions I was having, to avoid repeating myself. It just so happens I talk a lot more about cameras than computers now.
I know this isn't about programming, but I do hope that some programmers may find this useful.
Anyway:
What's the best way to learn photography?
Practice.
I'm sorry, but it's true. You need to abandon the idea of taking perfect photographs and give in to the numbers game.
Now, there's some photographers who like to spend three hours measuring light before taking a picture, there's other photographers who like to spend three hours processing negatives before making a print, and there's always someone who enjoys using a computer and post-processing the fuck out of a picture. That's fine and good, but it's not the best way to start out.
Yes, these are all useful skills to have, but at the end of the day, good photography often comes down to luck, as much as it comes down to preparation, experience, and equipment. If you get one good shot out of every ten, then you'll get a lot more good shots if you take a couple of thousand pictures.
So grab a camera and just take a bunch of shitty photos. If you don't have fun sucking at it, you'll never invest the time to be good at it.
What's the best camera for a beginner?
The one you're carrying.
Sometimes this means buying a camera you'll want to take with you. A camera you enjoy using is always going to be better than a camera you leave at home, no matter the specs. This is how I and many others have justified a more expensive piece of equipment, but if it makes you happy, and you can afford it, why not?
On the other hand, mobile phones are incredibly good point and shoot cameras, and will continue to get better. I have a lot of fancy digital and film cameras but I still use my phone for those everyday moments.
A phone camera is a great camera, and anyone who tells you it's not "real photography" is one of those people who bought a camera to take photos of, rather than photos with. A collectable over a tool. Dorks.
Will a better camera improve my photos?
Theoretically, yes. In practice, not significantly. I've seen incredible photos from disposable cameras, and marvels from phone cameras. I've also seen some of the most dull pieces of shit out of the most expensive gear.
A good photo has good lighting first and foremost. A good photo has good framing too. A good photo often has some emotion, expression, or action, and sometimes a good photo was taken on a good camera.
There are exceptions. If you want to do astrophotography, you'll want a camera that works in low light. If you do nature photography, you'll want a camera with snappy autofocus. If you do portrait work, you may end up picking the lens you want and then settling on a camera that fits.
Even so, you'll still need to get the hang of lighting, framing, and expression before you need to worry about equipment.
What about video?
Ask a video person.
All I know is that most stills cameras overheat very quickly if you're shooting video, that, and things like managing focus breathing, or autofocus tracking speed become very important very quickly.
In other words: Although you can shoot video on a stills camera, they vary greatly in how good they are at it.
I do have one piece of advice for those of you using a fancy camera as a webcam:
If you're annoyed by your glasses reflecting light in your webcam (or showing everyone on zoom that you've switched tabs), see if you can find a circular polarizing filter. They rotate in place and eventually you'll find the right adjustment to hide reflections.
So... what camera should I buy?
It depends. [Audience sighs]
You buy one that makes you happy, and I can't begin to guess what that might be.
It is worth noting that price, weight, and size usually factor into that decision, but there are a lot of different cameras out there, and frankly, there is no "magical best camera", film or digital. Some cameras are better at video, some are better at stills. Some cameras are great at low light, some are tiny little marvels, and some are huge gigantic slabs of metal that have broken out in a pox of knobs, dials, and switches.
The other thing worth noting is that with some cameras, you're buying into a system—so you should factor the cost of lenses into your calculation. Even if the body is cheap, it doesn't really matter if all the lenses you want are out of your price range.
If you're looking for video, or lightweight, micro four thirds can be a good choice. M43 is especially good for nature photography on a budget. If you're wanting vast amounts of third party lenses, and great autofocus, Sony might be for you. If you've already got a bunch of Canon or Nikon lenses kicking around, then the choice will already be made for you. On the other hand, if you're wanting a good all-round camera that's not too big, not too expensive, and gets you excellent JPEGs out of the box, Fujifilm might be just right.
It really does depend.
Ok, ok, what about the lens?
Lenses come in two basic types. Zoom and Prime lenses. A zoom lets you, well, um, zoom in and out, and change the framing without having to move around, and a prime, well, you gotta move your feet if you don't like what you're seeing.
Primes are usually better at low light, usually lighter than a zoom, and often are better at autofocus and overall image quality. Some people prefer using prime lenses, and some people prefer the convenience of having one or two lenses to cover a whole range of potential shots. For example, nature people tend to like zooms, street and portrait people tend to like primes, but there are no hard and fast rules about which lenses you have to use.
Usually the more expensive the lens, the sharper it is, and the better at low light it is, but sometimes you will just be paying for a brand name. Lenses by the same manufacturer as the camera usually work a lot better than third party ones, but not always. Once again, it depends.
All lenses, zoom or prime, have a focal length (or a range of lengths) which gives you an idea of the angle of view you'll get. Unfortunately, once again, it depends. The actual angle of view you get from a lens depends on how big the sensor is on a camera.
This is why I'll be talking about "35mm equivalent" or "full frame equivalent" focal lengths, so if you're thinking about a micro four thirds camera or APS-C camera (smaller sensor), the numbers might look a little different.
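If you want to do the sums yourself, it's just multiplication by the crop factor. Here's a rough sketch in Python, using the usual approximate crop factors (2 for micro four thirds, roughly 1.5 for APS-C), which are rules of thumb, not gospel:

CROP_FACTORS = {"full_frame": 1.0, "aps_c": 1.5, "m43": 2.0}

def equivalent_focal_length(focal_length_mm: float, sensor: str) -> float:
    # the angle of view you'd get from this focal length on a full frame camera
    return focal_length_mm * CROP_FACTORS[sensor]

print(equivalent_focal_length(25, "m43"))    # 50.0, a "normal" lens on micro four thirds
print(equivalent_focal_length(35, "aps_c"))  # 52.5, roughly "normal" on APS-C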
Anyway:
A lens below 20mm is a "super wide" lens, and they're great for night sky, landscape or architecture shots. A 14mm lens makes a tiny cupboard shaped room feel big and breezy, and they can be great for crowd shots at events too.
A 28mm lens is what your phone already has. It's great for selfies. If you're doing video, you probably want something between 20mm and 28mm, or stand quite far back from the camera.
A 28-75mm kit zoom is the sort of lens you get with your camera, use once or twice, and then replace with a lens you enjoy using, once you've worked out which focal lengths you tend to shoot at. You won't regret buying one, and they're often very useful for event photography.
A 35mm, 40mm, 50mm (full frame equivalent) lens is a solid investment. Most camera makers have a cheap 50mm and you really won't regret buying one. A so called "normal lens" is good for landscapes, portraits, and casual photography.
A lens between 65mm and 135mm is really good for portraits, but they're often much bulkier and more expensive than shorter lenses. You can get zoom lenses that cover 70-200mm and they're usually quite good for event photography.
Any lens longer than 200mm is for nature photography. You probably want a zoom lens like 100-400mm or 200-600mm if you're serious about taking pictures of birds. The longer the lens, the heavier the lens, and eventually you'll end up lugging around a tripod just to keep the damn thing stable.
It's a minor note, but technically these big lenses are called "long lenses". A telephoto lens is one where the focal length is longer than the actual lens itself. You also can get a variety of special purpose lenses, like fisheyes, soft-focus lenses, or even things like "smooth trans focus" lenses. You probably don't need to worry about any of this, but it is worth mentioning.
Anyway: If you don't know, buy the kit lens (a 28-75mm), and a 'nifty fifty' 50mm. You'll work out the rest as you go along. We'll get to what those f-numbers on lenses mean in a little moment, but if you're impatient, scroll down, I'm not a document cop.
Do I shoot in Auto? JPEG? Raw?
Auto is fine. JPEG is fine. Raw is often overkill, despite all the posturing men online saying otherwise. You have my permission to ignore any and all advice about photography online, including the advice here.
There is one thing worth mentioning though: White Balance.
Digital cameras don't always know how warm or cold to make a photograph look. If you've ever tried taking a picture at sunset and wondered why everything turned out blue, then white balance is the culprit.
The problem with JPEG and Auto is that it doesn't always get things right, and in some cases (weird lighting), it can get things disastrously wrong, and it can be very hard to edit a JPEG to fix these issues. This is why people who do weddings and events regularly shoot in RAW: they have no second chances to take a photograph, and RAW is the only way to ensure you can fix whatever went wrong.
On the other hand, if you're not at a wedding, you can just take another shot, tweak the settings, and try again. You can just cheat and avoid white balance issues by shooting in black and white. If you have a JPEG that's too warm or too cold, changing it to monochrome often leaves you with a better picture.
Still, it's nice to get it right the first time around.
You can set white balance from presets (Sunny, Cloudy...) as well as setting the direct color temperature in Kelvin. You can also use a test card, or a special lens filter to configure the color temperature from a reference, if you really want to get things right.
Leaving things on AUTO will work most of the time, but it is useful to know how to adjust it when things go wrong.
The other big problem with AUTO is when someone is backlit, but we'll get on to how to fix that in a moment.
I want to shoot manually.
You probably don't.
Light meters have been around for almost as long as photography, and professional photographers use light metering to ensure the shot comes out perfectly. Your camera is just saving you some time and effort by putting in the settings the light meter recommends.
Then again, sometimes you do want to shoot manually. Sort-of.
First, we'll have to get some understanding of what the settings do. ISO, Shutter Speed, and Aperture control how the camera handles light, and what tradeoffs it makes when there isn't enough light to go around.
ISO is the easiest one. A lower ISO (like 50) gives a darker image, but with less noise. A higher ISO (like 3200) gives a brighter image, but far more noise as a result. If you're using a film camera, you can pretend I said "grain" instead of noise, but it's pretty much the same thing.
This is why all punk rock gig photos look like a grainy mess. There's just not enough light to take a clear picture.
Shutter speed is also pretty easy to understand. If you have a slower shutter speed (like 1/30s), it lets more light into the camera, but you get more motion blur as a result. A faster shutter speed (1/2000) gives a darker, but sharper image. This is why ye-olde cameras needed people to sit still for several minutes, to gather enough light to get a decent photograph.
Aperture is a little more difficult to understand, because the scale is weird and the terminology is confusing. The aperture is the little iris inside the lens that opens up and closes down, to let more or less light inside. It's measured in f-numbers, which go on a scale of root-2 (2, 2.8, 4, 5.6, 8, 11, 16), but a low f-number means a large opening, and a high f-number means a small opening.
Which is why I tend to talk about an "open" or "wider" aperture (a small f-number, and a big opening) and a "closed" aperture (a big f-number, and a small opening), because I get confused when people say "a larger aperture" and I'm never sure if they mean "a larger opening" or "a larger f-number" (and thus a smaller opening).
Anyway.
An open aperture lets more light in, but gives you a smaller depth of field. Only a small section of the image will be in crisp focus, and the background will be blurry. A closed aperture lets less light in, but you get a much wider depth of field, and if you close it down enough, you can get almost everything in focus.
I find this a little counter intuitive because you let in less light to see more of the image, but physics isn't my strong suit.
To recap:
ISO trades light for noise. A higher ISO is a brighter image, but has more noise.
Shutter trades light for motion blur. A slower shutter has a brighter image, but more motion blur.
Aperture trades light for background blur. A wider aperture (low f-number) has a brighter picture, but only some things will be in focus.
Sometimes we use fast or slow instead, but it's a little confusing. A faster ISO lets more light in, and a faster lens lets more light in, but a faster shutter lets less light in. I'm sorry about that.
Any change to one of these must be reflected in the others. You let more light in with the aperture? You need to use a faster shutter or lower ISO to compensate. You use a higher ISO? You need a faster shutter or to close up the aperture. It doesn't help that each one is in a different scale. ISO goes 50, 100, 200. Shutters go 1/500, 1/250, 1/125. Aperture goes f/8, f/5.6, f/4, f/2.8.
Your camera knows all of this and can help you. This is why you don't want to shoot manually.
If you want to shoot on automatic, but control the motion blur you can use shutter-priority, and your camera changes the other settings to match. If you want to control the depth of field, you shoot on aperture priority, and once again your camera will pick the right settings. If for some reason you want to control aperture and shutter, but not ISO, you can do that too but it's often a little more involved.
There's also a third mode: Programmed Auto. It's like full auto but when you spin the dial, it lets you pick which shutter and aperture combination you want. Spin it one way and get a thin depth of field, spin it the other and avoid motion blur.
In summary: Auto is great, but Aperture Priority and Shutter Priority are great too.
If you are wanting to learn to shoot manually, it is a lot easier to try using aperture priority first, and seeing how the settings change.
How do I learn to shoot manually?
Ok. Ok.
You know how I said "all the scales are different"? Well, I lied. Somewhat. They are all different, but they all measure the same thing: the amount of light the camera receives.
When you double the ISO (say 50 to 100), it can see twice as much light as before. When you double the shutter (say 1/125 to 1/250), you halve the amount of light that gets into the camera. When you, uh, multiply the aperture by the square root of two (i.e. f/2 to f/2.8, or f/2.8 to f/4) you halve the amount of light being let in.
Thinking about doubling and halving the light is so common in photography that we have a special term for it. A stop. For example, one might say "Stopping down a lens" to mean going from f/2 to f/2.8. Going up a stop or down a stop is doubling or halving the amount of light.
Which is useful because you can't say "double the aperture" in the same way you can say "double the shutter speed", but you can say "going up a stop" and "down a stop" across any of the settings. So if you stop down the lens by one stop (halving light), you'll need to open up the shutter by a stop to compensate (doubling light). Or change the ISO.
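If you'd like the arithmetic spelled out, here's a little Python sketch that measures everything in stops against an arbitrary baseline of ISO 100, 1/125s, f/8. The baseline is my own choice; only the differences matter:

import math

def stops(iso: float, shutter_seconds: float, f_number: float) -> float:
    iso_stops = math.log2(iso / 100)                  # doubling ISO: +1 stop
    shutter_stops = math.log2(shutter_seconds * 125)  # doubling the shutter time: +1 stop
    aperture_stops = 2 * math.log2(8 / f_number)      # f/8 to f/5.6: +1 stop
    return iso_stops + shutter_stops + aperture_stops

# a faster shutter loses a stop, a wider aperture gains it back: same exposure
print(stops(100, 1/125, 8))    # 0.0
print(stops(100, 1/250, 5.6))  # ~0.0 (f/5.6 is really f/5.66, hence the rounding)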
There's often an "Exposure Compensation" dial on cameras, or a setting buried in a menu somewhere. It lets you over or underexpose a shot by a few stops, which comes in handy when you have an extra dark scene, or more commonly, you're taking a photo of someone who is backlit.
Anyway.
How do you learn to shoot manually?
You use a light meter to measure the thing you're going to photograph. You read off the settings it gives you, put them into your camera, and then adjust them together to ensure that the exposure remains the same.
If it sounds a lot like "Shutter Priority", "Aperture Priority", or "Programmed Auto", well, you're not wrong. For the most part, on a digital camera, and many analogue cameras, you can lean on the inbuilt light meter to do the work for you.
Unless you've decided to be a dork and bought yourself a fully analogue camera.
I can't blame you, I'm that dork too. You will want to get a light meter, and you may want to learn about the "Sunny 16 rule." Estimating the level of light in a scene is something you get better at doing from practice, and managing the exposure in a photo is one of those "real photographer skills" boring men on the internet keep banging on about.
It is useful, sure enough. Sometimes Auto doesn't do the right thing and you need to compensate. Sometimes you want to mess around with the settings by hand to see what's right, but taking well exposed boring shots the hard way doesn't make you a better photographer. Hopefully you'll have fun doing it, at least.
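For the curious, here's roughly what the Sunny 16 rule works out to as a sketch. The aperture-per-conditions table is the usual rule of thumb, not a measurement:

SUNNY_16_APERTURES = {
    "bright sun": 16,
    "slight overcast": 11,
    "overcast": 8,
    "heavy overcast": 5.6,
    "open shade or sunset": 4,
}

def sunny_16(iso: int, conditions: str) -> str:
    # in bright sun, at f/16, a decent shutter speed is roughly 1/ISO;
    # duller conditions open the aperture up a stop at a time
    f_number = SUNNY_16_APERTURES[conditions]
    return f"ISO {iso}, f/{f_number}, 1/{iso}s"

print(sunny_16(400, "overcast"))  # ISO 400, f/8, 1/400s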
You've talked a lot about lighting, but nothing about framing, or expression
That's because lighting is the most important detail. Framing is often more about personal taste. Expression is often about being in the right place at the right time. Equipment is just something you buy to fill the void in life, telling yourself that it won't depreciate in value.
Lighting? Lighting is the most important thing. Understanding white balance will give you the tones you want in an image. Understanding exposure and understanding stops means you can trade one quality for another, and take a little more agency over what the photo looks like.
That said: Although lighting means you can take a photograph, what makes a photograph good is highly subjective. I have some blurry ass shit photos that capture a moment, show an emotion, and I love em to pieces.
Photography is a numbers game. Good photos happen because you're in the right place, and take a whole bunch of photographs. If you're not happy with your photos, you simply aren't taking enough of them.
The best way to take a lot of photos is to have fun. If this means using a cheap ass camera, go for it. If it means using some German fancy pants camera that makes funny noises, sure, burn your disposable income. If it means using the phone in your pocket, that's great too.
It's about having fun, that's the secret of taking a good photo.
That said, I do have some framing advice: If you're taking a photo of something with eyes, try and get on their eye-line to take a photo. In other words, crouch down if you're taking a picture of a duck. It looks way better.
I want to shoot film.
I congratulate you on having a lot of disposable income.
In theory, film photography is cheap. You can pick up a crappy camera for almost the same price as a roll of film, and get it developed for the same price. This is the lie you tell yourself when you get started.
Film is always more expensive in practice. If you shoot more than 1000 shots on a decent digital camera, it'll be cheaper than shooting on film. If you shoot one roll of film every month, after two and a bit years, you'll be spending more per shot than you would have with digital.
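Here's the back-of-the-envelope version. Every price below is a made up round number; plug in your own:

digital_body = 500.0           # a decent used digital camera (made up)
film_and_dev_per_roll = 18.0   # film, developing, and scanning, per roll (made up)
shots_per_roll = 36

cost_per_film_shot = film_and_dev_per_roll / shots_per_roll  # 0.50 a shot
break_even_shots = digital_body / cost_per_film_shot         # 1000 shots
years_at_a_roll_a_month = break_even_shots / shots_per_roll / 12  # about 2.3 years

print(cost_per_film_shot, break_even_shots, round(years_at_a_roll_a_month, 1))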
Let me be clear. You don't get into film because it's "cheaper", you get into film for one of two reasons.
You're an unbearable hipster, and you'd rather shoot Portra than use a film simulation preset. Nothing I can say will change how you feel, and you probably already know which Leica and/or Hasselblad you want. Go for it. I won't stop you. I hope you enjoy posting phone pics of the view through the waist level viewfinder, or posting videos of you loading expired film into the camera.
The other option is that you're looking for a mechanical stim toy. In which case, congratulations, you are about to make the investment of a lifetime. There are a lot of old cheap cameras out there, and they make some incredible noises.
Now, wait. You might tell me there's a third reason, or even a fourth. You are welcome to indulge in this cognitive dissonance, talking about "grain" you can simulate, talking about constraints you can impose yourself, but truthfully you're getting into film to tell people you're getting into film (the vibes), or you're getting into film because whirr click box make good noise (the stims).
Maybe I am being a bit mean. Sure enough some people get into it because of nostalgia, or they enjoy the process. Darkroom chemistry is fun, and making something happen without a computer is kinda magical after staring at the bad screen all week. Even so, you stick around for the vibes or the stims.
Anyway, if you're getting into film, let me give you some free advice.
You want to use 135/35mm film.
Medium format (120, not 120mm), large format (4x5 and higher), and subminiature formats can be harder to obtain, harder to process, and often more expensive than plain old 35mm film.
Black and white film is also a great place to start if you're unsure of which film to go for. Something like Tri-X or HP5 will cover a wide variety of uses, and is substantially cheaper than color options. Slide films (Velvia, Provia, E100, etc) are much less forgiving than any other type of film. Try them out, sure, but after you're comfortable.
The best film to start with might be XP2 Super. It's black and white, but you can send it to any lab that handles color film (it's chromogenic, and gets processed in C41 chemicals). XP2 can be shot at any ISO between 50 and 800, and you can under and overexpose shots by a significant margin and get away with it.
It's great for using in those toy plastic cameras you bought on eBay while drunk.
At home development doesn't need a darkroom.
You need a tank, a changing bag, and some chemicals. Black and white film is very forgiving, unless you develop it in a DF96 monobath, in which case any slight variation of time, temperature, or the phase of the moon can give wildly different results. Really.
Don't be tricked by a monobath. It is not simpler, it is harder to get reproducible results, and it's more expensive than re-using fixer and stop bath in the long run. I've seen more rolls ruined by monobaths than I have by any other method.
Old cameras are janky as fuck, unless you buy from Japan.
The shutter won't always fire at the right speed, and the light meter will probably be broken or not work entirely. That is unless you buy from a Japanese eBay vendor.
It's a bit more expensive, but if you want a camera that's been tested, cleaned, and inspected, Japanese camera vendors are at the top of their game. There are some western stores with similar quality, but they're few and far between.
An ideal first camera has aperture priority.
This is more of a personal opinion, but I think it's a lot easier to shoot film when you can lean on the camera to handle exposure at first. Metering by hand can take a lot of the spontaneity out of photography, and it's a bit clumsy when you're starting out.
You don't want a camera with a Selenium (battery free) meter. It won't work. You want a cadmium sensor (uses a battery) and you want a camera that takes batteries still in production.
Ideally, you'll want a camera that accepts LR44 batteries. Cameras that take silver cells require a constant voltage, which can lead to fun results when your batteries start to fade. Cameras that require mercury batteries require an expensive adaptor, or just ignoring the light meter and hoping for the best.
Find something made after 1980 and you should be fine.
It's worth remembering that a camera with cheap lenses is going to be more fun.
You can buy into old Nikon, Canon, or Leica gear, but you don't have to. Minolta, Konica, Kodak, Olympus, and a number of Russian companies made equipment that's as good and in some cases better. If you do decide to buy an SLR with interchangeable lenses, check around for the lenses first.
It may help you pick and choose what you want. I'm hesitant to recommend any cameras directly, as I don't want the hipsters to pick up on it. It's nice that there's still some affordable starter cameras out there.
Hipster cameras are often overrated. Rangefinders especially.
Rangefinder users are often about the vibes, and who can blame them. Nice small chunks of metal that are small enough to carry, and make a pleasing but subtle click as the shutter fires. The prices however are driven by collectable status, rather than practical experience.
Rangefinders suck ass when the light disappears. They're good for normal lenses (35-50mm), but aren't as good for very wide or very long lenses. Some rangefinders require you to focus and compose separately. Some rangefinders require you to cut the film in a specific way to load it. Some rangefinders don't even have a nice winding crank, and the stim just ain't as good.
On the other hand, a lot of SLRs have their bad points. Often the weight, and always the size, but if you're looking for a more general purpose film camera, an SLR is the right choice.
I say this as someone who owns several rangefinders, enjoys them, and even nails a shot at f/0.95 now and then: If you're getting into film, don't get a rangefinder.
Zone focusing is the ultimate point and shoot experience.
Focusing a lens comes in many forms, and it'd be worth explaining the different kinds.
Guessing.
That looks about 3 meters away, so I dial 3m into the lens and pray. Sometimes it works.
Ground glass.
You put a bit of glass in the back of the camera, move the lens until the image appears in focus, and then swap the glass out for film. Large format and some medium format cameras work this way.
Twin lens reflex
You put another lens on the camera, and look through it to focus the other lens. Gears are involved. They're kinda quirky and cute, but much much bulkier.
Rangefinder.
There's a little device inside the viewfinder that has an overlapping image. As you adjust the focus on the lens, it adjusts the overlapping images. When you see no overlap, that's what the lens is focusing on. Some people love it, almost religiously.
Single lens reflex
There's one lens, a mirror, and a view finder. The mirror slaps out the way when you hit the shutter. Good times.
Zone focusing / Hyperfocal distances.
This is basically guessing, but by stopping down the lens, you get a wider depth of field. You can assume something is either 0.7m-1.5m away, 1.5-3m away, or 3m or over, and mostly get things right.
There is something liberating about just going "fuck it" and hitting the shutter. If you just want a chill time, find a camera with zone focusing. You select "Portrait", "Group" or "Landscape" and hope for the best. A zone focusing camera is a real fun party camera, one that other people can use without thinking.
Hyperfocal distancing is the posh version of zone focusing. Many lenses come with a little scale above the focus dial that says how wide the depth of field is at a given aperture. It's pretty much the same deal, but you're doing zone focusing by hand rather than with a three-mode selector.
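If you want to make your own scale, it's one small formula. A rough sketch, assuming the usual 0.03mm circle of confusion for full frame (smaller sensors use a smaller number):

def hyperfocal_metres(focal_length_mm: float, f_number: float, coc_mm: float = 0.03) -> float:
    # focus here and everything from half this distance to infinity is acceptably sharp
    return ((focal_length_mm ** 2) / (f_number * coc_mm) + focal_length_mm) / 1000

h = hyperfocal_metres(28, 8)
print(round(h, 1), round(h / 2, 1))  # ~3.3m: sharp from about 1.6m to infinity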
Rangefinder camera owners regularly use zone focusing in order to capture shots quickly, because as it turns out, even with sufficient German engineering, manual focusing is slow.
Nighttime, indoors, and the tyranny of silver halides.
If you want to shoot at night time, or under artificial light, don't bother trying to shoot film. I mean, you can. You buy expensive film, learn how to push film, and get a lovely punk rock grainy mess. That, or you end up being the dork blinding everyone with a flashgun.
Analogue film just isn't great at low light shots. It's fun, but your bad photos will look terrible, and your good photos won't look great. I still enjoy it though, but I wouldn't recommend it to someone getting started.
Flash especially is hard to get right.
You point the flash directly at someone and you get the least flattering picture of them you'll ever see in your life. It's better to use bounce flash, a diffuser, or a soft box to light someone's face up.
In a pinch, you can use a cigarette paper over the flash as a cheap-ass diffuser. This works really well with phones too.
Anyway, if you really want to get into flash photography, well, you'll probably want a camera that supports TTL metering. That gets expensive, quickly. Old cameras and lenses have very slow flash sync speeds, often as bad as 1/30s, which can lead to a blurry mess if you're shooting handheld.
Flash photography is honestly such a deep subject that it deserves a whole other write up, but I'm not the right person to do that.
Scanning your own negatives is pain and suffering.
Really. It's bad. Awful. You dust and dust and dust and still there are specks on the film. Black and white film can't use automatic dust removal tech, but color film can. Now you have to calibrate your monitor and your scanner to ensure accurate colors come through.
The suffering never ends with scanning. Pay someone else to do it.
Are you saying I should avoid film?
In the end, no-one cares if you shoot film, except for other film nerds. You should just know you're getting into an expensive, time consuming hobby.
I've no regrets.
programmingisterrible · 5 years ago
How to befriend Crows
There is no secret to making friends with crows. Like many things in life, it requires time, patience, and disposable income.
Step 1: Find Crows.
Crows tend to hang out where humans leave trash, so they’re everywhere.
Step 2: Leave food out for the crows to eat.
Crows are carrion-eaters, they’ll eat almost anything.
They like raw peanuts (in the shell, unsalted, not roasted either), they like suet, and they like wet and dry pet food. They love popcorn, too.
Don’t sit and watch them eat, it’s rude. That, and they'll think you're a predator.
You’re often better off dropping food and walking slowly away, until you earn their trust. Give it time.
Step 3: Repeat and wait.
Eventually, the crows begin to recognize you. Eventually, the crows won’t be as wary of you as they are other people. Still, even when the crows know you, they don’t come that close.
At the end of the day, the crows are wild animals. They have good reason to be wary of humans.
I’ve been feeding my local crows for over a year. They still keep their distance, but they will chase me around the park.
  Summary
Step 1: Find crows.
Step 2: Leave food for them to eat, but don’t sit and watch them.
Step 3: Wait.
Pet food stores are your best bet for unsalted peanuts, suet too.
Once they begin to trust you, well, be less afraid of you, they’ll start to eat food when you’re nearby. From there, it’s just a slow and steady progression until the crows start following you home.
There isn’t much more to it—you just do kind things, and wait to earn their trust.
programmingisterrible · 6 years ago
Scaling in the presence of errors—don’t ignore them
Building a reliable, robust service often means building something that can keep working when some parts fail. A website where not every feature is available is often better than a website that’s entirely offline. Doing this in a meaningful way is not obvious.
The usual response is to hire more DBAs, more SREs, and even more folk in Support. Error handling, or making software that can recover from faults, often feels like the option of last resort—if ever considered in the first place.
The usual response to error handling is optimism. Unfortunately, the other choices aren't exactly clear, and are often difficult to choose between, too. If you have two services, what do you do when one of them is offline: Try again later? Give up entirely? Or just ignore it and hope the problem goes away?
Surprisingly, all of these can be reasonable approaches. Even ignoring problems can work out for some systems. Sort-of. You don’t get to ignore errors, but sometimes recovering from an error can look very similar to ignoring it.
Imagine an orchard filled with wireless sensors for heat, light, and moisture. It makes little sense to try and resend old temperature readings upon error. It isn’t the sensor’s job to ensure the system works, and there’s very little a sensor can do about it, too. As a result, it’s pretty reasonable for a sensor to send messages with wild abandon—or in other words, fire-and-forget.
The idea behind fire-and-forget is that you don’t need to save old messages when the next message overrides it, or when a missing message will not cause problems. A situation where each message is treated as being the first message sent—forgetting that any attempt was made prior.
Done well, fire-and-forget is like a daily meeting—if someone misses the meeting, they can turn up the next day. Done badly, fire-and-forget is akin to replacing email with shouting across the office, hoping that someone else will take notes.
It isn’t that there’s no error handling in a fire-and-forget client, it’s that the best method of recovery is to just keep going. Unfortunately, people often misinterpret fire-and-forget to mean “avoid any error handling and hoping for the best”.
You don’t get to ignore errors.
When you ignore errors, you only put off discovering them—it’s not until another problem is caused that anyone even realises something has gone wrong. When you ignore errors, you waste time that could be spent recovering from them.
This is why, despite the occasional counter example, the best thing to do when encountering an error is to give up. Stop before you make anything worse and let something else handle it.
Giving up is a surprisingly reasonable approach to error handling, assuming something else will try to recover, restart, or resume the program. That's why almost every network service gets run in a loop—restarting immediately upon crashing, hoping the fault was transient. It often is.
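The loop itself is barely worth writing down, but here's a minimal sketch anyway, where "./my-service" stands in for whatever you're supervising:

import subprocess
import time

while True:
    exit_code = subprocess.call(["./my-service"])
    print(f"service exited with {exit_code}, restarting")
    time.sleep(1)  # a real supervisor would back off, rather than hammer a broken service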
There’s little point in trying to repeatedly connect to a database when the user is already mashing refresh in the browser. A unix pipeline could handle every possible bad outcome, but more often than not, running the program again makes everything work.
Although giving up is a good way to handle errors, restarting from scratch isn’t always the best way to recover from them.
Some pipelines work on large volumes of data, or do arduous amounts of numerical processing, and no-one is ever happy about repeating days or weeks of work. In theory, you could add error handling code, reduce the risk that the program will crash, and avoid an expensive restart, but in practice it's often easier to restructure code to carry on where it left off.
In other words, give up, but save your progress to make restarting less time consuming.
For a pipeline, this usually entails an awful lot of temporary files—to save the output of each subcommand, and the result of splitting the input up into smaller batches. You can even retry things automatically, but for a lot of pipelines, manual recovery is still relatively inexpensive.
For other long running processes, this usually means something like checkpoints, or sagas. Or in other words, transforming a long running process into a short running one that’s run constantly, writing out the progress it makes to some file or database somewhere.
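Here's a minimal sketch of that kind of checkpointing, where process() and the list of items stand in for the real, expensive work:

import json
import os

CHECKPOINT = "progress.json"

def process(item) -> None:
    ...  # stand-in for the real, expensive work

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    # write to a temporary file, then rename, so a crash mid-write
    # can't leave a half-written checkpoint behind
    with open(CHECKPOINT + ".tmp", "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

def run(items: list) -> None:
    # a restart picks up from wherever the last run got to
    for i in range(load_checkpoint(), len(items)):
        process(items[i])
        save_checkpoint(i + 1)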
Over time, every long running process will get broken up into smaller parts, as restarting from scratch becomes prohibitively expensive. A long running process is just that more likely to run into an impossible error—full disks, no free memory, cosmic rays—and be forced to give up.
Sometimes the only way to handle an error is to give up.
As a result, the best way to handle errors is to structure your program to make recovery easier. Recovery is what makes all the difference between “fire-and-forget” and “ignoring-every-error” despite sharing the same optimism.
You can do things that look like ignoring errors, or even letting something else handle it, as long as there’s a plan to recover from them. Even if it’s restarting from scratch, even if it’s waking someone up at night, as long as there’s some plan, then you aren’t ignoring the problem. Assuming the plan works, that is.
You don’t get to ignore errors. They’re inevitably someone’s problem. If someone tells you they can ignore errors, they’re telling you someone else is on-call for their software.
That, or they’re using a message broker.
A message broker, if you’re not sure, is a networked service that offers a variety of queues that other computers on the network can interact with. Usually some clients enqueue messages, and others poll for the next unread message, but they can be used in a variety of other configurations too.
Like with a unix pipe, message brokers are used to get software off the ground. Similarly to using temporary files, the broker allows for different parts of the pipeline to consume and produce inputs at different rates, but don’t easily allow replaying or restarting when errors occur.
Like a unix pipe, message brokers are used in a very optimistic fashion. Firing messages into the queue and moving on to the next task at hand.
Somewhat like a unix pipeline, but with some notable differences. A unix pipeline blocks when full, pausing the producer until the consumer can catch up. A unix pipeline will exit if any of the subcommands exit, and return an error if the last subcommand failed.
A message broker does not block the producer until the consumer can catch up. In theory, this means transient errors or network issues between components don’t bring the entire system down. In practice, the more queues you have in a pipeline, the longer it takes to find out if there’s a problem.
Sometimes that works out. When there’s no growth, brokers act like a buffer between parts of a system, handling variance in load. They work well at slowing bursty clients down, and can provide a central point for auditing or access control.
When there is growth, queues explode regularly until some form of rate limiting appears. When more load arrives, queues are partitioned, and then repartitioned. Scaling a broker inevitably results in moving to something where the queue is bounded, or even ephemeral.
The problem with optimism is that when things do go wrong, not only do you have no idea how to fix it, you don’t even know what went wrong. To some extent, a message broker hides errors—programs can come and go as they please, and there’s no way to tell if the other part is still reading your messages—but it can only hide errors for so long.
In other words, fire-and-regret.
Although an unbounded queue is a tempting abstraction, it rarely lives up to the mythos of freeing you from having to handle errors. Unlike a unix pipeline, a message broker will always fill up your disks before giving up, and changing things to make recovery easy isn't as straightforward as adding more temporary files.
Brokers can only recover from one error—a temporary network outage—so other mechanisms get brought in to compensate. Timeouts, retries, and sometimes even a second “priority” queue, because head-of-line blocking is genuinely terrible to deal with. Even then, if a worker crashes, messages can still get dropped.
Queues rarely help with recovery. They frequently impede it.
Imagine a build pipeline, or background job service where requests are dumped into some queue with wild abandon. When something breaks, or isn’t running like it is supposed to, you have no idea where to start recovery.
With a background queue, you can’t tell what jobs are currently being run right now. You can’t tell if something’s being retried, or failed, but maybe you’ve got log files you can search through. With logs, you can see what the system was doing a few minutes ago, but you still have no idea what it might be doing right now.
Even if you know the size of a queue, you’ll have to check the dashboard a few minutes later—to see if the line wiggled—before you know for sure if things are probably working. Hopefully.
Making a build pipeline with queues is relatively easy, but building one that the user can cancel, or watch, involves a lot more work. As soon as you want to cancel a task, or inspect a task, you need to keep things somewhere other than a queue.
Knowing what a program is up to means tracking the in-between parts, and even for something as simple as running a background task, it can involve many states—Created, Enqueued, Processing, Complete, Failed, not just Enqueued—and a broker only handles that last part.
Not very well. As soon as one queue feeds into another, an item of work can be in several different queues at once. If an item is missing from the queue, you know it's either been dropped or is being processed; if an item is in the queue, you don't know if it's being processed, but you do know it will be. A queue doesn't just hide errors, it hides state too.
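To make that concrete, here's a minimal sketch of tracking job state explicitly, somewhere you can query it; the states and the table are illustrative, not any particular framework:

import sqlite3

db = sqlite3.connect("jobs.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id TEXT PRIMARY KEY,
        state TEXT NOT NULL CHECK (state IN
            ('created', 'enqueued', 'processing', 'complete', 'failed')),
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def set_state(job_id: str, state: str) -> None:
    # every transition gets recorded, so the current state is a query away
    db.execute(
        "UPDATE jobs SET state = ?, updated_at = CURRENT_TIMESTAMP WHERE id = ?",
        (state, job_id),
    )
    db.commit()

# "what is the system doing right now?" stops being a guess
stuck = db.execute("SELECT id FROM jobs WHERE state = 'processing'").fetchall()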
Recovery means knowing what state the program was in before things went wrong, and when you fire-and-forget into a queue, you give up on knowing what happens to it. Handling errors, recovering from errors, means building software that knows what state it is currently operating in. It also means structuring things to make recovery possible.
That, or you give up on automated recovery of almost any kind. In some ways, I'm not arguing against fire-and-forget, or against optimism—but against optimism that prevents recovery. Not against queues, but how queues inevitably get used.
Unfortunately, recovery is relatively easy to imagine but not necessarily straightforward to implement.
This is why some people opt to use a replicated log, instead of a message broker.
If you’ve never used a replicated log, imagine an append only database table without a primary key, or a text file with backups, and you’re close. Or imagine a message broker, but instead of enqueue and dequeue, you can append to the log or read from the log.
Like a queue, a replicated log can be used in a fire-and-forget fashion with not so great consequences. Just like before, chaos will ensue as concepts like rate-limiting, head-of-line blocking, and the end-to-end-principle are slowly contended with—If you use a replicated log like a queue, it will fail like a queue.
Unlike a queue, a replicated log can aid recovery.
Every consumer sees the same log entries, in the same order, so it’s possible to recover by replaying the log, or by catching up on old entries. In some ways it’s more like using temporary files instead of a pipeline to join things together, and the strategies for recovery overlap with temporary files, too—like partitioning the log so that restarts aren’t as expensive.
Like temporary files, a replicated log can aid in recovery, but only to a certain point. A consumer will see the same messages, in the same order, but if an entry gets dropped before reaching the log, or if entries arrive in the wrong order, some, or potentially all hell can break loose.
You can’t just fire-and-forget into a log, not over a network. Although a replicated log is ordered, it will preserve the ordering it gets, whatever that happens to be.
This isn’t always a problem. Some logs are used to capture analytic data, or fed into aggregators, so the impact of a few missing or out of order entries is relatively low—a few missing entries might as well be called high-volume random sampling and declared a non-issue.
For other logs, missing entries could cause untold misery. Recovering from missing entries might involve rebuilding the entire log from scratch. If you’re using a replicated log for replication, you probably care quite a lot about the order of log entries.
Like before, you can’t ignore errors—you only make things expensive to recover from.
Handling errors like out of order or missing log entries means being able to work out when they have occurred.
This is more difficult than you might imagine.
Take two services, a primary and a secondary, both with databases, and imagine using a replicated log to copy changes from one to another.
It doesn’t seem to difficult at first. Every time the primary service makes a change to the database, it writes to to log. The secondary reads from the log, and updates its database. If the primary service is a single process, it’s pretty easy to ensure that every message is sent in the right order. When there’s more than one writer, things can get rather involved.
Now, you could switch things around—write to the log first, then apply the changes to the database, or use the database’s log directly—and avoid the problem altogether, but these aren’t always an option. Sometimes you’re forced to handle the problem of ordering the entries yourself.
In other words, you’ll need to order the messages before writing them to the log.
You could let something else provide the order, but you’d be mistaken if you think a timestamp would help. Clocks move forwards and backwards and this can cause all sorts of headaches.
One of the most frustrating problems with timestamps is ‘doomstones’: when a service deletes a key but has a wonky clock far out in the future, and issues an event with a similar timestamp. All operations get silently dropped until the deletion event is cleared. The other problem with timestamps is that if you have two entries, one after the other, you can’t tell if there are any entries that came between them.
Things like “Hybrid Logical Clocks”, or even atomic clocks can help to narrow down clock drift, but only so much. You can only narrow down the window of uncertainty, there’s still some clock skew. Again, clocks will go forwards and backwards—timestamps are terrible for ordering things precisely.
In practice you need explicit version numbers, 1,2,3... etc, or a unique identifier for each version of each entry, and a link back to the record being updated, to order messages.
With a version number, messages can be reordered, missing messages can be detected, and both can be recovered from, although managing and assigning those version numbers can be quite difficult in practice. Timestamps are still useful, if only for putting things in a human perspective, but without a version number, it’s impossible to know what precise order things happened in—and that no steps are missing, either.
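Here's a minimal sketch of what that looks like on the consumer side, where apply_change() stands in for updating the local copy:

class MissingEntries(Exception):
    """A gap in version numbers: entries were lost, so recover rather than guess."""

def apply_change(entry) -> None:
    ...  # stand-in for updating the local copy

last_seen: dict[str, int] = {}  # record id -> highest version applied so far

def handle(entry) -> None:
    expected = last_seen.get(entry.record_id, 0) + 1
    if entry.version < expected:
        return  # stale or duplicate: it's already been applied, drop it
    if entry.version > expected:
        raise MissingEntries(entry.record_id, expected, entry.version)
    apply_change(entry)
    last_seen[entry.record_id] = entry.version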
You don’t get to ignore errors, but sometimes the error handling code isn’t that obvious.
Using version numbers or even timestamps both fall under building a plan for recovery. Building something that can continue to operate in the presence of failure. Unfortunately, building something that works when other parts fail is one of the more challenging parts of software engineering.
It doesn’t help that doing the same thing in the same order is so difficult that people use terms like causality and determinism to make the point sink in.
You don’t get to ignore errors, but no one said it was going to be easy.
Although using things like replicated logs, message brokers, or even using unix pipes can allow you to build prototypes, clear demonstrations of how your software works—they do not free you from the burden of handling errors.
You can’t avoid error handling code, not at scale.
The secret to error handling at scale isn't giving up, ignoring the problem, or even trying again—it is structuring a program for recovery, making errors stand out, allowing other parts of the program to make decisions.
Techniques like fail-fast, crash-only-software, process supervision, but also things like clever use of version numbers, and occasionally the odd bit of statelessness or idempotence. What these all have in common is that they’re all methods of recovery.
Recovery is the secret to handling errors. Especially at scale.
Giving up early so other things have a chance, continuing on so other things can catch up, restarting from a clean state to try again, saving progress so that things do not have to be repeated.
That, or put it off for a while. Buy a lot of disks, hire a few SREs, and add another graph to the dashboard.
The problem with scale is that you can’t approach it with optimism. As the system grows, it needs redundancy, or to be able to function in the presence of partial errors or intermittent faults. Humans can only fill in so many gaps.
Staff turnover is the worst form of technical debt.
Writing robust software means building systems that can exist in a state of partial failure (like incomplete output), and writing resilient software means building systems that are always in a state of recovery (like restarting)—neither come from engineering the happy path of your software.
When you ignore errors, you transform them into mysteries to solve. Something or someone else will have to handle them, and then have to recover from them—usually by hand, and almost always at great expense.
The problem with avoiding error handling in code is that you’re only avoiding automating it.
In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.
That, or burnout. Lots of burnout. You don’t get to ignore errors.
programmingisterrible · 6 years ago
What the hell is REST, Anyway?
Originating in a thesis, REST is an attempt to explain what makes the browser distinct from other networked applications.
You might be able to imagine a few reasons why: there's tabs, there's a back button too, but what makes the browser unique is that a browser can be used to check email, without knowing anything about POP3 or IMAP.
Although every piece of software inevitably grows to check email, the browser is unique in the ability to work with lots of different services without configuration—this is what REST is all about.
HTML only has links and forms, but it's enough to build incredibly complex applications. HTTP only has GET and POST, but that's enough to know when to cache or retry things, HTTP uses URLs, so it's easy to route messages to different places too.
Unlike almost every other networked application, the browser is remarkably interoperable. The thesis was an attempt to explain how that came to be, and called the resulting style REST.
REST is about having a way to describe services (HTML), to identify them (URLs), and to talk to them (HTTP), where you can cache, proxy, or reroute messages, and break up large or long requests into smaller interlinked ones too.
How REST does this isn't exactly clear.
The thesis breaks down the design of the web into a number of constraints—Client-Server, Stateless, Caching, Uniform Interface, Layering, and Code-on-Demand—but it is all too easy to follow them and end up with something that can't be used in a browser.
REST without a browser means little more than "I have no idea what I am doing, but I think it is better than what you are doing.", or worse "We made our API look like a database table, we don't know why". Instead of interoperable tools, we have arguments about PUT or POST, endless debates over how a URL should look, and somehow always end up with a CRUD API and absolutely no browsing.
There are some examples of browsers that don't use HTML, but many of these HTML replacements are for describing collections, and as a result most of the browsers resemble file browsing more than web browsing. It's not to say you need a back and a next button, but it should be possible for one program to work with a variety of services.
For an RPC service you might think about a curl like tool for sending requests to a service:
$ rpctl http://service/ describe MyService
methods: ...., my_method
$ rpctl http://service/ describe MyService.my_method
arguments: name, age
$ rpctl http://service/ call MyService.my_method --name="James" --age=31
Result:
   message: "Hello, James!"
You can also imagine a single command line tool for a databases that might resemble kubectl:
$ dbctl http://service/ list ModelName --where-age=23
$ dbctl http://service/ create ModelName --name=Sam --age=23
$ ...
Now imagine using the same command line tool for both, and using the same command line tool for every service—that's the point of REST. Almost.
$ apictl call MyService:my_method --arg=...
$ apictl delete MyModel --where-arg=...
$ apictl tail MyContainers:logs --where ...
$ apictl help MyService
You could implement a command line tool like this without going through the hassle of reading a thesis. You could download a schema in advance, or load it at runtime, and use it to create requests and parse responses, but REST is quite a bit more than being able to reflect, or describe a service at runtime.
The REST constraints require using a common format for the contents of messages so that the command line tool doesn't need configuring, require sending the messages in a way that allows you to proxy, cache, or reroute them without fully understanding their contents.
REST is also a way to break apart long or large messages up into smaller ones linked together—something far more than just learning what commands can be sent at runtime, but allowing a response to explain how to fetch the next part in sequence.
To demonstrate, take an RPC service with a long running method call:
class MyService(Service):
    @rpc()
    def long_running_call(self, args: str) -> bool:
        id = third_party.start_process(args)
        while third_party.wait(id):
            pass
        return third_party.is_success(id)
When a response is too big, you have to break it down into smaller responses. When a method is slow, you have to break it down into one method to start the process, and another method to check if it's finished.
class MyService(Service):
    @rpc()
    def start_long_running_call(self, args: str) -> str:
        ...

    @rpc()
    def wait_for_long_running_call(self, key: str) -> bool:
        ...
In some frameworks you can use a streaming API instead, but replacing a procedure call with streaming involves adding heartbeat messages, timeouts, and recovery, so many developers opt for polling instead—breaking the single request into two, like the example above.
Both approaches require changing the client and the server code, and if another method needs breaking up you have to change all of the code again. REST offers a different approach.
We return a response that describes how to fetch another request, much like a HTTP redirect. You'd handle them in a client library much like an HTTP client handles redirects, too.
def long_running_call(self, args: str) -> Result[bool]:
    key = third_party.start_process(args)
    return Future("MyService.wait_for_long_running_call", {"key": key})

def wait_for_long_running_call(self, key: str) -> Result[bool]:
    if not third_party.wait(key):
        return third_party.is_success(key)
    else:
        return Future("MyService.wait_for_long_running_call", {"key": key})
def fetch(request):
    response = make_api_call(request)
    while response.kind == 'Future':
        request = make_next_request(response.method_name, response.args)
        response = make_api_call(request)
    return response
For the more operations minded, imagine I call time.sleep() inside the client, and maybe imagine the Future response has a duration inside. The neat trick is that you can change the amount the client sleeps by changing the value returned by the server.
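Here's a sketch of that client, assuming the Future response also carries a server-chosen wait_seconds field (my invention, not part of the example above):

import time

def fetch(request):
    response = make_api_call(request)
    while response.kind == 'Future':
        time.sleep(response.wait_seconds or 1)  # the server picks the polling interval
        request = make_next_request(response.method_name, response.args)
        response = make_api_call(request)
    return response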
The real point is that by allowing a response to describe the next request in sequence, we've skipped over the problems of the other two approaches—we only need to implement the code once in the client.
When a different method needs breaking up, you can return a Future and get on with your life. In some ways it's as if you're returning a callback to the client, something the client knows how to run to produce a request. With Future objects, it's more like returning values for a template.
This approach works for breaking up a large response into smaller ones too, like iterating through a long list of results. Pagination often looks something like this in an RPC system:
cursor = rpc.open_cursor()
output = []
while cursor:
    output.append(cursor.values)
    cursor = rpc.move_cursor(cursor.id)
Or something like this:
start = 0
output = []
while True:
    out = rpc.get_values(start, batch=30)
    output.append(out)
    start += len(out)
    if len(out) < 30:
        break
The first pagination example stores state on the server, and gives the client an id to use in subsequent requests. The second stores state on the client, and constructs the correct request to make from that state. There are advantages and disadvantages—it's better to store the state on the client (so that the server does less work), but it involves manually threading state through the code and makes for a much harder API to use.
Like before, REST offers a third approach. Instead, the server can return a Cursor response (much like a Future) with a set of values and a request message to send (for the next chunk).
class ValueService(Service):
    @rpc()
    def get_values(self):
        return Cursor("ValueService.get_cursor", {"start": 0, "batch": 30}, [])

    @rpc()
    def get_cursor(self, start, batch):
        ...
        return Cursor("ValueService.get_cursor", {"start": start, "batch": batch}, values)
The client can handle a Cursor response, building up a list:
cursor = rpc.get_values()
output = []
while cursor:
    output.append(cursor.values)
    cursor = cursor.move_next()
It's somewhere between the two earlier examples of pagination—instead of managing the state on the server and sending back an identifier, or managing the state on the client and carefully constructing requests—the state is sent back and forth between them.
As a result, the server can change details between requests! If the server wants to, it can return a Cursor with a smaller set of values, and the client will just make more requests to get all of them—without the server having to track the state of every cursor open against it.
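To make that concrete, here is a minimal sketch of a client-side Cursor wrapper. It reuses make_api_call and make_next_request from the earlier fetch example; everything else is assumed rather than taken from any particular framework.

class ClientCursor:
    def __init__(self, response):
        self.values = response.values        # the chunk of results in this response
        self._method = response.method_name  # template for fetching the next chunk
        self._args = response.args

    def move_next(self):
        # The pagination state (start, batch, or whatever the server chose)
        # travels inside the request template, not in a server-side session.
        if self._method is None:
            return None
        response = make_api_call(make_next_request(self._method, self._args))
        return ClientCursor(response) if response.kind == 'Cursor' else None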
This idea of linking messages together isn't just limited to long polling or pagination. If you can describe services at runtime, why can't you return ones with some of the arguments already filled in? A Service can contain state to pass into its methods, too.
To demonstrate how, and why you might do this, imagine some worker that connects to a service, processes work, and uploads the results. The first attempt at server code might look like this:
class WorkerApi(Service):
    def register_worker(self, name: str) -> str:
        ...

    def lock_queue(self, worker_id: str, queue_name: str) -> str:
        ...

    def take_from_queue(self, worker_id: str, queue_name, queue_lock: str):
        ...

    def upload_result(self, worker_id, queue_name, queue_lock, next, result):
        ...

    def unlock_queue(self, worker_id, queue_name, queue_lock):
        ...

    def exit_worker(self, worker_id):
        ...
Unfortunately, the client code looks much nastier:
worker_id = rpc.register_worker(my_name)
lock = rpc.lock_queue(worker_id, queue_name)
while True:
    next = rpc.take_from_queue(worker_id, queue_name, lock)
    if next:
        result = process(next)
        rpc.upload_result(worker_id, queue_name, lock, next, result)
    else:
        break
rpc.unlock_queue(worker_id, queue_name, lock)
rpc.exit_worker(worker_id)
Each method requires a handful of parameters relating to the current session open with the service. They aren't strictly necessary—although they do make debugging a system far easier—but the problem of having to chain together requests might be a little familiar.
What we'd rather do is use some API where the state between requests is handled for us. The traditional way to achieve this is to build these wrappers by hand, creating special code on the client to assemble the responses.
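Building that wrapper by hand usually means a client class that squirrels away the identifiers and threads them through every call, something like this sketch (which assumes the rpc stub used above):

class WorkerClient:
    def __init__(self, rpc, name):
        self.rpc = rpc
        self.worker_id = rpc.register_worker(name)
        self.queue_name = None
        self.lock = None

    def lock_queue(self, queue_name):
        # Remember the identifiers so callers don't have to pass them around.
        self.queue_name = queue_name
        self.lock = self.rpc.lock_queue(self.worker_id, queue_name)

    def take_next(self):
        return self.rpc.take_from_queue(self.worker_id, self.queue_name, self.lock)

    def upload(self, task, result):
        self.rpc.upload_result(self.worker_id, self.queue_name, self.lock, task, result)

    def close(self):
        self.rpc.unlock_queue(self.worker_id, self.queue_name, self.lock)
        self.rpc.exit_worker(self.worker_id)

Every new identifier the server introduces means another field to stash and another wrapper method to update by hand.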
With REST, we can define a Service that has methods like before, but also contains a little bit of state, and return it from other method calls:
class WorkerApi(Service):
    def register(self, worker_id):
        return Lease(worker_id)

class Lease(Service):
    worker_id: str

    @rpc()
    def lock_queue(self, name):
        ...
        return Queue(self.worker_id, name, lock)

    @rpc()
    def expire(self):
        ...

class Queue(Service):
    name: str
    lock: str
    worker_id: str

    @rpc()
    def get_task(self):
        return Task(.., name, lock, worker_id)

    @rpc()
    def unlock(self):
        ...

class Task(Service):
    task_id: str
    worker_id: str

    @rpc()
    def upload(self, out):
        mark_done(self.task_id, self.actions, out)
Instead of one service, we now have four. Instead of returning identifiers to pass back in, we return a Service with those values filled in for us. As a result, the client code looks a lot nicer—you can even add new parameters in behind the scenes.
lease = rpc.register_worker(my_name)
queue = lease.lock_queue(queue_name)
while True:
    next = queue.take_next()
    if next:
        next.upload_result(process(next))
    else:
        break
queue.unlock()
lease.expire()
Although the Future looked like a callback, returning a Service feels like returning an object. This is the power of self-description: unlike reflection, where you specify in advance every request that can be made, each response has the opportunity to define a new parameterised request.
It's this navigation through several linked responses that distinguishes a regular command line tool from one that browses—and where REST gets its name: the passing back and forth of requests from server to client is where the 'state-transfer' part of REST comes from, and using a common Result or Cursor object is where the 'representational' comes from.
A RESTful system is more than just these pieces combined, though: along with a reusable browser, you get reusable proxies too.
In the same way that messages describe things to the client, they describe things to any middleware between client and server: using GET, POST, and distinct URLs is what allows caches to work across services, and using a stateless protocol (HTTP) is what allows a proxy or load balancer to work so effortlessly.
The trick with REST is that despite HTTP being stateless, and despite HTTP being simple, you can build complex, stateful services by threading the state invisibly between smaller messages—transferring a representation of state back and forth between client and server.
Although REST gives you a browser, the real point is to use self-description and state-transfer to allow heavy amounts of interoperation—not just a reusable client, but reusable proxies, caches, or load balancers.
Going back to the constraints (Client-Server, Stateless, Caching, Uniform Interface, Layering, and Code-on-Demand), you might be able to see how these things fit together to achieve these goals.
The first, Client-Server, feels a little obvious, but sets the background. A server waits for requests from a client, and issues responses.
The second, Stateless, is a little more confusing. If a HTTP proxy had to keep track of how requests link together, it would involve a lot more memory and processing. The point of the stateless constraint is that to a proxy, each request stands alone. The point is also that any stateful interactions should be handled by linking messages together.
Caching is the third constraint: labelling if a response can be cached (HTTP uses headers on the response), or if a request can be resent (using GET or POST). The fourth constraint, Uniform Interface, is the most difficult, so we'll cover it last. Layering is the fifth, and it roughly means "you can proxy it".
Code-on-demand is the final, optional, and most overlooked constraint, but it covers the use of Cursors, Futures, or parameterised Services—the idea that despite using a simple means to describe services or responses, the responses can define new requests to send. Code-on-demand takes that further, and imagines passing back code, rather than templates and values to assemble.
With the other constraints handled, it's time for uniform interface. Like stateless, this constraint is more about HTTP than it is about the system atop, and frequently misapplied. This is the reason why people keep making database APIs and calling them RESTful, but the constraint has nothing to do with CRUD.
The constraint is broken down into four ideas, and we'll take them one by one: self-descriptive messages, identification of resources, manipulation of resources through representations, hypermedia as the engine of application state.
Self-Description is at the heart of REST, and this sub-constraint fills in the gaps between the Layering, Caching, and Stateless constraints. Sort-of. It covers using 'GET' and 'POST' to indicate to a proxy how to handle things, and covers how responses indicate if they can be cached, too. It also means using a content-type header.
The next sub-constraint, identification, means using different URLs for different services. In the RPC examples above, it means having a common, standard way to address a service or method, as well as one with parameters.
This ties into the next sub-constraint, which is about using standard representations across services—this doesn't mean using special formats for every API request, but using the same underlying language to describe every response. In other words, the web works because everyone uses HTML.
Uniformity so far isn't too difficult: use HTTP (self-description), URLs (identification), and HTML (manipulation through representations). It's the last sub-constraint that causes most of the headaches: hypermedia as the engine of application state.
This is a fancy way of talking about how large or long requests can be broken up into interlinked messages, or how a number of smaller requests can be threaded together, passing the state from one to the next. Hypermedia refers to using Cursor, Future, or Service objects; application state is the details passed around as hidden arguments; and being the 'engine' means using them to tie the whole system together.
Together they form the basis of the Representational State-Transfer Style. More than half of these constraints can be satisfied by just using HTTP, and the other half only really help when you're implementing a browser, but there are still a few more tricks that you can do with REST.
Although a RESTful system doesn't have to offer a database like interface, it can.
Along with Service or Cursor, you could imagine Model or Rows objects to return, but you should expect a little more from a RESTful system than just create, read, update and delete. With REST, you can do things like inlining: along with returning a request to make, a server can embed the result inside. A client can skip the network call and work directly on the inlined response. A server can even make this choice at runtime, opting to embed if the message is small enough.
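Here is a sketch of how inlining might surface in a client, assuming a hypothetical result field embedded in the Future response:

def resolve(response):
    # If the server chose to embed the result, use it directly and skip the
    # round-trip; otherwise follow the request the response describes.
    if response.kind == 'Future':
        if getattr(response, 'result', None) is not None:
            return response.result
        return fetch(make_next_request(response.method_name, response.args))
    return response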
Finally, with a RESTful system, you should be able to offer things in different encodings, depending on what the client asks for—even HTML. In other words, if your framework can do all of these things for you, offering a web interface isn't too much of a stretch. If you can build a reusable command line tool, generating a web interface isn't too difficult, and at least this time you don't have to implement a browser from scratch.
If you now find yourself understanding REST, I'm sorry. You're now cursed. Like a cross between the Greek myths of Cassandra and Prometheus, you will be forced to explain the ideas over and over again to no avail. The terminology has been utterly destroyed to the point it has less meaning than 'Agile'.
Even so, the underlying ideas of interoperability, self-description, and interlinked requests are surprisingly useful—you can break up large or slow responses, you can browse or even parameterise services, and you can do it in a way that lets you reuse tools across services too.
Ideally someone else will have done it for you, and like with a web browser, you don't really care how RESTful it is, but how useful it is. Your framework should handle almost all of this for you, and you shouldn't have to care about the details.
If anything, REST is about exposing just enough detail—Proxies and load-balancers only care about the URL and GET or POST. The underlying client libraries only have to handle something like HTML, rather than unique and special formats for every service.
REST is fundamentally about letting people use a service without having to know all the details ahead of time, which might be how we got into this mess in the first place.
programmingisterrible · 7 years ago
Text
Repeat yourself, do more than one thing, and rewrite everything
If you ask a programmer for advice—a terrible idea—they might tell you something like the following: Don’t repeat yourself. Programs should do one thing and one thing well. Never rewrite your code from scratch, ever!
Following “Don’t Repeat Yourself” might lead you to a function with four boolean flags, and a matrix of behaviours to carefully navigate when changing the code. Splitting things up into simple units can lead to awkward composition and struggling to coordinate cross cutting changes. Avoiding rewrites means they’re often left so late that they have no chance of succeeding.
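For example, a helper that has been deduplicated into serving every caller tends to end up as something like this sketch (the function, flags, and database calls are invented for illustration):

def fetch_records(db, include_deleted=False, resolve_owners=False,
                  as_dicts=False, skip_cache=False):
    # Four flags means sixteen combinations of behaviour, and every caller
    # has to know which combination it wants, and which ones even work.
    records = db.query("records", include_deleted=include_deleted,
                       skip_cache=skip_cache)
    if resolve_owners:
        for record in records:
            record.owner = db.get("users", record.owner_id)
    return [record.to_dict() for record in records] if as_dicts else records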
The advice isn’t inherently bad—although there is good intent, following it to the letter can create more problems than it promises to solve.
Sometimes the best way to follow an adage is to do the exact opposite: embrace feature switches and constantly rewrite your code, pull things together to make coordination between them easier to manage, and repeat yourself to avoid implementing everything in one function.
This advice is much harder to follow, unfortunately.
Repeat yourself to find abstractions.
“Don’t Repeat Yourself” is almost a truism—if anything, the point of programming is to avoid work.
No-one enjoys writing boilerplate. The more straightforward it is to write, the duller it is to summon into a text editor. People are already tired of writing eight exact copies of the same code before even having to do so. You don’t need to convince programmers not to repeat themselves, but you do need to teach them how and when to avoid it.
“Don’t Repeat Yourself” often gets interpreted as “Don’t Copy Paste” or to avoid repeating code within the codebase, but the best form of avoiding repetition is in avoiding reimplementing what exists elsewhere—and thankfully most of us already do!
Almost every web application leans heavily on an operating system, a database, and a variety of other lumps of code to get the job done. A modern website reuses millions of lines of code without even trying. Unfortunately, programmers love to avoid repetition, and “Don’t Repeat Yourself” turns into “Always Use an Abstraction”.
By an abstraction, I mean two interlinked things: an idea we can think and reason about, and the way in which we model it inside our programming languages. Abstractions are a way of repeating yourself, so that you can change multiple parts of your program in one place. Abstractions allow you to manage cross-cutting changes across your system, or to share behaviours within it.
The problem with always using an abstraction is that you’re preemptively guessing which parts of the codebase need to change together. “Don’t Repeat Yourself” will lead to a rigid, tightly coupled mess of code. Repeating yourself is the best way to discover which abstractions, if any, you actually need.
As Sandi Metz put it, “duplication is far cheaper than the wrong abstraction”.
You can’t really write a re-usable abstraction up front. Most successful libraries or frameworks are extracted from a larger working system, rather than being created from scratch. If you haven’t built something useful with your library yet, it is unlikely anyone else will. Code reuse isn’t a good excuse to avoid duplicating code, and writing reusable code inside your project is often a form of preemptive optimization.
When it comes to repeating yourself inside your own project, the point isn’t to be able to reuse code, but rather to make coordinated changes. Use abstractions when you’re sure about coupling things together, rather than for opportunistic or accidental code reuse—it’s ok to repeat yourself to find out when.
Repeat yourself, but don’t repeat other people’s hard work. Repeat yourself: duplicate to find the right abstraction first, then deduplicate to implement it.
With “Don’t Repeat Yourself”, some insist that it isn’t about avoiding duplication of code, but about avoiding duplication of functionality or duplication of responsibility. This is more popularly known as the “Single Responsibility Principle”, and it’s just as easily mishandled.
Gather responsibilities to simplify interactions between them
When it comes to breaking a larger service into smaller pieces, one idea is that each piece should only do one thing within the system—do one thing, and do it well—and the hope is that by following this rule, changes and maintenance become easier.
It works out well in the small: reusing variables for different purposes is an ever-present source of bugs. It’s less successful elsewhere: although one class might do two things in a rather nasty way, disentangling it isn’t of much benefit when you end up with two nasty classes with a far more complex mess of wiring between them.
The only real difference between pushing something together and pulling something apart is that some changes become easier to perform than others.
The choice between a monolith and microservices is another example of this—the choice between developing and deploying a single service, or composing things out of smaller, independently developed services.
The big difference between them is that cross-cutting change is easier in one, and local changes are easier in the other. Which one works best for a team often depends more on environmental factors than on the specific changes being made.
Although a monolith can be painful when new features need to be added and microservices can be painful when co-ordination is required, a monolith can run smoothly with feature flags and short lived branches and microservices work well when deployment is easy and heavily automated.
Even a monolith can be decomposed internally into microservices, albeit in a single repository and deployed as a whole. Everything can be broken into smaller parts—the trick is knowing when it’s an advantage to do so.
Modularity is more than reducing things to their smallest parts.
Invoking the ‘single responsibility principle’, programmers have been known to brutally decompose software into a terrifyingly large number of small interlocking pieces—a craft rarely seen outside of obscenely expensive watches, or bash.
The traditional UNIX command line is a showcase of small components that do exactly one function, and it can be a challenge to discover which one you need and in which way to hold it to get the job done. Piping things into awk '{print $2}' is almost a rite of passage.
Another example of the single responsibility principle is git. Although you can use git checkout to do six different things to the repository, they all use similar operations internally. Despite having singular functionality, components can be used in very different ways.
A layer of small components with no shared features creates a need for a layer above where these features overlap, and if absent, the user will create one, with bash aliases, scripts, or even spreadsheets to copy-paste from.
Even adding this layer might not help you: git already has a notion of user-facing and automation-facing commands, and the UI is still a mess. It’s always easier to add a new flag to an existing command than it is to duplicate the command and maintain it in parallel.
Similarly, functions gain boolean flags and classes gain new methods as the needs of the codebase change. In trying to avoid duplication and keep code together, we end up entangling things.
Although components can be created with a single responsibility, over time their responsibilities will change and interact in new and unexpected ways. What a module is currently responsible for within a system does not necessarily correlate to how it will grow.
Modularity is about limiting the options for growth
A given module often gets changed because it is the easiest module to change, rather than the best place for the change to be made. In the end, what defines a module is what pieces of the system it will never be responsible for, rather than what it is currently responsible for.
When a unit has no rules about what code cannot be included, it will eventually contain larger and larger amounts of the system. This is eternally true of every module named ‘util’, and why almost everything in a Model-View-Controller system ends up in the controller.
In theory, Model-View-Controller is about three interlocking units of code. One for the database, another for the UI, and one for the glue between them. In practice, Model-View-Controller resembles a monolith with two distinct subsystems—one for the database code, another for the UI, both nestled inside the controller.
The purpose of MVC isn’t to just keep all the database code in one place, but also to keep it away from frontend code. The data we have and how we want to view it will change over time independent of the frontend code.
Although code reuse is good and smaller components are good, they should be the result of other desired changes. Both are tradeoffs, introducing coupling through a lack of redundancy, or complexity in how things are composed. Decomposing things into smaller parts or unifying them is neither universally good nor bad for the codebase, and largely depends on what changes come afterwards.
In the same way abstraction isn’t about code reuse, but coupling things for change, modularity isn’t about grouping similar things together by function, but working out how to keep things apart and limiting co-ordination across the codebase.
This means recognizing which bits are slightly more entangled than others, knowing which pieces need to talk to each other, which need to share resources, what shares responsibilities, and most importantly, what external constraints are in place and which way they are moving.
In the end, it’s about optimizing for those changes—and this is rarely achieved by aiming for reusable code, as sometimes handling changes means rewriting everything.
Rewrite Everything
Usually, a rewrite is only a practical option when it’s the only option left. Technical debt, or code the seniors wrote that we can’t be rude about, accrues until all change becomes hazardous. It is only when the system is at breaking point that a rewrite is even considered an option.
Sometimes the reasons can be less dramatic: an API is being switched off, a startup has taken a beautiful journey, or there’s a new fashion in town and orders from the top to chase it. Rewrites can happen to appease a programmer too—rewarding good teamwork with a solo project.
The reason rewrites are so risky in practice is that replacing one working system with another is rarely an overnight change. We rarely understand what the previous system did—many of its properties are accidental in nature. Documentation is scarce, tests are ornamental, and interfaces are organic in nature, stubbornly locking behaviors in place.
If migrating to the replacement depends on switching over everything at once, make sure you’ve booked a holiday during the transition, well in advance.
Successful rewrites plan for migration to and from the old system, plan to ease in the existing load, and plan to handle things being in one or both places at once. Both systems are continuously maintained until one of them can be decommissioned. A slow, careful migration is the only option that reliably works on larger systems.
To succeed, you have to start with the hard problems first—often performance related—but it can also mean dealing with the most difficult customer, or the biggest customer or user of the system. Rewrites must be driven by triage, reducing the problem in scope into something that can be effectively improved while being guided by the larger problems at hand.
If a replacement isn’t doing something useful after three months, odds are it will never do anything useful.
The longer it takes to run a replacement system in production, the longer it takes to find bugs. Unfortunately, migrations get pushed back in the name of feature development. A new project has the most room for feature bloat—this is known as the second-system effect.
The second-system effect is the name of the canonical doomed rewrite, one where numerous features are planned, not enough are implemented, and what has been written rarely works reliably. It’s similar to writing a game engine without a game to guide decisions, or a framework without a product inside. The resulting code is an unconstrained mess that is barely fit for its purpose.
The reason we say “Never Rewrite Code” is that we leave rewrites too late, demand too much, and expect them to work immediately. It’s more important to never rewrite in a hurry than to never rewrite at all.
null is true, everything is permitted
The problem with following advice to the letter is that it rarely works in practice. The problem with following it at all costs is that eventually we cannot afford to do so.
It isn’t “Don’t Repeat Yourself”, but “Some redundancy is healthy, some isn’t”, and using abstractions when you’re sure you want to couple things together.
It isn’t “Each thing has a unique component”, or other variants of the single responsibility principle, but “Decoupling parts into smaller pieces is often worth it if the interfaces are simple between them, and try to keep the fast changing and tricky to implement bits away from each other”.
It’s never “Don’t Rewrite!”, but “Don’t abandon what works”. Build a plan for migration, maintain in parallel, then decommission, eventually. In high-growth situations you can probably put off decommissioning, and possibly even migrations.
When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like “Don’t Repeat Yourself” are about making a tradeoff, usually one that’s good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems.
In a larger system, it’s much harder to understand the consequences of our design choices—in many cases the consequences are only discovered far, far too late in the process and it is only by throwing more engineers into the pit that there is any hope of completion.
In the end, we call our good decisions ‘clean code’ and our bad decisions ‘technical debt’, despite following the same rules and practices to get there.
programmingisterrible · 7 years ago
Text
Write code that's easy to delete, and easy to debug too.
Debuggable code is code that doesn’t outsmart you. Some code is a little harder to debug than the rest: code with hidden behaviour, poor error handling, ambiguity, too little or too much structure, or code that’s in the middle of being changed. On a large enough project, you’ll eventually bump into code that you don’t understand.
On an old enough project, you’ll discover code you forgot about writing—and if it wasn’t for the commit logs, you’d swear it was someone else. As a project grows in size it becomes harder to remember what each piece of code does, harder still when the code doesn’t do what it is supposed to. When it comes to changing code you don’t understand, you’re forced to learn about it the hard way: Debugging.
Writing code that’s easy to debug begins with realising you won’t remember anything about the code later.
Rule 0: Good code has obvious faults.
Many used-methodology salesmen have argued that the way to write understandable code is to write clean code. The problem is that “clean” is highly contextual in meaning. Clean code can be hardcoded into a system, and sometimes a dirty hack can be written in a way that’s easy to turn off. Sometimes the code is clean because the filth has been pushed elsewhere. Good code isn’t necessarily clean code.
Code being clean or dirty is more about how much pride, or embarrassment, the developer takes in the code, rather than how easy it has been to maintain or change. Instead of clean, we want boring code where change is obvious—I’ve found it easier to get people to contribute to a code base when the low-hanging fruit has been left around for others to collect. The best code might be anything you can look at and quickly learn things about.
Code that doesn’t try to make an ugly problem look good, or a boring problem look interesting.
Code where the faults are obvious and the behaviour is clear, rather than code with no obvious faults and subtle behaviours.
Code that documents where it falls short of perfect, rather than aiming to be perfect.
Code with behaviour so obvious that any developer can imagine countless different ways to go about changing it.
Sometimes, code is just nasty as fuck, and any attempts to clean it up leaves you in a worse state. Writing clean code without understanding the consequences of your actions might as well be a summoning ritual for maintainable code.
It is not to say that clean code is bad, but sometimes the practice of clean coding is more akin to sweeping problems under the rug. Debuggable code isn’t necessarily clean, and code that’s littered with checks or error handling rarely makes for pleasant reading.
Rule 1: The computer is always on fire.
The computer is on fire, and the program crashed the last time it ran.
The first thing a program should do is ensure that it is starting out from a known, good, safe state before trying to get any work done. Sometimes there isn’t a clean copy of the state because the user deleted it, or upgraded their computer. The program crashed the last time it ran and, rather paradoxically, the program is being run for the first time too.
For example, when reading and writing program state to a file, a number of problems can happen:
The file is missing
The file is corrupt
The file is an older version, or a newer one
The last change to the file is unfinished
The filesystem was lying to you
These are not new problems, and databases have been dealing with them since the dawn of time (1970-01-01). Using something like SQLite will handle many of them for you, but if the program crashed the last time it ran, the code might be run with the wrong data, or in the wrong way, too.
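As a minimal sketch of guarding against a few of the file problems above (the format, version field, and recovery policy are invented for illustration):

import json, os

def load_state(path, default):
    # Missing, corrupt, or unfinished writes all collapse into "start from
    # a known good state" rather than limping on with bad data.
    try:
        with open(path) as f:
            state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(default)
    if state.get("version") != default["version"]:
        return dict(default)   # older or newer format: refuse to guess
    return state

def save_state(path, state):
    # Write to a temporary file and rename it, so a crash mid-write leaves
    # the previous copy intact instead of a half-written one.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)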
With scheduled programs, for example, you can guarantee that the following accidents will occur:
It gets run twice in the same hour because of daylight savings time.
It gets run twice because an operator forgot it had already been run.
It will miss an hour, due to the machine running out of disk, or mysterious cloud networking issues.
It will take longer than an hour to run and may delay subsequent invocations of the program.
It will be run with the wrong time of day
It will inevitably be run close to a boundary, like midnight, end of month, end of year and fail due to arithmetic error.
Writing robust software begins with writing software that assumes it crashed the last time it ran, and crashes whenever it doesn’t know the right thing to do. The best thing about throwing an exception, rather than leaving a comment like “This Shouldn’t Happen”, is that when it inevitably does happen, you get a head-start on debugging your code.
You don’t have to be able to recover from these problems either—it’s enough to let the program give up and not make things any worse. Small checks that raise an exception can save weeks of tracing through logs, and a simple lock file can save hours of restoring from backup.
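Here is a sketch of such a lock file, assuming a single machine and an invented path; opening with O_EXCL means a second copy of the program gives up rather than quietly making things worse.

import os, sys

def acquire_lock(path="/tmp/myprogram.lock"):
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # instance of the program gets past this point.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("already running (or crashed without cleaning up %s)" % path)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return path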
Code that’s easy to debug is code that checks to see if things are correct before doing what was asked of it, code that makes it easy to go back to a known good state and trying again, and code that has layers of defence to force errors to surface as early as possible.
Rule 2: Your program is at war with itself.
Google’s biggest DoS attacks come from ourselves—because we have really big systems—although every now and then someone will show up and try to give us a run for our money, but really we’re more capable of hammering ourselves into the ground than anybody else is.
This is true for all systems.
Astrid Atkinson, Engineering for the Long Game
The software always crashed the last time it ran, and now it is always out of cpu, out of memory, and out of disk too. All of the workers are hammering an empty queue, everyone is retrying a failed request that’s long expired, and all of the servers have paused for garbage collection at the same time. Not only is the system broken, it is constantly trying to break itself.
Even checking if the system is actually running can be quite difficult.
It can be quite easy to implement something that checks if the server is running, but not if it is handling requests. Unless you check the uptime, it is possible that the program is crashing in between every check. Health checks can trigger bugs too: I have managed to write health checks that crashed the system they were meant to protect. On two separate occasions, three months apart.
In software, writing code to handle errors will inevitably lead to discovering more errors to handle, many of them caused by the error handling itself. Similarly, performance optimisations can often be the cause of bottlenecks in the system—making an app that’s pleasant to use in one tab can make an app that’s painful to use when you have twenty copies of it running.
Another example is where a worker in a pipeline is running too fast, and exhausting the available memory before the next part has a chance to catch up. If you’d rather a car metaphor: traffic jams. Speeding up is what creates them, and can be seen in the way the congestion moves back through the traffic. Optimisations can create systems that fail under high or heavy load, often in mysterious ways.
In other words: the faster you make it, the harder it will be pushed, and if you don’t allow your system to push back even a little, don’t be surprised if it snaps.
Back-pressure is one form of feedback within a system, and a program that is easy to debug is one where the user is involved in the feedback loop, having insight into all behaviours of a system, the accidental, the intentional, the desired, and the unwanted too. Debuggable code is easy to inspect, where you can watch and understand the changes happening within.
Rule 3: What you don’t disambiguate now, you debug later.
In other words: it should not be hard to look at the variables in your program and work out what is happening. Give or take some terrifying linear algebra subroutines, you should strive to represent your program’s state as obviously as possible. This means things like not changing your mind about what a variable does halfway through a program, if there is one obvious cardinal sin it is using a single variable for two different purposes.
It also means carefully avoiding the semi-predicate problem, never using a single value (count) to represent a pair of values (boolean, count). Avoiding things like returning a positive number for a result, and returning -1 when nothing matches. The reason is that it’s easy to end up in the situation where you want something like "0, but true" (and notably, Perl 5 has this exact feature), or you create code that’s hard to compose with other parts of your system (-1 might be a valid input for the next part of the program, rather than an error).
Along with using a single variable for two purposes, it can be just as bad to use a pair of variables for a single purpose—especially if they are booleans. I don’t mean keeping a pair of numbers to store a range is bad, but using a number of booleans to indicate what state your program is in is often a state machine in disguise.
When state doesn’t flow from top to bottom, give or take the occasional loop, it’s best to give the state a variable of its own and clean the logic up. If you have a set of booleans inside an object, replace it with a variable called state and use an enum (or a string if it’s persisted somewhere). The if statements end up looking like if state == name and stop looking like if bad_name && !alternate_option.
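A small sketch of that transformation, with the states and fields invented for illustration:

from enum import Enum

class State(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FAILED = "failed"
    DONE = "done"

def advance(job):
    # One variable holds the state, and the valid transitions are written
    # down in one place, instead of being implied by a tangle of booleans.
    if job.state == State.WAITING:
        job.state = State.RUNNING
    elif job.state == State.RUNNING:
        job.state = State.FAILED if job.has_error else State.DONE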
Even when you do make the state machine explicit, you can still mess up: sometimes code has two state machines hidden inside. I had great difficulty writing an HTTP proxy until I had made each state machine explicit, tracing connection state and parsing state separately. When you merge two state machines into one, it can be hard to add new states, or know exactly what state something is meant to be in.
This is far more about creating things you won’t have to debug, than making things easy to debug. By working out the list of valid states, it’s far easier to reject the invalid ones outright, rather than accidentally letting one or two through.
Rule 4: Accidental Behaviour is Expected Behaviour.
When you’re less than clear about what a data structure does, users fill in the gaps—any behaviour of your code, intended or accidental, will eventually be relied upon somewhere else. Many mainstream programming languages had hash tables you could iterate through, which sort-of preserved insertion order, most of the time.
Some languages chose to make the hash table behave as many users expected them to, iterating through the keys in the order they were added, but others chose to make the hash table return keys in a different order, each time it was iterated through. In the latter case, some users then complained that the behaviour wasn’t random enough.
Tragically, any source of randomness in your program will eventually be used for statistical simulation purposes, or worse, cryptography, and any source of ordering will be used for sorting instead.
In a database, some identifiers carry a little more information than others. When creating a table, a developer can choose between different types of primary key. The correct answer is a UUID, or something that’s indistinguishable from a UUID. The problem with the other choices is that they can expose ordering information as well as identity—not just if a == b but if a <= b—and by “other choices” I mostly mean auto-incrementing keys.
With an auto-incrementing key, the database assigns a number to each row in the table, adding 1 when a new row is inserted. This creates an ambiguity of sorts: people do not know which part of the data is canonical. In other words: do you sort by key, or by timestamp? Like with the hash tables before, people will decide the right answer for themselves. The other problem is that users can easily guess the keys of nearby records, too.
Ultimately any attempt to be smarter than a UUID will backfire: we already tried with postcodes, telephone numbers, and IP Addresses, and we failed miserably each time. UUIDs might not make your code more debuggable, but less accidental behaviour tends to mean less accidents.
Ordering is not the only piece of information people will extract from a key: If you create database keys that are constructed from the other fields, then people will throw away the data and reconstruct it from the key instead. Now you have two problems: when a program’s state is kept in more than one place, it is all too easy for the copies to start disagreeing with each other. It’s even harder to keep them in sync if you aren’t sure which one you need to change, or which one you have changed.
Whatever you permit your users to do, they’ll implement. Writing debuggable code is thinking ahead about the ways in which it can be misused, and how other people might interact with it in general.
Rule 5: Debugging is social, before it is technical.
When a software project is split over multiple components and systems, it can be considerably harder to find bugs. Once you understand how the problem occurs, you might have to co-ordinate changes across several parts in order to fix the behaviour. Fixing bugs in a larger project is less about finding the bugs, and more about convincing the other people that they’re real, or even that a fix is possible.
Bugs stick around in software because no-one is entirely sure who is responsible for things. In other words, it’s harder to debug code when nothing is written down, everything must be asked in Slack, and nothing gets answered until the one person who knows logs-on.
Planning, tools, process, and documentation are the ways we can fix this.
Planning is how we remove the stress of being on call, putting structures in place to manage incidents. Plans are how we keep customers informed, switch out people when they’ve been on call too long, and how we track problems and introduce changes to reduce future risk. Tools are the way in which we deskill work and make it accessible to others. Process is the way in which we remove control from the individual and give it to the team.
The people will change, the interactions too, but the processes and tools will be carried on as the team mutates over time. It isn’t so much valuing one more than the other, but building one to support changes in the other. Process can also be used to remove control from the team, so it isn’t always good or bad, but there is always some process at work, even when it isn’t written down, and the act of documenting it is the first step to letting other people change it.
Documentation means more than text files: documentation is how you handover responsibilities, how you bring new people up to speed, and how you communicate what’s changed to the people impacted by those changes. Writing documentation requires more empathy than writing code, and more skill too: there aren’t easy compiler flags or type checkers, and it’s easy to write a lot of words without documenting anything.
Without documentation, how can you expect people to make informed decisions, or even consent to the consequences of using the software? Without documentation, tools, or processes you cannot share the burden of maintenance, or even replace the people currently lumbered with the task.
Making things easy to debug applies just as much to the processes around code as the code itself, making it clear whose toes you will have to stand on to fix the code.
Code that’s easy to debug is easy to explain.
A common occurrence when debugging is realising the problem when explaining it to someone else. The other person doesn’t even have to exist but you do have to force yourself to start from scratch, explain the situation, the problem, the steps to reproduce it, and often that framing is enough to give us insight into the answer.
If only. Sometimes when we ask for help, we don’t ask for the right help, and I’m as guilty of this as anyone—it’s such a common affliction that it has a name: “The X-Y Problem”: How do I get the last three letters of a filename? Oh? No, I meant the file extension.
We talk about problems in terms of the solutions we understand, and we talk about the solutions in terms of the consequences we’re aware of. Debugging is learning the hard way about unexpected consequences and alternative solutions, and it involves one of the hardest things a programmer can ever do: admit that they got something wrong.
It wasn’t a compiler bug, after all.
programmingisterrible · 8 years ago
Link
Think of a team you work with closely. How strongly do you agree with these five statements?
If I take a chance and screw up, it will be held against me.
Our team has a strong sense of culture that can be hard for new people to join.
My team is slow to offer help to people who are struggling.
Using my unique skills and talents comes second to the objectives of the team.
It’s uncomfortable to have open, honest conversations about our team’s sensitive issues.
Teams that score high on questions like these can be deemed to be “unsafe.”
programmingisterrible · 8 years ago
Text
How do you cut a monolith in half?
It depends.
The problem with distributed systems, is that no matter what the question is, the answer is inevitably ‘It Depends’.
When you cut a larger service apart, where you cut depends on latency, resources, and access to state, but it also depends on error handling, availability, and recovery processes. It depends, but you probably don’t want to depend on a message broker.
Using a message broker to distribute work is like a cross between a load balancer and a database, with the disadvantages of both and the advantages of neither.
Message brokers, or persistent queues accessed by publish-subscribe, are a popular way to pull components apart over a network. They’re popular because they often have a low setup cost, and provide easy service discovery, but they can come at a high operational cost, depending where you put them in your systems.
In practice, a message broker is a service that transforms network errors and machine failures into filled disks. Then you add more disks. The advantage of publish-subscribe is that it isolates components from each other, but the problem is usually gluing them together.
For short-lived tasks, you want a load balancer
For short-lived tasks, publish-subscribe is a convenient way to build a system quickly, but you inevitably end up implementing a new protocol atop. You have publish-subscribe, but you really want request-response. If you want something computed, you’ll probably want to know the result.
Starting with publish-subscribe makes work assignment easy: jobs get added to the queue, workers take turns to remove them. Unfortunately, it makes finding out what happened quite hard, and you’ll need to add another queue to send a result back.
Once you can handle success, it is time to handle the errors. The first step is often adding code to retry the request a few times. After you DDoS your system, you put a call to sleep(). After you slowly DDoS your system, each retry waits twice as long as the previous.
(Aside: Accidental synchronisation is still a problem, as waiting to retry doesn’t prevent a lot of things happening at once.)
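The usual combination is exponential backoff with jitter, so that retries stop lining up. In this sketch, send_request, TransientError, and GiveUp are invented names standing in for whatever your system uses.

import random, time

class TransientError(Exception): pass   # assumed: raised on retryable failures
class GiveUp(Exception): pass

def retry(request, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return send_request(request)   # assumed RPC call
        except TransientError:
            # Double the wait each time, and add jitter so a crowd of clients
            # doesn't retry in lockstep after the same outage.
            delay = base_delay * (2 ** attempt)
            time.sleep(random.uniform(0, delay))
    raise GiveUp(request)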
As workers fail to keep up, clients give up and retry work, but the earlier request is still waiting to be processed. The solution is to move some of the queue back to clients, asking them to hold onto work until work has been accepted: back-pressure, or acknowledgements.
Although the components interact via publish-subscribe, we’ve created a request-response protocol atop. Now the message broker is really only doing two useful things: service discovery, and load balancing. It is also doing two not-so-useful things: enqueuing requests, and persisting them.
For short-lived tasks, the persistence is unnecessary: the client sticks around for as long as the work needs to be done, and handles recovery. The queuing isn’t that necessary either.
Queues inevitably run in two states: full, or empty. If your queue is running full, you haven’t pushed enough work to the edges, and if it is running empty, it’s working as a slow load balancer.
A mostly empty queue is still first-come-first-served, serving as a point of contention for requests. A broker often does nothing but wait for workers to poll for new messages. If your queue is meant to run empty, why wait to forward on a request?
(Aside: Something like random load balancing will work, but join-idle-queue is well worth your time investigating)
For distributing short-lived tasks, you can use a message broker, but you’ll be building a load balancer, along with an ad-hoc RPC system, with extra latency.
For long-lived tasks, you’ll need a database
A load balancer with service discovery won’t help you with long running tasks, or work that outlives the client, or manage throughput. You’ll want persistence, but not in your message broker. For long-lived tasks, you’ll want a database instead.
Although the persistence and queueing were obstacles for short-lived tasks, the disadvantages are less obvious for long-lived tasks, but similar things can go wrong.
If you care about the result of a task, you’ll want to store that it is needed somewhere other than in the persistent queue. If the task is run but fails midway, something will have to take responsibility for it, and the broker will have forgotten. This is why you want a database.
Duplicates in a queue often cause more headaches, as long-lived tasks have more opportunities to overlap. Although we’re using the broker to distribute work, we’re also using it implicitly as a mutex. To stop work from overlapping, you implement a lock atop. After it breaks a couple of times, you replace it with leases, adding timeouts.
(Note: This is not why you want a database, using transactions for long running tasks is suffering. Long running processes are best modelled as state machines.)
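As a rough sketch, a lease is little more than an owner and an expiry time checked in a single statement. The table and columns below are invented, and a real version needs more care around clock skew.

import time

LEASE_SECONDS = 60

def try_acquire(db, task_id, worker_id):
    now = time.time()
    # Claim the task only if nobody holds it, or the previous holder's
    # lease has expired; the UPDATE matches either one row or none.
    claimed = db.execute(
        "UPDATE tasks SET owner = ?, lease_expires = ? "
        "WHERE id = ? AND (owner IS NULL OR lease_expires < ?)",
        (worker_id, now + LEASE_SECONDS, task_id, now),
    ).rowcount
    return claimed == 1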
When the database becomes the primary source of truth, you can handle a broker going offline, or a broker losing the contents of a queue, by backfilling from the database. As a result, you don’t need to directly enqueue work with the broker, but mark it as required in the database, and wait for something else to handle it.
Assuming that something else isn’t a human who has been paged.
A message pump can scan the database periodically and send work requests to the broker. Enqueuing work in batches can be an effective way of making an expensive database call survivable. The pump responsible for enqueuing the work can also track if it has completed, and so handle recovery or retries too.
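A sketch of such a pump, with the queries and the broker client invented for illustration:

def pump_once(db, broker, batch=100):
    # One expensive database scan turns into a batch of cheap enqueues,
    # and the database, not the broker, remembers what still needs doing.
    rows = db.fetch("SELECT id, payload FROM tasks "
                    "WHERE state = 'pending' LIMIT ?", (batch,))
    for task_id, payload in rows:
        broker.publish("work", {"id": task_id, "payload": payload})
        db.execute("UPDATE tasks SET state = 'enqueued' WHERE id = ?", (task_id,))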
Backlog is still a problem, so you’ll want to use back-pressure to keep the queue fairly empty, and only fill from the database when needed. Although a broker can handle temporary overload, back-pressure should mean it never has to.
At this point the message broker is really providing two things: service discovery, and work assignment, but really you need a scheduler. A scheduler is what scans a database, works out which jobs need to run, and often where to run them too. A scheduler is what takes responsibility for handling errors.
(Aside: Writing a scheduler is hard. It is much easier to have 1000 while loops waiting for the right time, than one while loop waiting for which of the 1000 is first. A scheduler can track when it last ran something, but the work can’t rely on that being the last time it ran. Idempotency isn’t just your friend, it is your saviour.)
You can use a message broker for long-lived tasks, but you’ll be building a lock manager, a database, and a scheduler, along with yet another home-brew request-response system.
Publish-Subscribe is about isolating components
The problem with running tasks with publish-subscribe is that you really want request-response. The problem with using queues to assign work is that you don’t want to wait for a worker to ask.
The problem with relying on a persistent queue for recovery, is that recovery must get handled elsewhere, and the problem with brokers is nothing else makes service discovery so trivial.
Message brokers can be misused, but it isn’t to say they have no use. Brokers work well when you need to cross system boundaries.
Although you want to keep queues empty between components, it is convenient to have a buffer at the edges of your system, to hide some failures from external clients. When you handle external faults at the edges, you free the insides from handling them. The inside of your system can focus on handling internal problems, of which there are many.
A broker can be used to buffer work at the edges, but it can also be used as an optimisation, to kick off work a little earlier than planned. A broker can pass on a notification that data has been changed, and the system can fetch data through another API.
(Aside: If you use a broker to speed up a process, the system will grow to rely on it for performance. People use caches to speed up database calls, but there are many systems that simply do not work fast enough until the cache is warmed up, filled with data. Although you are not relying on the message broker for reliability, relying on it for performance is just as treacherous.)
Sometimes you want a load balancer, sometimes you’ll need a database, but sometimes a message broker will be a good fit.
Although persistence can’t handle many errors, it is convenient if you need to restart with new code or settings, without data loss. Sometimes the error handling offered is just right.
Although a persistent queue offers some protection against failure, it can’t take responsibility for when things go wrong halfway through a task. To be able to recover from failure you have to stop hiding it, you must add acknowledgements, back-pressure, error handling, to get back to a working system.
A persistent message queue is not bad in itself, but relying on it for recovery, and by extension, correct behaviour, is fraught with peril.
Systems grow by pushing responsibilities to the edges
Performance isn’t easy either. You don’t want queues, or persistence in the central or underlying layers of your system. You want them at the edges.
“It’s slow” is the hardest problem to debug, and often the reason is that something is stuck in a queue. For long and short-lived tasks, we used back-pressure to keep the queue empty, to reduce latency.
When you have several queues between you and the worker, it becomes even more important to keep the queue out of the centre of the network. We’ve spent decades on tcp congestion control to avoid it.
If you’re curious, the history of tcp congestion makes for interesting reading. Although the ends of a tcp connection were responsible for failure and retries, the routers were responsible for congestion: drop things when there is too much.
The problem is that it worked until the network was saturated, and similar to backlog in queues, when it broke, errors cascaded. The solution was similar: back-pressure. Similar to sleeping twice as long on errors, tcp sends half as many packets, before gradually increasing the amount as things improve.
Back-pressure is about pushing work to the edges, letting the ends of the conversation find stability, rather than trying to optimise all of the links in-between in isolation. Congestion control is about using back-pressure to keep the queues in-between as empty as possible, to keep latency down, and to increase throughput by avoiding the need to drop packets.
Pushing work to the edges is how your system scales. We have spent a lot of time and a considerable amount of money on IP-Multicast, but nothing has been as effective as BitTorrent. Instead of relying on smart routers to work out how to broadcast, we rely on smart clients to talk to each other.
Pushing recovery to the outer layers is how your system handles failure. In the earlier examples, we needed to get the client, or the scheduler to handle the lifecycle of a task, as it outlived the time on the queue.
Error recovery in the lower layers of a system is an optimisation, and you can’t push work to the centre of a network and scale. This is the end-to-end principle, and it is one of the most important ideas in system design.
The end-to-end principle is why you can restart your home router, when it crashes, without it having to replay all of the websites you wanted to visit before letting you ask for a new page. The browser (and your computer) is responsible for recovery, not the computers in between.
This isn’t a new idea, and Erlang/OTP owes a lot to it. OTP organises a running program into a supervision tree. Each process will often have one process above it, restarting it on failure, and above that, another supervisor to do the same.
(Aside: Pipelines aren’t incompatible with process supervision, one way is for each part to spawn the program that reads its output. A failure down the chain can propagate back up to be handled correctly.)
Although each program will handle some errors, the top levels of the supervision tree handle larger faults with restarts. Similarly, it’s nice if your webpage can recover from a fault, but inevitably someone will have to hit refresh.
The end-to-end principle is realising that no matter how many exceptions you handle deep down inside your program, some will leak out, and something at the outer layer has to take responsibility.
Although sometimes taking responsibility is writing things to an audit log, and message brokers are pretty good at that.
Aside: But what about replicated logs?
“How do I subscribe to the topic on the message broker?”
“It’s not a message broker, it’s a replicated log”
“Ok, How do I subscribe to the replicated log”
From ‘I believe I did, Bob’, jrecursive
Although a replicated log is often confused with a message broker, they aren’t immune from handling failure. Although it’s good the components are isolated from each other, they still have to be integrated into the system at large. Both offer a one way stream for sharing, both offer publish-subscribe like interfaces, but the intent is wildly different.
A replicated log is often about auditing, or recovery: having a central point of truth for decisions. Sometimes a replicated log is about building a pipeline with fan-in (aggregating data), or fan-out (broadcasting data), but always building a system where data flows in one direction.
The easiest way to see the difference between a replicated log and a message broker is to ask an engineer to draw a diagram of how the pieces connect.
If the diagram looks like a one-way system, it’s a replicated log. If almost every component talks to it, it’s a message broker. If you can draw a flow-chart, it’s a replicated log. If you take all the arrows away and you’re left with a Venn diagram of ‘things that talk to each other’, it’s a message broker.
Be warned: A distributed system is something you can draw on a whiteboard pretty quickly, but it’ll take hours to explain how all the pieces interact.
You cut a monolith with a protocol
How you cut a monolith is often more about how you are cutting up responsibility within a team than about cutting it into components. It really does depend, often more on the social aspects than the technical ones, but you are still responsible for the protocol you create.
Distributed systems are messy because of how the pieces interact over time, rather than which pieces are interacting. The complexity of a distributed system does not come from having hundreds of machines, but hundreds of ways for them to interact. A protocol must take into account performance, safety, stability, availability, and most importantly, error handling.
When we talk about distributed systems, we are talking about power structures: how resources are allocated, how work is divided, how control is shared, or how order is kept across systems ostensibly built out of well meaning but faulty components.
A protocol is the rules and expectations of participants in a system, and how they are beholden to each other. A protocol defines who takes responsibility for failure.
The problem with message brokers, and queues, is that no-one does.
Using a message broker is not the end of the world, nor a sign of poor engineering. Using a message broker is a tradeoff. Use them freely knowing they work well on the edges of your system as buffers. Use them wisely knowing that the buck has to stop somewhere else. Use them cheekily to get something working.
I say don’t rely on a message broker, but I can’t point to easy off-the-shelf answers. HTTP and DNS are remarkable protocols, but I still have no good answers for service discovery.
Lots of software regularly gets pushed into service way outside of its designed capabilities, and brokers are no exception. Although the bad habits around brokers and the relative ease of getting a prototype up and running lead to nasty effects at scale, you don’t need to build everything at once.
The complexity of a system lies in its protocol not its topology, and a protocol is what you create when you cut your monolith into pieces. If modularity is about building software, protocol is about how we break it apart.
The main task of the engineering analyst is not merely to obtain “solutions” but is rather to understand the dynamic behaviour of the system in such a way that the secrets of the mechanism are revealed, and that if it is built it will have no surprises left for [them]. Other than exhaustive physical experimentations, this is the only sound basis for engineering design, and disregard of this cardinal principle has not infrequently led to disaster.
From “Analysis of Nonlinear Control Systems” by Dunstan Graham and Duane McRuer, p 436
Protocol is the reason why ‘it depends’, and the reason why you shouldn’t depend on a message broker: You can use a message broker to glue systems together, but never use one to cut systems apart.
35 notes · View notes
programmingisterrible · 9 years ago
Video
(embedded YouTube video)
I like this talk a lot: what modularity is, what we use it for, how modularity happens in systems, and how we can use modularity to manage change.
10 notes · View notes
programmingisterrible · 9 years ago
Text
RIP, Mathie.
Last night I found out I'd lost a friend, and if you'll be patient with my words, I'd like to reflect a little.
Mathie was one of the older, weirder geeks I met. I'd escaped my home town on the edge of nowhere, and it was my first time having a peer group of adults.
He'd helped everywhere: with the student-run shell server, with the local IRC server everyone collected on; a known and friendly face on the circuit.
Mathie was one of the many people behind Scottish Ruby Conference, responsible for bringing a lot of interesting people into Edinburgh, and into my life.
Why wasn't this talk given at Scottish Ruby Conference
I fucked up
In front of everyone assembled at the fringe track, a collection of talks that didn't quite make it, mathie answered honestly. It's kinda how I'll remember him: a bit of a fuckup.
A fuckup who changed my life for the better.
Thanks mathie, I hope to pass on some of your kindness.
RIP, You fucking idiot.
16 notes · View notes
programmingisterrible · 9 years ago
Text
PapersWeLove London: End-to-End Arguments In System Design
This week I gave a short talk on a paper I love: End-to-End Arguments in System Design
The talk was recorded and uploaded (but not captioned), and you can watch it here: https://skillsmatter.com/skillscasts/8200-end-to-end-arguments-in-system-design-by-saltzer-reed-and-clark
6 notes · View notes
programmingisterrible · 9 years ago
Text
A million things to do with a computer!
I gave a talk at !!con last weekend, about my favourite programming language, Scratch:
Back in 1971, Cynthia Solomon and Seymour Papert wrote “Twenty things to do with a computer”, about their experiences of teaching children to use Logo and their ideas for the future.
They were wrong: There’s a lot more than twenty. Logo’s successor, Scratch, has over thirteen million things that children and adults alike have built. Scratch is radically approachable in a way that puts every other language to shame.
This talk is about the history, present, and future of Scratch: why Scratch is about ‘coding to learn’, and not about ‘learning to code’.
I had an incredible time at !!con. The live captioning was fantastic (and they're crowdfunding a game to teach steno, too).
The livestreams are up (but no captions), and my talk is 3h29m32s in on day 2.
19 notes · View notes
programmingisterrible · 9 years ago
Text
Addendum: Write code that is easy to delete, not easy to extend.
I found two translations by accident. I can’t tell if they are perfect translations but I am thankful nonetheless.
Russian
Chinese
(Many people mentioned The Wrong Abstraction, and it is worth mentioning here too.)
12 notes · View notes
programmingisterrible · 9 years ago
Text
Write code that is easy to delete, not easy to extend.
“Every line of code is written without reason, maintained out of weakness, and deleted by chance” Jean-Paul Sartre’s Programming in ANSI C.
Every line of code written comes at a price: maintenance. To avoid paying for a lot of code, we build reusable software. The problem with code re-use is that it gets in the way of changing your mind later on.
The more consumers of an API you have, the more code you must rewrite to introduce changes. Similarly, the more you rely on a third-party API, the more you suffer when it changes. Managing how the code fits together, or which parts depend on others, is a significant problem in large-scale systems, and it gets harder as your project grows older.
My point today is that, if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent” EWD 1036
If we see ‘lines of code’ as ‘lines spent’, then when we delete lines of code, we are lowering the cost of maintenance. Instead of building re-usable software, we should try to build disposable software.
I don’t need to tell you that deleting code is more fun than writing it.
To write code that’s easy to delete: repeat yourself to avoid creating dependencies, but don’t repeat yourself to manage them. Layer your code too: build simple-to-use APIs out of simpler-to-implement but clumsy-to-use parts. Split your code: isolate the hard-to-write and the likely-to-change parts from the rest of the code, and each other. Don’t hard code every choice, and maybe allow changing a few at runtime. Don’t try to do all of these things at the same time, and maybe don’t write so much code in the first place.
Step 0: Don’t write code
The number of lines of code doesn’t tell us much on its own, but the magnitude does: 50, 500, 5,000, 10,000, 25,000, etc. A million-line monolith is going to be more annoying than a ten-thousand-line one, and take significantly more time, money, and effort to replace.
Although the more code you have the harder it is to get rid of, saving one line of code saves absolutely nothing on its own.
Even so, the easiest code to delete is the code you avoided writing in the first place.
Step 1: Copy-paste code
Building reusable code is easier to do in hindsight, with a couple of examples of use in the code base, than with foresight of the ones you might want later. On the plus side, you’re probably re-using a lot of code already just by using the file system, so why worry that much? A little redundancy is healthy.
It’s good to copy-paste code a couple of times, rather than making a library function, just to get a handle on how it will be used. Once you make something a shared API, you make it harder to change.
The code that calls your function will rely on both the intentional and the unintentional behaviours of the implementation behind it. The programmers using your function will not rely on what you document, but what they observe.
It’s simpler to delete the code inside a function than it is to delete a function.
Step 2: Don’t copy paste code
When you’ve copy and pasted something enough times, maybe it’s time to pull it up to a function. This is the “save me from my standard library” stuff: the “open a config file and give me a hash table”, “delete this directory”. This includes functions without any state, or functions with a little bit of global knowledge like environment variables. The stuff that ends up in a file called “util”.
Aside: Make a util directory and keep different utilities in different files. A single util file will always grow until it is too big and yet too hard to split apart. Using a single util file is unhygienic.
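For instance, the sort of small, stateless helper that belongs in one of those files might look like the sketch below (the function name and config format are invented, not from any particular project):

```python
import json
import os

def load_config(path, defaults=None):
    """Open a config file and give me a hash table.

    Falls back to `defaults` if the file is missing, and lets environment
    variables override individual keys -- a little global knowledge, no more.
    """
    config = dict(defaults or {})
    if os.path.exists(path):
        with open(path) as handle:
            config.update(json.load(handle))
    for key in list(config):
        override = os.environ.get(key.upper())
        if override is not None:
            config[key] = override
    return config
```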
The less specific the code is to your application or project, the easier it is to re-use and the less likely it is to change or be deleted. Think of library code like logging, third-party APIs, file handles, or processes. Other good examples of code you’re not going to delete are lists, hash tables, and other collections: not just because they often have very simple interfaces, but because they don’t grow in scope over time.
Instead of making code easy-to-delete, we are trying to keep the hard-to-delete parts as far away as possible from the easy-to-delete parts.
Step 3: Write more boilerplate
Despite writing libraries to avoid copy-pasting, we often end up writing a lot more code through copy-paste to use them, but we give it a different name: boilerplate. Boilerplate is a lot like copy-pasting, but you change some of the code in a different place each time, rather than the same bit over and over.
Like with copy-paste, we are duplicating parts of code to avoid introducing dependencies and to gain flexibility, and we pay for it in verbosity.
Libraries that require boilerplate are often things like network protocols, wire formats, or parsing kits: stuff where it’s hard to interweave policy (what a program should do) and protocol (what a program can do) without limiting the options. This code is hard to delete: it’s often a requirement for talking to another computer or handling different files, and the last thing we want to do is litter it with business logic.
This is not an exercise in code reuse: we’re trying to keep the parts that change frequently away from the parts that are relatively static, minimising the dependencies and responsibilities of library code even if we have to write boilerplate to use it.
You are writing more lines of code, but you are writing those lines of code in the easy-to-delete parts.
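As a sketch of what that separation looks like (the frame format below is invented): the wire-format code knows nothing about what the messages mean, so every caller writes a little boilerplate around it, and the business logic stays out of the hard-to-delete part.

```python
import struct

def write_frame(sock, payload: bytes):
    """Protocol: length-prefixed frames, with no opinions about the contents."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def read_frame(sock) -> bytes:
    (length,) = struct.unpack(">I", _read_exactly(sock, 4))
    return _read_exactly(sock, length)

def _read_exactly(sock, n):
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        data += chunk
    return data

# Policy lives in the caller, as boilerplate around the protocol code, e.g.:
#   write_frame(sock, json.dumps({"op": "create", "user": user}).encode())
#   reply = json.loads(read_frame(sock))
```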
Step 4: Don’t write boilerplate
Boilerplate works best when libraries are expected to cater to all tastes, but sometimes there is just too much duplication. It’s time to wrap your flexible library with one that has opinions on policy, workflow, and state. Building simple-to-use APIs is about turning your boilerplate into a library.
This isn’t as uncommon as you might think: one of the most popular and beloved Python HTTP clients, requests, is a successful example of providing a simpler interface, powered by the more verbose-to-use urllib3 underneath. requests caters to the common workflows of using HTTP, and hides many practical details from the user. Meanwhile, urllib3 does the pipelining and connection management, and does not hide anything from the user.
It is not so much that we are hiding detail when we wrap one library in another, but that we are separating concerns: requests is about popular HTTP adventures, and urllib3 is about giving you the tools to choose your own adventure.
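The difference in feel, roughly (the URL is made up, and the details are from memory, so treat this as a sketch rather than a tutorial for either library):

```python
import json

# The simple-to-use layer: requests has opinions about the common case.
import requests
things = requests.get("https://example.com/api/things", timeout=5).json()

# The simpler-to-implement, clumsier-to-use layer: urllib3 hands you the
# connection pool and leaves the decisions (decoding, error policy) to you.
import urllib3
pool = urllib3.PoolManager()
raw = pool.request("GET", "https://example.com/api/things")
things = json.loads(raw.data.decode("utf-8"))
```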
I’m not advocating you go out and create a /protocol/ and a /policy/ directory, but you do want to try and keep your util directory free of business logic, and build simpler-to-use libraries on top of simpler-to-implement ones. You don’t have to finish writing one library to start writing another atop.
It’s often good to wrap third party libraries too, even if they aren’t protocol-esque. You can build a library that suits your code, rather than lock in your choice across the project. Building a pleasant to use API and building an extensible API are often at odds with each other.
This split of concerns allows us to make some users happy without making things impossible for other users. Layering is easiest when you start with a good API, but writing a good API on top of a bad one is unpleasantly hard. Good APIs are designed with empathy for the programmers who will use them, and layering is realising we can’t please everyone at once.
Layering is less about writing code we can delete later, and more about making the hard-to-delete code pleasant to use (without contaminating it with business logic).
Step 5: Write a big lump of code
You’ve copy-pasted, you’ve refactored, you’ve layered, you’ve composed, but the code still has to do something at the end of the day. Sometimes it’s best just to give up and write a substantial amount of trashy code to hold the rest together.
Business logic is code characterised by a never ending series of edge cases and quick and dirty hacks. This is fine. I am ok with this. Other styles like ‘game code’, or ‘founder code’ are the same thing: cutting corners to save a considerable amount of time.
The reason? Sometimes it’s easier to delete one big mistake than to delete 18 smaller interleaved mistakes. A lot of programming is exploratory, and it’s quicker to get it wrong a few times and iterate than to try to get it right the first time.
This is especially true of more fun or creative endeavours. If you’re writing your first game: don’t write an engine. Similarly, don’t write a web framework before writing an application. Go and write a mess the first time. Unless you’re psychic you won’t know how to split it up.
Monorepos are a similar tradeoff: You won’t know how to split up your code in advance, and frankly one large mistake is easier to deploy than 20 tightly coupled ones.
When you know what code is going to be abandoned soon, deleted, or easily replaced, you can cut a lot more corners. This goes double if you make one-off client sites or event web pages: anything where you have a template and stamp out copies, or where you fill in the gaps left by a framework.
I’m not suggesting you write the same ball of mud ten times over, perfecting your mistakes. To quote Perlis: “Everything should be built top-down, except the first time”. You should be trying to make new mistakes each time, take new risks, and slowly build up through iteration.
Becoming a professional software developer is accumulating a back-catalogue of regrets and mistakes. You learn nothing from success. It is not that you know what good code looks like, but the scars of bad code are fresh in your mind.
Projects either fail or become legacy code eventually anyway. Failure happens more than success. It’s quicker to write ten big balls of mud and see where it gets you than try to polish a single turd.
It’s easier to delete all of the code than to delete it piecewise.
Step 6: Break your code into pieces
Big balls of mud are the easiest to build but the most expensive to maintain. What feels like a simple change ends up touching almost every part of the code base in an ad-hoc fashion. What was easy to delete as a whole is now impossible to delete piecewise.
In the same way we have layered our code to separate responsibilities, from platform-specific to domain-specific, we need to find a means to tease apart the logic on top.
[Start] with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others. D. Parnas
Instead of breaking code into parts with common functionality, we break code apart by what it does not share with the rest. We isolate the most frustrating parts to write, maintain, or delete away from each other.
We are not building modules around being able to re-use them, but being able to change them.
Unfortunately, some problems are more intertwined and harder to separate than others. Although the single responsibility principle suggests that ‘each module should only handle one hard problem’, it is more important that ‘each hard problem is only handled by one module’.
When a module does two things, it is usually because changing one part requires changing the other. It is often easier to have one awful component with a simple interface, than two components requiring a careful co-ordination between them.
I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [”loose coupling”], and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the code base involved in this case is not that. SCOTUS Justice Stewart
A system where you can delete parts without rewriting others is often called loosely coupled, but it’s a lot easier to explain what one looks like rather than how to build it in the first place.
Even hardcoding a variable once can be loose coupling, or using a command line flag over a variable. Loose coupling is about being able to change your mind without changing too much code.
For example, Microsoft Windows has internal and external APIs for this very purpose. The external APIs are tied to the lifecycle of desktop programs, and the internal API is tied to the underlying kernel. Hiding these APIs away gives Microsoft flexibility without breaking too much software in the process.
HTTP has examples of loose coupling too: Putting a cache in front of your HTTP server. Moving your images to a CDN and just changing the links to them. Neither breaks the browser.
HTTP’s error codes are another example of loose coupling: common problems across web servers have their own codes. When you get a 400 error, doing it again will get the same result; a 500 may change. As a result, HTTP clients can handle many errors on the programmer’s behalf.
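A client can lean on that contract without knowing anything about the server on the other end. A sketch, using requests (the retry policy here is made up; the status-code semantics are HTTP’s):

```python
import time
import requests

def get_with_retries(url, attempts=3):
    """Retry server-side (5xx) failures; give up immediately on 4xx.

    The codes mean the same thing whatever web server is on the other
    end -- that shared meaning is the loose coupling.
    """
    for attempt in range(attempts):
        response = requests.get(url, timeout=5)
        if response.status_code < 500:
            return response           # success, or a 4xx that retrying won't fix
        time.sleep(2 ** attempt)      # a 5xx might go away, so back off and retry
    return response
```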
How your software handles failure must be taken into account when decomposing it into smaller pieces. Doing so is easier said than done.
I have decided, reluctantly to use LaTeX. Making reliable distributed systems in the presence of software errors. Armstrong, 2003
Erlang/OTP is relatively unique in how it chooses to handle failure: supervision trees. Roughly, each process in an Erlang system is started by and watched by a supervisor. When a process encounters a problem, it exits. When a process exits, it is restarted by the supervisor.
(These supervisors are started by a bootstrap process, and when a supervisor encounters a fault, it is restarted by the bootstrap process)
The key idea is that it is quicker to fail-fast and restart than it is to handle errors. Error handling like this may seem counter-intuitive, gaining reliability by giving up when errors happen, but turning things off-and-on again has a knack for suppressing transient faults.
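A back-of-the-envelope version of that idea in Python, using OS processes. This is a toy, not OTP: real supervisors also limit restart rates and escalate to a supervisor of their own, and the worker below is hypothetical.

```python
import multiprocessing
import time

def supervise(worker, *args, restart_delay=1.0):
    """Keep `worker` running in its own process; if it dies, start a fresh one."""
    while True:
        child = multiprocessing.Process(target=worker, args=args)
        child.start()
        child.join()               # wait for the worker to exit, cleanly or not
        time.sleep(restart_delay)  # fail fast and restart, don't try to patch up

# supervise(handle_queue, "work-queue")   # `handle_queue` is a made-up worker
```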
Error handling and recovery are best done at the outer layers of your code base. This is known as the end-to-end principle. The end-to-end principle argues that it is easier to handle failure at the far ends of a connection than anywhere in the middle: if you have any error handling inside, you still have to do the final top-level check, and if the layers above must handle errors anyway, why bother handling them on the inside too?
Error handling is one of the many ways in which a system can be tightly bound together. There are many other examples of tight coupling, but it is a little unfair to single one out as being badly designed. Except for IMAP.
In IMAP almost every operation is a snowflake, with unique options and handling. Error handling is painful: errors can come halfway through the result of another operation.
Instead of UUIDs, IMAP generates unique tokens to identify each message. These can change halfway through the result of an operation too. Many operations are not atomic. It took more than 25 years to get a way to move email from one folder to another that works reliably. There is a special UTF-7 encoding, and a unique base64 encoding too.
I am not making any of this up.
By comparison, both file systems and databases make much better examples of remote storage. With a file system, you have a fixed set of operations, but a multitude of objects you can operate on.
Although SQL may seem like a much broader interface than a filesystem, it follows the same pattern. A number of operations on sets, and a multitude of rows to operate on. Although you can’t always swap out one database for another, it is easier to find something that works with SQL over any homebrew query language.
Other examples of loose coupling are systems with middleware, or filters and pipelines. For example, Twitter’s Finagle uses a common API for services, which allows generic timeout handling, retry mechanisms, and authentication checks to be added effortlessly to client and server code.
(I’m sure if I didn’t mention the UNIX pipeline here someone would complain at me)
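In Python, the same shape can be sketched as functions that wrap a service callable, so timeouts, retries, and logging are written once and stacked onto any client or server. This is the flavour of Finagle’s filters, not its API; the names here are invented.

```python
import time

# A "service" is just a callable from request to response; a filter takes a
# service and returns a wrapped service, so filters compose like pipelines.

def with_retries(service, attempts=3):
    def wrapped(request):
        for attempt in range(attempts - 1):
            try:
                return service(request)
            except IOError:
                time.sleep(2 ** attempt)
        return service(request)          # final attempt: let the error escape
    return wrapped

def with_logging(service, log=print):
    def wrapped(request):
        log("request:", request)
        response = service(request)
        log("response:", response)
        return response
    return wrapped

# Stacking filters onto a hypothetical client:
#   client = with_logging(with_retries(fetch_user_profile))
```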
First we layered our code, but now some of those layers share an interface: a common set of behaviours and operations with a variety of implementations. Good examples of loose coupling are often examples of uniform interfaces.
A healthy code base doesn’t have to be perfectly modular. The modular bit makes it way more fun to write code, in the same way that Lego bricks are fun because they all fit together. A healthy code base has some verbosity, some redundancy, and just enough distance between the moving parts so you won’t trap your hands inside.
Code that is loosely coupled isn’t necessarily easy-to-delete, but it is much easier to replace, and much easier to change too.
Step 7: Keep writing code
Being able to write new code without dealing with old code makes it far easier to experiment with new ideas. It isn’t so much that you should write microservices and not monoliths, but your system should be capable of supporting one or two experiments atop while you work out what you’re doing.
Feature flags are one way to change your mind later. Although feature flags are seen as ways to experiment with features, they also let you release changes without re-deploying your software.
Google Chrome is a spectacular example of the benefits they bring. The Chrome team found that the hardest part of keeping a regular release cycle was the time it took to merge long-lived feature branches in.
By being able to turn new code on and off without recompiling, larger changes could be broken down into smaller merges without impacting existing code. With new features appearing earlier in the same code base, it became more obvious when long-running feature development would impact other parts of the code.
A feature flag isn’t just a command line switch, it’s a way of decoupling feature releases from merging branches, and decoupling feature releases from deploying code. Being able to change your mind at runtime becomes increasingly important when it can take hours, days, or weeks to roll out new software. Ask any SRE: Any system that can wake you up at night is one worth being able to control at runtime.
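At its smallest, a feature flag is just a lookup you can change at runtime. A sketch (the flag store below is a plain dict standing in for whatever your real runtime configuration is, and the checkout functions are placeholders):

```python
# Stand-in flag store; in practice it would be backed by configuration you
# can flip without a deploy (a database row, a config service, and so on).
FLAGS = {"new_checkout_flow": False}

def old_checkout(cart):
    return {"total": sum(cart), "flow": "old"}   # the path we'll delete one day

def new_checkout(cart):
    return {"total": sum(cart), "flow": "new"}   # merged early, shipped dark

def checkout(cart):
    # Flipping the flag at runtime turns the new code on; deleting the old
    # path later is a small, boring change rather than a long-lived branch.
    if FLAGS.get("new_checkout_flow"):
        return new_checkout(cart)
    return old_checkout(cart)
```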
It isn’t so much that you’re iterating, as that you have a feedback loop. It is not so much that you are building modules to re-use, as that you are isolating components for change. Handling change is not just developing new features, but getting rid of old ones too. Writing extensible code is hoping that in three months’ time you got everything right. Writing code you can delete is working on the opposite assumption.
The strategies I’ve talked about — layering, isolation, common interfaces, composition — are not about writing good software, but about how to build software that can change over time.
The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. […] Hence plan to throw one away; you will, anyhow. Fred Brooks
You don’t need to throw it all away but you will need to delete some of it. Good code isn’t about getting it right the first time. Good code is just legacy code that doesn’t get in the way.
Good code is easy to delete.
Acknowledgments
Thank you to all of my proof readers for your time, patience, and effort.
Further Reading
Layering/Decomposition
On the Criteria To Be Used in Decomposing Systems into Modules, D.L. Parnas.
How To Design A Good API and Why it Matters, J. Bloch.
The Little Manual of API Design, J. Blanchette.
Python for Humans, K. Reitz.
Common Interfaces
The Design of the MH Mail System, a Rand technical report.
The Styx Architecture for Distributed Systems
Your Server as a Function, M. Eriksen.
Feedback loops/Operations lifecycle
Chrome Release Cycle, A. Laforge.
Why Do Computers Stop and What Can Be Done About It?, J. Gray.
How Complex Systems Fail, R. I. Cook.
The technical is social before it is technical.
All Late Projects Are the Same, Software Engineering: An Idea Whose Time Has Come and Gone?, T. DeMarco.
Epigrams in Programming, A. Perlis.
How Do Committees Invent?, M.E. Conway.
The Tyranny of Structurelessness, J. Freeman
Other posts I've written about software.
(Added 2019-07-22)
Repeat yourself, do more than one thing, and rewrite everything.
How do you cut a monolith in half?
Write code that’s easy to delete, and easy to debug too.
Contributed Translations
Пишите код, который легко удалять, а не дополнять.
要写易删除,而不是易扩展的代码.
확장하기 쉬운 코드가 아니라 삭제하기 쉬운 코드를 작성하자.
231 notes · View notes
programmingisterrible · 9 years ago
Video
(embedded YouTube video)
This is short, and packed with the voice of experience.
12 notes · View notes
programmingisterrible · 10 years ago
Video
(embedded YouTube video)
I got my talk transcribed, and now it has subtitles in english.
40 notes · View notes
programmingisterrible · 10 years ago
Text
Nothing is more indicative of a bullshit job than the interview
Technical interviews, as they stand, are at best a proxy for finding people exactly like the interviewer and at worst part of the systematic discrimination plaguing tech.
We have very little idea about what makes good code, so it should come as no surprise that we have little-to-no idea how to find people who are good at coding, along with the dozens of complementary skills. Tech interviews boil down to finding “people like us”, and a quick glance at any startup’s team page should confirm this.
Along with finding “cultural fits”, there is nothing more terrifyingly common than an interviewer trying to trick you, in order to show how clever they are. Interviewers love algorithmic puzzles, even though they are not indicative of the work at hand.
Aside: The Wason selection task is one simple way to demonstrate that the framing of a problem determines the ability of people to solve it.
In one interview I had, the lazy interviewer forgot the problem he’d googled five minutes before the interview, stumbled through it, struggling to remember the tricky parts I was meant to guess.
Other asinine things I have been asked in interviews include guessing “how much teflon is there in the world” (and they didn't like it very much when I challenged them on the relevance of the question). I've also been sent a dodgy, pirated, 300-page PDF of interview techniques to prepare for a Google interview.
Despite being asked to reverse a linked list in almost every interview I've had, I have only ever used linked lists for two things: a) computer science exams, and b) interviews by people who passed the former.
Sometimes algorithms are indicative of the work at hand — natural language parsing, numerical methods — and sometimes a simple algorithm is used as a filter to check you can actually implement a simple, specified problem.
However, asking questions which require a trick to solve correctly only serves to filter out candidates who have heard of the trick before.
Most developer jobs require applicants to come into an office conference room and answer puzzle-style programming questions on a whiteboard.
There are entire books full of strategies for whiteboard interviews. They list the types of questions you’ll get & give sample solutions.
These books are for developers, & they exist because whiteboarding is a skill completely separate from actual software development.
I mentor students at several coding boot camps. Every single school teaches "whiteboard skills" as a distinct unit.
They teach it separately because whiteboarding is a skill completely separate from actual software development
From @sarahmei's excellent series of tweets
The amount of bullshit questions you are asked in the interview is directly proportional to the bullshit in the job. Places with interesting puzzles to solve can and will rely on your intrinsic motivation to put up with their dysfunctional and toxic environments.
I really wish Liz Rush’s amazing “On Interviewing as a Junior Dev” was around when I was starting out—it took me forever to realise that interviews are as much about you judging the company as they are about the company judging you.
I know being able to pick your job is a privilege few have, and I’ve fallen into a number of toxic jobs to keep a roof over my head and food on my plate, but maybe I wouldn't have started a job with the naïve belief that the interesting work would save me from burning out.
In the end, you shouldn't look for jobs where just the technical problems are interesting, but also for ones where the people are friendly—the work you do is more likely to change than the culture you work in.
Don’t forget: the correct answer to “How do you reverse a linked list” is “Thanks for your time but I’ll see myself out”.
165 notes · View notes