How GitHub Builds Software

31 July 2016

See every step of how we ship changes to github.com, starting from a new, empty MacBook.

Presented at TechMeetup Edinburgh in 2016 and FreeAgent and RubyConf Kenya in 2017.

0:00 So the screen's a little bit smaller than I thought it was, so you probably can't read anything on any of the slides, but that's going to be good.
0:07 So we're going to do something a little bit different to you, Joel, in that because I'm going to cover quite a lot of stuff at a surface level,
0:16 let's try and see what happens if you have questions about stuff, ask them throughout.
0:24 And I might regret this idea, like, oh, why did we do it, but let's try this and see how it goes.
0:31 Right, so this is me, it's even the same t-shirt, coincidentally, yes.
0:38 Where do I get that t-shirt?
0:40 The internet, I think we can't get this one, but we do have a new one, we've got, because it's Pride Month this month, I think, or in America or something.
0:49 So yeah, we have a Pride t-shirt and we have a TransoCat t-shirt as well, so you'll be able to buy them shortly, I'm not meant to talk about that yet.
0:56 What's your dog called?
0:57 A dog is called Lucy, and she is a senior engineer at the art, that's how you can contact her.
1:05 Is she in the room?
1:07 She is a congressman.
1:08 Well, let's, let's, the question thing has already started going slightly, so I may or may not choose to accept your question.
1:20 Moving on.
1:21 So, I'm going to give you this lovely overview, which you can't read, of how you can build software.
1:27 What I'm going to talk to you about is, because I figured it's kind of interesting, I think, not just the way we kind of deploy software, write software, whatever, quite often, like, we end up doing talks about one of these individual things.
1:39 So, I figured it'd be interesting to do a talk based on, okay, you're an engineer at GitHub, and you get given a laptop, and you haven't done anything to that laptop.
1:47 Like, let's talk about, like, what it looks like to go from, you have a new laptop, to you push your code to production.
1:54 So, first thing we do is this little project thing that I built, called Strap.
2:00 So, before there was Strap, there was something called Boxen.
2:03 Boxen was a previous GitHub project that used Puppet to, like, manage people's development machines, and basically that was a fine idea.
2:10 I'm not going to get into much detail, but effectively, my high-level thinking, having used Puppet and Chef in situations where they're more or less universally a bad idea, is that they work well for what they're designed for.
2:23 They don't work well when people monkey around with a computer underneath, and then Puppet goes, "Hey, I'm going to do some stuff."
2:28 "Oh, you've changed everything, ah!" and then dies.
2:31 And so, as I alluded to earlier on, I replaced, like, you know, multiple thousands of lines of Puppet and stuff with a 100-line bash script instead.
2:44 So, this is what Strap does: you can go to strap.githubapp.com, or you can basically, with one click, spin it up on Heroku.
2:53 It's, not that one, I'll show you another thing in a second.
2:57 Basically, what it does is install a minimal set of stuff that is probably useful for developers.
3:04 Now, by minimal set, I mean, like, really minimal, because I hate having, like, cruft on my machine.
3:09 So, pretty much all it does is stuff like turn on FileVault and set up your GitHub credentials, so you can clone repositories and stuff like that.
3:17 And installs Homebrew, but doesn't actually install anything in Homebrew, and stuff like that.
3:22 And installs Homebrew, because I'm a Homebrew maintainer as well, so I'm biased.
3:26 So, basically, the way it works is, you get your new laptop, you then go to this site, it will then ask you to log in, using your GitHub details, you log in doing that.
3:36 And then, you have a little script that you download.
3:39 Note, I'm not doing the whole pipe to the bash script into the terminal thing, because I would actually have done that, but it was going to be harder, so I didn't.
3:46 Yes.
3:47 Sorry.
3:48 No, someone was leaving, or something like that.
3:50 Who knows?
3:51 Right.
3:52 So, I download the script onto my machine, I then run it with bash, like this.
3:56 It's then asking for my password, so it can go and do stuff like turn on FileVault, and other things that require root, unfortunately.
4:03 So, it then goes and does a bunch of stuff, and then at the end, you might see something that looks a bit like this.
4:09 Your system is now strapped.
4:10 It doesn't take very long, it takes like maybe five or ten minutes the first time.
4:14 It's designed to rerun multiple times, but you shouldn't have to rerun multiple times.
4:19 So, it's basically just like a one-off, fire-and-forget thing, and then your machine is set up.
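The "designed to rerun multiple times" property can be sketched as a check-then-act loop. This is a minimal illustration in Python rather than Strap's actual bash; the step names and the `flag_step` helper are hypothetical stand-ins for things like disk encryption or Homebrew installation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    is_done: Callable[[Dict[str, bool]], bool]  # inspect the current machine state
    apply: Callable[[Dict[str, bool]], None]    # make the change


def run_steps(steps: List[Step], state: Dict[str, bool]) -> List[str]:
    """Apply only the steps that are not already satisfied; return their names."""
    applied = []
    for step in steps:
        if not step.is_done(state):
            step.apply(state)
            applied.append(step.name)
    return applied


def flag_step(name: str) -> Step:
    # Hypothetical step: "done" simply means a flag is set in the state dict.
    return Step(name, lambda s: s.get(name, False), lambda s: s.__setitem__(name, True))


STEPS = [flag_step("disk_encryption"), flag_step("git_credentials"), flag_step("homebrew")]

machine = {"git_credentials": True}  # a machine that is already partly set up
first = run_steps(STEPS, machine)    # only the missing steps run
second = run_steps(STEPS, machine)   # a rerun changes nothing
print(first, second)
```

Because each step checks before it acts, the script does not fall over when someone has already changed the machine underneath it, which is the failure mode described for Puppet above.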
4:23 So, what's happened here as well is a cool little integration with a thing I kind of helped work on called Homebrew Bundle.
4:30 And what Homebrew Bundle does is it uses, like, a Brewfile, which is kind of like a Gemfile.
4:35 Are you guys familiar with, like, Gemfiles or similar types of things?
4:38 Basically, it works specifying all your project dependencies.
4:41 It's like that, but for stuff in Homebrew, and specifying system dependencies.
4:45 So, I basically have that so I can have a bunch of stuff that I say when I set up a new machine,
4:50 like in my dotfiles repository, it knows to automatically go and download, like, this Brewfile,
4:57 if it's not there already, and install everything there.
5:00 So, it can install stuff from Homebrew, Homebrew Cask, and now the Mac App Store as well.
5:04 So, it's installed for me, like, Xcode and all these other things.
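To make the Brewfile idea concrete, here is a small sketch that groups the entries of a Brewfile by where they come from. The Brewfile contents are illustrative and are not the speaker's real file; the parser is a simplification that handles only one quoted name per line.

```python
import re
from collections import defaultdict

# Illustrative Brewfile contents (an assumption, not the file shown in the talk).
BREWFILE = '''
tap "homebrew/cask"
brew "git"
brew "mysql"
cask "textmate"
mas "Xcode", id: 497799835
'''

# Entry kinds: tap (extra repository), brew (Homebrew), cask (Homebrew Cask),
# mas (Mac App Store).
LINE = re.compile(r'^(tap|brew|cask|mas)\s+"([^"]+)"')


def group_entries(text):
    """Group Brewfile entries by kind."""
    groups = defaultdict(list)
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            kind, name = match.groups()
            groups[kind].append(name)
    return dict(groups)


entries = group_entries(BREWFILE)
print(entries)
```

In practice you would not parse the file yourself: running `brew bundle` in the directory containing the Brewfile installs whatever is listed.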
5:08 Right.
5:09 So, this is open source too, Strap, and you can go to, like, MikeMcQuaid/strap on GitHub,
5:15 and, like, access the code, and it's got a one-click, put-it-on-Heroku thing as well.
5:20 Any question?
5:21 So, from the previous slide?
5:23 Yeah.
5:24 Can you ask how iMV is useful for development?
5:27 It's not really.
5:29 It's useful for doing not development, because I try to do as much not development.
5:33 Yeah?
5:34 How is transmission useful?
5:35 For downloading Ubuntu ISOs.
5:36 That's the last time I used that screenshot again.
5:38 So, yeah.
5:39 This is my little Brewfile thing.
5:41 It's also in my dotfiles, if people are really interested in critiquing my software choices,
5:44 which obviously people are.
5:45 So, it's kind of a good way, and you can kind of dump and restore things from that.
5:50 So, it's kind of a nice way of being like, I want to be able to restore everything in my system in a way that's not like, kind of keenest or whatever.
5:57 What is the GitHub app called?
5:58 The GitHub app, that is the GitHub desktop app.
5:59 GitHub for Mac.
6:00 It's designed to be kind of like a novice UI for using Git.
6:03 But I find it useful for the one thing that I really like UIs for with Git, which is staging
6:19 individual lines to the index.
6:20 Anyway, moving on.
6:21 So, brew bundle.
6:22 This is the other thing you can do.
6:23 It's part of brew.
6:24 You run, like, brew bundle and it will go and install it all.
6:26 So, it's very exciting.
6:27 Right.
6:28 So, you've now got Strap on your machine.
6:29 This is very good.
6:48 So, now you want to get GitHub on your machine.
6:51 So, you can actually do some work.
6:53 So, we go make ourselves a little GitHub directory.
6:57 That's just a little convention
6:58 we use internally.
6:59 And then we cd into it.
7:00 And then we clone.
7:01 And then, obviously, that is going to go download
7:04 github/github.
7:05 That's where the bulk of the site is.
7:08 GitHub is mainly, not entirely, but mainly a monolithic Rails app.
7:12 See, you can make Rails scale.
7:14 So, this is it.
7:15 Kind of download it to the machine.
7:16 And then, when that's all done at the end.
7:17 Then, the next thing we have is, we've got a nice little principle on GitHub projects.
7:26 We try and keep the setup for any given project,
7:31 and similarly what CI looks like, what tests look like, whatever,
7:34 effectively consistent between every project.
7:36 So, what we have is, we have the script subdirectory.
7:38 And in the script subdirectory, you have, in pretty much every project, a thing called
7:43 bootstrap.
7:44 So, what that does is basically say, okay, I have a machine which assumes nothing except
7:49 you have Strap installed.
7:50 So, at this point, I do not have MySQL installed.
7:53 I do not have any Rubies installed.
7:54 I do not have Git installed.
7:55 I do not have any of these type of things.
7:57 So, this is at this point, sorry, I do have Git installed.
8:00 That's a lie.
8:01 I couldn't say it comes with a risk.
8:03 But, you basically don't have what you would consider system dependencies already installed.
8:08 And this bootstrap script will then go through the Brewfile and actually install that stuff
8:13 for you.
8:14 I think that's kind of cool because I always find when I go and use a new project, like I
8:18 hate having some wiki page or whatever.
8:20 It's like you need to manually install these hundred things, then run these hundred scripts.
8:23 Then you ask Bob what to do next.
8:25 But Bob left the company two years ago.
8:27 And yeah, you probably just got to go to work for a week.
8:30 So.
8:31 Just say compiling everything.
8:32 Alright, yeah.
8:33 So, this bootstrap, it may compile things.
8:36 Like, basically, it's a really high-level abstraction.
8:39 So, it can do more or less whatever it needs to do.
8:43 So, sometimes it will pull binaries off the internet.
8:45 Sometimes it will compile things.
8:47 Sometimes it will do other things.
8:49 Like, basically, its job is that at the end, you should be able to run another script,
8:54 which I'm not actually going to show you today.
8:55 But, like, you should be able to run, if it's like a web app, script/server afterwards.
8:59 And script/server will just work,
9:01 and will, like, spin up a web server on the machine so you can run it there.
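The bootstrap convention described above can be sketched as "look at which dependency manifests exist, then run the matching installer". This is an assumption-laden illustration in Python, not GitHub's actual script (which the talk says is bash); the `plan_bootstrap` name and the choice of manifests are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def plan_bootstrap(project_dir: Path) -> List[List[str]]:
    """Return the commands a bootstrap script might run, based on which files exist."""
    plan = []
    if (project_dir / "Brewfile").exists():
        plan.append(["brew", "bundle"])     # system dependencies via Homebrew Bundle
    if (project_dir / "Gemfile").exists():
        plan.append(["bundle", "install"])  # Ruby gems via Bundler
    if (project_dir / "package.json").exists():
        plan.append(["npm", "install"])     # JavaScript packages via npm
    return plan


# Demonstrate against a throwaway project directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Brewfile").write_text('brew "mysql"\n')
    (root / "Gemfile").write_text('source "https://rubygems.org"\n')
    demo_plan = plan_bootstrap(root)

print(demo_plan)
```

A real bootstrap script would execute each command (for example with `subprocess.run(cmd, check=True)`) and stop at the first failure, so that a later `script/server` can assume everything is in place.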
9:06 Right, so what it's going through, it's doing stuff like this.
9:09 It's installed some gems.
9:10 It's put some files in some places.
9:13 It's downloaded GitHub's own internal fork of Git, because we do things with Git.
9:18 And then, as you said, it's compiled it.
9:20 You've got all these things.
9:21 It's going away, doing NPM and all this type of thing.
9:23 And then in the end, eventually it's done.
9:25 It's also nicely set up our kind of local database and stuff like that for us as well.
9:30 And then we're all good to go.
9:32 We're done.
9:33 We have our like special GitHub Ruby and the bootstrap is finished.
9:37 So, now we can actually start doing some work.
9:40 So, if you're interested in the way this stuff is done, we have a repository for this called github/scripts-to-rule-them-all.
9:47 Which is sort of example, like, template scripts.
9:49 These template scripts look almost identical to what they look like in some of our more simple projects.
9:56 The github/github bootstrap script has to do, like, loads of stuff because it's setting up a few things.
10:02 Like I said, it's mostly a monolithic Rails app.
10:04 It also sets up a few of the other projects that it needs.
10:07 But like for some of our smaller projects, some of the kind of microservices we run, like these bootstrap scripts are like 10 lines of bash.
10:13 And so, they can be really, really nice and simple.
10:18 And you can go to see like bootstrap script there.
10:22 You can see it's kind of referring to Brewfiles for Darwin, which is the kernel name on OS X.
10:27 And so then, after that, it's time for me to write code.
10:30 And because I am relatively old, I still use TextMate.
10:34 I think I'm the only person still using TextMate.
10:36 All the cool kids now use Sublime and Atom and things like that.
10:39 And please don't tell my co-workers I'm not using Atom, because I'll get in trouble.
10:42 But yeah, so I pull up my editor.
10:47 And then I don't show you any code because I decided to be boring.
10:50 And then I do some stuff.
10:52 And then after I've done some stuff, I run git diff.
10:54 And it says, in fact, I have done stuff.
10:57 I have changed four lines in this file.
10:59 So the fun thing with this today is, partly due to my lack of preparation and partly due to, like, trying to make this a real example,
11:07 Like this is actually, from now on, this is actually code that I ran and deployed to production today.
11:11 And the stuff you're seeing actually happened.
11:15 I should have some disclaimer or something like that.
11:18 Anyway, so next thing obviously, we're dealing with Git.
11:22 So we go and create ourselves a nice little branch.
11:25 Git's telling us, hey, this file's been modified.
11:28 It's still not committed.
11:29 So you might want to do something about that.
11:31 So I did do something about that.
11:33 I commit it.
11:34 Write a nice commit message.
11:36 Basically, this is what I was doing this morning.
11:38 The boring little story is we're using things called serialized attributes.
11:45 This is what I was doing today, which is like a way of putting a JSON blob in a database column.
11:49 So like you don't have to do migrations.
11:51 Turns out that's a bad idea in the long term.
11:54 So we've been killing these.
11:56 And one of the things you have to do when you're removing columns, sometimes in Rails, is we have this little thing called attribute ignore, which stops Rails sort of exploding when you want to remove columns underneath it that it thinks should exist and stuff like that.
12:07 So this is what this does and this is what I'm talking about here.
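For anyone unfamiliar with the serialized-attribute pattern, here is a minimal sketch of the idea using Python's built-in `sqlite3` rather than Rails. The table and column names are hypothetical. It shows why the pattern is tempting: a new "field" needs no schema migration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One text column holds arbitrary settings as JSON, so adding a new
# setting needs no ALTER TABLE.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, settings TEXT)")
conn.execute(
    "INSERT INTO users (id, settings) VALUES (?, ?)",
    (1, json.dumps({"theme": "dark"})),
)


def read_settings(user_id):
    """Load and decode the JSON blob for one user."""
    row = conn.execute("SELECT settings FROM users WHERE id = ?", (user_id,)).fetchone()
    return json.loads(row[0])


def write_setting(user_id, key, value):
    """Read-modify-write the whole blob to change a single key."""
    settings = read_settings(user_id)
    settings[key] = value
    conn.execute(
        "UPDATE users SET settings = ? WHERE id = ?",
        (json.dumps(settings), user_id),
    )


write_setting(1, "newsletter", True)  # a "new column" without a migration
settings = read_settings(1)
print(settings)
```

The talk does not say why this turned out to be a bad idea long term. One commonly cited cost, visible even in this sketch, is that every change rewrites the whole blob and the database cannot easily index or constrain the individual keys.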
12:11 So I wrote my commit message.
12:13 I then commit.
12:14 I now have a commit on my machine.
12:16 Very exciting.
12:17 And my branch and stuff.
12:18 So now I'm going to go push this onto GitHub, create a new branch, go on to GitHub itself.
12:26 And then, if you use GitHub or not, I don't know, you get this nice little thing at the bottom that says, hey, you might want to create a pull request from this new branch you've got there, buddy.
12:35 So I'm going to now open a pull request.
12:38 And this is going to get interesting again soon.
12:41 I realize there's a point in the middle where we're doing because everyone's used to GitHub at this point where it's like, yes, we all know this, we all do this a hundred times every day.
12:49 Why are you showing us the following bit?
12:50 But it's going to get good.
12:52 Don't worry.
12:53 Right.
12:54 So this is another convention we tend to use.
12:57 Like, we make use of kind of teams and stuff on GitHub and individuals.
13:00 So I'm basically going and CCing a few of my coworkers to ask them for some review.
13:05 We generally try, unless something's like everything is on fire and it's falling down, to get at least one person to review every single change we make to, like, any project at the company.
13:15 Because humans are terrible and it's a good idea to verify that they're not being terrible with two people rather than one.
13:23 So we also like to make gratuitous use of emoji instead of words because pictures are better or something.
13:30 So in this case I've CC'd the person who was working on this thing before I was and the team that I work on, so they can kind of check that I'm actually still occasionally thinking.
13:42 Right.
13:43 So we then have, after we push up this branch, like, a lot of CI jobs that run.
13:48 As you can see, this project has relatively few, in that we only have 15 separate CI jobs.
13:52 There's some that have like 30.
13:54 Fun fact, I don't know if this is still the case today, but last time I checked GitHub had more CI machines than any other type of machine in the company.
14:03 Because, like, waiting for builds is really boring.
14:07 So you want to be able to go and push stuff up and have all your builds run in a very short period of time.
14:12 And when you have huge test suites, it's also nice to not have to do that on your local machine.
14:18 You can still run the tests on your local machine, but it takes forever, so that's a job for a multi-core server instead.
14:23 So, we have all these different things.
14:26 We have this GitHub Enterprise code that I used to work on.
14:28 So it's, like, checking that we've not broken that, and GitHub, and then GitHub with, like, various flags enabled and disabled and whatever.
14:35 It's running through the test suite in various different ways.
14:37 So then after that, this is when it starts to get a little more interesting.
14:41 So now I've decided that my code is, like, yeah, it's good enough that I can go and start planning at least to deploy to production.
14:50 So we have, this is in Slack.
14:53 We have this Slack bot that was previously a Campfire bot called Hubot.
14:58 And you ask Hubot to do things and he does things for you and that's cool.
15:02 So here, on a lot of projects, what I can do is just say, like, deploy to production.
15:08 But because I'm working on the main GitHub app, like it turns out that's actually like pretty busy and pretty congested.
15:14 Like even thankfully I'm in UK hours, so like it's not that congested for me.
15:20 But when I went to deploy this morning, there were like three other people already either waiting to or currently kind of deploying.
15:26 So basically I do this to be polite and kind of get in the queue.
15:30 So it says there's one other person ahead of me right now.
15:33 So the queue for GitHub: there's currently one guy there who's testing in production.
15:38 I.e. his code is what's on the website right now.
15:41 And then there's Miguel and me who are currently kind of waiting to deploy our code.
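The deploy queue behaves like a simple first-in, first-out line where whoever is at the front holds the production slot. A minimal sketch, assuming a single queue per app; the `DeployQueue` class and the names in the usage are hypothetical, not Hubot's real implementation.

```python
from collections import deque


class DeployQueue:
    """FIFO queue: the person at the front currently holds the deploy slot."""

    def __init__(self):
        self._people = deque()

    def join(self, person):
        """Add a person if not already queued; return how many people are ahead."""
        if person not in self._people:
            self._people.append(person)
        return self._people.index(person)

    def current(self):
        """Return whoever holds the slot, or None if the queue is empty."""
        return self._people[0] if self._people else None

    def done(self, person):
        """Called when the current deployer merges or backs out; returns who is next."""
        if self.current() != person:
            raise ValueError(f"{person} is not at the front of the queue")
        self._people.popleft()
        return self.current()


queue = DeployQueue()
queue.join("alice")          # currently testing in production
queue.join("miguel")
ahead = queue.join("me")     # two people ahead of me
next_up = queue.done("alice")
print(ahead, next_up)
```

Joining twice is harmless here, which matches the "be polite and get in the queue" usage: asking again just tells you your position.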
15:45 So Miguel shows up in my PR because he's on my team.
15:50 And says like, looks good to me.
15:53 Love a PR when it contains deleted code.
15:55 I also agree with that sentiment.
15:57 So at this stage as well, all the CI jobs have passed.
16:01 But it's whining at the bottom about it's out of date with the base branch.
16:07 So basically what we'll talk about that more a little bit in a second.
16:11 But that's like a thing you can enable now to basically just say like, hey,
16:15 you might want to merge master into your branch.
16:17 Well, I'm not going to do that right this second because like master is changing all the time.
16:20 If there's two people ahead of me, I'll talk about this in a second.
16:24 But with the way we do deploys, there's two people ahead of me.
16:27 That means master is going to change probably once or twice more before I come to do my deploy.
16:32 So let's push on with this, right.
16:35 So like I said, that's good to go.
16:38 So nice timing here is that it's now time for me to deploy.
16:42 I'll just note as well on the previous one.
16:44 I had merged master in once before just to make sure that things weren't too weird.
16:49 And I made another change in that PR before Miguel came along and said that.
16:54 So Hubot is now saying I've got to the front of the deployment queue.
16:58 So it's now time for me to deploy.
17:00 So what I do now is, this is how you deploy at GitHub.
17:05 And this is one of the things I think is the best about the company.
17:09 And it's one of the things that kind of terrifies people in that you don't need to speak to anyone.
17:13 You don't need to ask anyone permission, whatever.
17:15 Like in your first week, you will be encouraged to do this and deploy actual code to production.
17:21 Every engineer in the company, regardless of whether they work on this app or that or whatever,
17:25 they can deploy code to any app at any time, more or less.
17:30 And there's times, obviously, when we're getting attacked or whatever, that people say,
17:34 okay, let's lock this down to kind of gently discourage people from doing this right now.
17:38 But even then that's kind of more social contract than like an actual one.
17:42 So our way of doing deployment is generally like we put a lot of trust on people.
17:48 So anyone can deploy.
17:50 But if you did deploy and you break everything, we're probably going to notice.
17:54 And we're going to notice that like it was you.
17:56 And we're going to probably ask questions as to like what happened and why that happened or whatever.
18:01 So, in this case, I'm saying deploy GitHub, which is the app I'm working on right now.
18:07 Then my branch name, which is what I pushed up and used to create the PR.
18:10 Again, our deployments always tend to be around, like, deploying pull requests.
18:14 We don't kind of push everything to master and then deploy that.
18:18 We deploy our branches to production.
18:21 So, I'm saying to production slash canary.
18:24 This is saying a subset of our production hosts, the canary in the coal mine.
18:28 So, if things start exploding, then that's a few hosts exploding and not all of them.
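A canary target can be thought of as a small, stable slice of the production fleet. The sketch below picks that slice; the 10% fraction, the host names, and the `canary_hosts` helper are all assumptions for illustration, since the talk does not say how GitHub chooses its canary hosts.

```python
def canary_hosts(hosts, fraction=0.1):
    """Pick a small, stable subset of hosts to receive a deploy first."""
    count = max(1, int(len(hosts) * fraction))  # always at least one canary
    return sorted(hosts)[:count]                # sorted so the choice is repeatable


fleet = [f"fe{i:02d}" for i in range(1, 21)]    # hypothetical front-end hosts fe01..fe20
canaries = canary_hosts(fleet)
print(canaries)
```

Keeping the subset stable between deploys means the same dashboards and hosts are watched each time, which makes "is this spike new?" easier to answer.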
18:34 So, then Hubot has just nicely merged master into my branch here and then like push that to the branch.
18:41 And then when the CI finishes, Hubot will then deploy.
18:44 Now, the reason why we merge master in is because you want to make sure you include whatever the last person did on their branch.
18:50 In particular, they could have been pushing some really, really important fix.
18:55 They commit that to master.
18:56 I haven't been paying attention.
18:57 I haven't noticed that they've done that.
18:59 If I deployed then, and master wasn't merged into my branch, I would just undo their fix when I deployed my branch to production.
19:06 And that would be bad.
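The reasoning for merging master first can be shown with a toy model in which a branch is just the set of commits it contains. This is a deliberate simplification (real merges can conflict), and the commit names are made up.

```python
def deployed_commits(branch, master, merge_master_first):
    """Return the set of commits that would be live after deploying `branch`."""
    return branch | master if merge_master_first else set(branch)


master = {"base", "urgent-fix"}    # someone merged an important fix
my_branch = {"base", "my-change"}  # branched before the fix landed

stale = deployed_commits(my_branch, master, merge_master_first=False)
fresh = deployed_commits(my_branch, master, merge_master_first=True)
print("urgent-fix" in stale, "urgent-fix" in fresh)
```

Deploying the stale branch puts production on code that lacks the fix, which is exactly the "I would just undo their fix" case; merging master first keeps both changes live.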
19:08 So, at this point, it says we're deploying to all these machines, which is the subset that canary covers.
19:17 And then it has some nice little gentle things at the bottom saying, like, check out this stuff to make sure you haven't broken anything.
19:24 The other nice thing with, you know, a convention example, which I'm not going to show you today.
19:28 If you skip the canary deployment, Hubot is just really passive aggressive.
19:32 So, instead of blocking you, it just says, Hubot says, skipping the canary deployment, eh?
19:39 A bold strategy.
19:40 And lets you do it anyway.
19:42 So, yeah, so then it has deployed.
19:46 It takes like, in this case, like it took 34 seconds to deploy that to the subset of machines.
19:51 And then, it's now on me to make sure that like deployment has not broken stuff.
19:57 So, it mentioned this deployment confidence dashboard and Haystack.
20:01 Haystack is like our internal New Relic-y thing.
20:04 We're not going to show you that because it's boring.
20:07 But I will show you the deployment dashboard because it's exciting.
20:09 Guess which of the two I built.
20:11 So, this is this dashboard here.
20:15 There's a bunch of graphs that, unfortunately, if I tried to put them all on the screen, would be incomprehensible.
20:20 So, instead, here's one graph that's incomprehensible instead.
20:25 Over here, these blue lines correspond to like when people have deployed stuff.
20:29 So, that last one at the end, that's me having deployed something there.
20:33 And then, these different like colors of lines at the bottom refer to like different types of exceptions, basically.
20:40 At GitHub, we're being constantly attacked.
20:43 We're being constantly like having people like abusing our API and all sorts of things.
20:47 So, we do have like weird exception spikes and things like that.
20:50 But what I'm looking for here when I go to the deployment confidence dashboard is basically the number of exceptions hasn't gone like that when I deploy.
20:59 If it's, as you have here, like we've got slow queries, like if they are kind of a bit spiky and stuff like that before, then that's okay.
21:06 I know that if that happens during my deployment, I'm okay because this isn't something weird that's happened.
21:12 Due to me, this is just kind of background.
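The judgement being made on the dashboard, "did exceptions jump because of me, or was it already spiky?", can be approximated by comparing the post-deploy rate with the pre-deploy baseline. This is a rough sketch; the three-sigma threshold and the sample numbers are assumptions, and in the talk the check is done by eye rather than by code.

```python
from statistics import mean, pstdev


def looks_like_my_fault(before, after, sigmas=3.0):
    """Compare exceptions per minute after a deploy with the baseline before it."""
    baseline = mean(before)
    spread = pstdev(before)
    # Floor the spread at 1.0 so a perfectly flat baseline still allows small wobbles.
    threshold = baseline + sigmas * max(spread, 1.0)
    return max(after) > threshold


noisy_before = [40, 55, 38, 60, 42, 58]  # background that was already a bit spiky
calm_after = [45, 57, 41]                # more of the same after the deploy
spiking_after = [50, 140, 300]           # the graph "going like that"

calm = looks_like_my_fault(noisy_before, calm_after)
spiking = looks_like_my_fault(noisy_before, spiking_after)
print(calm, spiking)
```

The point of using the spread of the earlier data is the one made above: if the graph was already noisy before the deploy, the same noise afterwards is not evidence that the deploy broke anything.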
21:15 So, I can't quite see you man.
21:18 What's the timing axis like right now?
21:21 So, we have timing here.
21:23 You can actually adjust it at the top of the page.
21:25 We've got that cropped off.
21:26 But this has gone from 1:48 in the morning San Francisco time until 2:16 at the end.
21:33 But at the top of the page, as I say, that's cropped off, you can adjust the time axis to be anywhere from, like, 15 minutes to 30 days to kind of look at patterns.
21:42 Obviously, that's when you're interested in how much is, like, me versus the kind of general background noise.
21:48 And obviously, like the way our traffic behaves like varies from day to day during the week and stuff as well.
21:54 So, it's sometimes useful to like look at a seven day axis and whatever.
21:58 And then here on the left hand side, we have like the number of exceptions that are going on per minute.
22:04 So, after my deploy has been in canary for long enough, and I've not been paged and none of my coworkers have been paged by me being bad,
22:15 Hubot then goes and says, okay, you've been in production long enough, for the canary at least.
22:21 Probably alright to deploy to production properly.
22:24 So I then run this command below, and that's going to go and deploy to every production machine that we have.
22:31 How is the bot determining an adequate time in production for the canary?
22:36 So, it's doing it based on basically just the amount of time.
22:39 Since the deployment finished, it's basically waiting some amount of time in the background
22:44 and then pinging you.
22:45 I think it's also triggered such that if no one has been paged due to, like, an incident happening,
22:53 then that's decided to be an okay metric.
22:57 You've not broken anything badly enough for people to be woken up in the night.
23:01 Therefore, you're probably safe to go through.
23:03 But at this point anyway, you generally will still be like checking stuff and making sure that you definitely haven't broken stuff.
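The rule described in that answer, enough time elapsed and nobody paged, is simple enough to sketch. The ten-minute soak period and the function name are assumptions; the talk gives no actual duration.

```python
from datetime import datetime, timedelta


def canary_ready(deployed_at, now, pages_since_deploy, min_soak=timedelta(minutes=10)):
    """True when the canary has soaked long enough and nobody has been paged."""
    soaked = (now - deployed_at) >= min_soak
    return soaked and pages_since_deploy == 0


deployed = datetime(2016, 6, 1, 10, 0)
too_soon = canary_ready(deployed, deployed + timedelta(minutes=3), pages_since_deploy=0)
paged = canary_ready(deployed, deployed + timedelta(minutes=15), pages_since_deploy=1)
ready = canary_ready(deployed, deployed + timedelta(minutes=15), pages_since_deploy=0)
print(too_soon, paged, ready)
```

As the speaker notes, this is only a prompt: passing the check means "probably safe", and the deployer is still expected to look at the dashboards before going to full production.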
23:10 So, after that, it's going to deploy through.
23:15 Again, it takes 148 seconds to deploy to all our production machines.
23:18 And then I get this again, just a little reminder to go and check that I haven't broken things.
23:22 Because, yeah, we do this sometimes.
23:24 Fun fact, in very early GitHub days, some of you may remember this.
23:29 This stuff used to also include checking Twitter to see if anyone was currently complaining that GitHub was broken.
23:36 So, yeah, so then I'm going to go back to the Deployment Confidence dashboard.
23:43 I now have my little blue line from 2:14 back there and my other little blue line from my new deployment there.
23:48 It's starting to kind of creep up a little bit.
23:50 So that's a little bit concerning, but I'm not going to be too concerned about that.
23:54 And then in the PR itself, and again, you can do all this through the API.
23:59 This is all like fully supported stuff.
24:01 It nicely tells me, okay, I've done these deployments here.
24:04 I deployed to canary and then I deployed to production.
24:06 So now in the pull request itself, it's showing people, when they come along to this pull request,
24:12 that I have been a good boy.
24:14 I went to canary first.
24:16 I've not just merged this to master without testing it.
24:19 That would also be bad.
24:20 So then after this, I'm relatively confident this code is working.
24:24 It's fine.
24:25 I then merge this pull request.
24:27 So then what that's going to do is tell the next person in the deployment queue that they can go off and deploy stuff themselves.
24:33 It's worth maybe mentioning a little bit because I've heard a lot of different ways that people do deployment.
24:38 And I've kind of worked at a few companies that do deployment in different ways.
24:42 And I'm of the humble opinion that this is the best, and possibly the only decent, way of deploying web apps and services at scale.
24:51 And the reason why is because the thing that quite a lot of companies do is kind of this idea of maybe deploy whatever,
24:59 four times a day, ten times a day, whatever, or in big organizations, four times a month, four times a decade, whatever.
25:07 The problem with that is if you have a bunch of people working on stuff and one of them breaks something in that time period,
25:14 how do you determine who broke what and when?
25:18 How do you know who broke it?
25:19 How do you know what they were doing, and whether there's maybe stuff in that production deploy that's more important than the breakage, or whatever?
25:33 And the thing I really like about this model, at least, is it's a way of basically pushing the responsibility back to the developer. We have an infrastructure team and they kind of keep an eye on this and make sure our systems are running properly,
25:46 but whether I break something in production is down to the individual developer. All of our engineers, and most of our designers and product people as well, if they're making changes to the site, will be the ones who deploy it to production, and they will be the ones who monitor it.
26:04 They will be the ones who choose whether or not to back it out if the stuff's broken, or whether to merge the pull request and move things forward if things are working.
26:23 And for me at least, that's a nice way of balancing the responsibility aspect with being able to identify what problems are early on aspect.
26:33 So, after all this has happened, I'm then sporadically going to go and check the Deployment Confidence dashboard again, because even though it's merged and stuff like that, and someone else's stuff is going out, I still want to make sure that I haven't broken anything, because I'm responsible for this as well.
26:50 And that's basically it.
26:52 So, you've seen through this flow of like what I did for example today to actually like push a change to production.
26:59 And this is something that probably almost every engineer in GitHub is doing like probably zero to three times a day.
27:09 And so, I can't remember the, I should have got the number off in my head, but the number of deploys we have in a given day is probably like, we're probably deploying every 10, 20 minutes, pretty much 24 hours a day, like all the way through.
27:24 Because we have people in time zones all around the world.
27:27 The cool thing is, is that all this stuff can be done in such a way that we have a smart infrastructure team that means that stuff can be done in such a way that it doesn't disrupt our site.
27:35 And that we can continue to work and iterate in very small chunks, constantly, rather than having like big, big ships where, as I mentioned before, all the kind of risks that come with that approach.
27:46 So, we've run through these things.
27:48 So, first we've reached out to our back.
27:50 That's the default machine you have at GitHub.
27:53 We've reached out to GitHub.
27:54 We've reached out to GitHub.
27:55 We've wrote some code.
27:56 We've committed it.
27:57 We created a pull request.
27:58 We then deployed that to production.
28:00 We then made sure that we didn't break stuff.
28:02 And then we launched it.
28:04 So, there's some of the open source projects I mentioned there.
28:07 And if there's any other questions, please.
28:10 Yeah?
28:11 So, what was the system before this Hubot was used?
28:20 What was the system that Hubot used?
28:22 Hubot is also open source.
28:23 It's Node.js based, I think.
28:26 It has various like back ends that can plug into like Campfire, HipChat, Slack, like IRC, Jabber, all these other ones.
28:36 The way deployments actually happen, that's like an internal app.
28:39 But, like, it's not really that hard, and we provide all this stuff through the Deployments API.
28:43 Basically, all you would need to implement to do that is, however you currently do your deployment process,
28:49 have Hubot make an HTTP call to whatever your current thing is saying,
28:54 "Hey, I want you to deploy this branch here."
28:57 And as long as you can do that, and make an outgoing HTTP call back to GitHub saying,
29:03 "This is what was deployed here and when," then that's it.
29:08 So, the deployment thing itself is not that hard, provided you're not like manually FTPing machines.
29:13 Sorry, manually FTPing files to your machines, which some people probably still are.
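A minimal sketch of the two HTTP calls described in this answer, in Python. The URL paths follow GitHub's public Deployments REST API (creating a deployment for a ref, then posting a status for it); the repository name, token, and function names are hypothetical, and this is not the internal deploy app mentioned in the talk. The requests are only built here, not sent.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def _post(url, body, token):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_deployment_request(repo, ref, environment, token):
    """'Hey, I want you to deploy this branch here.'"""
    body = {"ref": ref, "environment": environment}
    return _post(f"{API_ROOT}/repos/{repo}/deployments", body, token)


def build_status_request(repo, deployment_id, state, token):
    """'This is what was deployed here': report the outcome back to GitHub."""
    body = {"state": state}  # e.g. "in_progress", "success", "failure"
    url = f"{API_ROOT}/repos/{repo}/deployments/{deployment_id}/statuses"
    return _post(url, body, token)
```

Passing either request to `urllib.request.urlopen` would send it; a chat bot or deploy script would make the first call when a deploy is requested and the second when it finishes.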
29:18 Yeah.
29:19 So, can you drill down that dashboard like into what the actual exceptions are?
29:23 So, you say you know it's like a...
29:24 So, that's what Haystack is doing and that's why I kind of didn't dig in there,
29:28 because there's a certain amount of background noise and it wouldn't be that interesting,
29:31 and it maybe might not, like, make me or GitHub look as good as we should.
29:36 But, yeah, basically, you can, if there was a big spike like that, yeah, your next step would be to jump into Haystack,
29:42 see what the exceptions are that are spiking.
29:45 Some examples in the last few weeks, like, there are two typical things that happen. The good one, in some ways, is where the thing that spikes is, you know,
29:54 something that's bad in your code, and it's a new exception, you haven't seen it before, and it's spiking because, like,
29:59 there's a bunch of people that hit this error.
30:01 That's, that's the better one.
30:03 The bad ones are when the kind of background ones look like they've all increased.
30:07 So, you've somehow, like, destroyed performance very consistently across your entire site, which means you're probably doing something hideous to the,
30:14 you know, something hideous to the backend or to a database or whatever, and then those are the ones that are a bit harder to debug,
30:22 and those are the ones where, again, in probably both of those cases, you would then immediately back out your branch,
30:27 and with the speed of deployment, that means, you know, if you're paying attention,
30:31 it probably shouldn't affect people for more than, you know, 10 or 20 seconds at worst.
30:37 Yep.
30:38 I think I must have missed it in your talk on the slides, but what is the time scale between going to Canary and production and then...
30:44 So, I think it's 15 minutes, or 10 or 15 minutes from Canary to, like, doing a full production deploy,
30:54 and then I think afterwards, we recommend, like, 15 to 30 minutes where you're kind of monitoring production.
31:01 But again, that varies very much on the change you're doing.
31:04 Like, in this case, like, I was basically 100% certain that this wasn't going to cause any adverse effects.
31:10 I tested it locally, and I know that, like, because I've done the same thing on another table before,
31:15 I know that, like, the way it's been handling this in Rails, that I'm removing dead code, effectively, at this point.
31:21 If I was pushing something much more dramatic, I would probably spend a lot more time monitoring it,
31:27 and then that's afterwards when you would spend time monitoring other people's deploys,
31:31 because your code's still in production, just to make sure that, like, you have a few errors or whatever.
31:36 Sorry, you basically block other people's deploys until you merge it, is that right?
31:43 Yeah, so, yeah, the flow is, I've got a branch, I deployed a branch, the contents of that branch
31:49 is now what is deployed on the production servers.
31:51 No one else can do any deployments until I'm then done testing my branch.
31:55 The result of the testing of the branch will be one of two things.
31:58 Either I decide that my branch is not ready for production, in which case I back out,
32:02 I basically say I'm done, and then the next person goes to the front of the queue,
32:05 or I merge my branch, like, merge my pull request even, and then, at that point,
32:11 it then also says the next person goes to the front of the queue.
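The queue behaviour described in this answer can be sketched roughly as follows. This is an illustrative Python model only; the class name, method names, and outcome strings are hypothetical and not taken from GitHub's internal tooling.

```python
from collections import deque


class DeployQueue:
    """One person holds production at a time; everyone else waits in order."""

    OUTCOMES = ("merged", "backed_out")

    def __init__(self):
        self.current = None       # who currently has their branch in production
        self._waiting = deque()   # people queued up behind them

    def join(self, person):
        """Take the deploy slot if it is free, otherwise wait in line."""
        if self.current is None:
            self.current = person
        else:
            self._waiting.append(person)

    def finish(self, person, outcome):
        """Release the slot after merging or backing out; return who is next."""
        if person != self.current:
            raise ValueError(f"{person} is not the current deployer")
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.current = self._waiting.popleft() if self._waiting else None
        return self.current
```

Either outcome releases the slot, which matches the point made above: whether the branch is merged or backed out, the next person goes to the front of the queue.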
32:16 Yeah? Sorry, what was your...
32:20 Yeah, so, we've been investigating that. I think the tricky thing with us is doing that with the way everything is set up at the moment,
32:38 and the way we handle stuff like database migrations, the way we handle assets,
32:44 stuff like that makes it kind of tricky for us to do that at the moment.
32:47 I think that's definitely kind of a goal, eventually, that we should be able to just,
32:52 as you say, roll things out a lot more fluidly like that.
32:55 In fact, it looks a small bit longer than that.
32:58 Yeah, yeah. So, I guess that's what the Canary is kind of doing a bit of that,
33:03 like a light version of... effectively, when we do the Canary, we roll out to, like, one of each type of worker,
33:11 so we, like, have, like, a front-end, the back-end, like, a background job processor, whatever.
33:16 So, one of each of them, like, gets this code, and then we, like, see what happens.
33:20 But, yeah, right. If that was done automatically, and it automatically, like, backed out and stuff, that would be cool.
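As a rough illustration of "one of each type of worker", the canary set could be picked like this. This is a Python sketch under assumptions: the host names and role labels are made up, and the talk does not say how hosts are actually selected.

```python
def pick_canaries(hosts):
    """Return one host per role: the first host seen for each role.

    `hosts` is an iterable of (hostname, role) pairs.
    """
    canaries = {}
    for hostname, role in hosts:
        # setdefault keeps the first host for a role and ignores later ones.
        canaries.setdefault(role, hostname)
    return canaries


# Hypothetical fleet: two front ends, two back ends, one job processor.
fleet = [
    ("fe1", "frontend"),
    ("fe2", "frontend"),
    ("be1", "backend"),
    ("be2", "backend"),
    ("job1", "background-jobs"),
]
canary_hosts = pick_canaries(fleet)
```

The new code would go to the hosts in `canary_hosts` first, and to the rest of the fleet only once the canary period has passed without problems.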
33:27 James?
33:28 So, how, so, you talked about doing small pieces, but how do you handle the bigger pieces where everything changes at once?
33:35 So, like, last year, there was a big UI change, all the tabs and things, so how was that handled?
33:39 Very good question. So, those changes are actually often not even a pull request.
33:44 So, what we do is we have, this could be a talk by itself, we do things called, like, staff shipping and dark shipping.
33:54 So, GitHub staff will have, like, a special, like, flag on their profile, and part of that means that we get, like, opted in or out of certain features.
34:02 So, I had to remember for my screenshots today to go and, like, turn off the staff mode, because everything looks, like, different,
34:08 because we're trying and experimenting with all these new features.
34:11 Cool thing about that is that means that that code is all in production right now.
34:15 Like, it's sitting there, it's live, it's just, more or less... it's not an if statement, we have, like, a nicer way of doing things than that.
34:23 But, you know, it's effectively an if statement of are you staff or are you in this team?
34:27 If not, give you the old code, if so, give you the new code.
34:30 It's nice because it encourages you to write code and work with both parts at once.
34:34 But it also means when you come to ship it to production, it's been tested for quite a while,
34:38 and the actual flipping the switch is just moving that feature from, you know, from just staff to everyone.
34:46 But also with those things as well, what we tend to do is you can, that's one way you can do the gradual rollout.
34:51 You can say, give this to 1% of users, 10%, 25%, 50%, 100%.
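A minimal sketch of the staff gate and percentage rollout just described, in Python. The function names, the `staff` and `id` fields, and the hashing scheme are assumptions for illustration; the speaker notes that the real mechanism is nicer than a plain if statement.

```python
import hashlib


def bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def enabled(feature, user, percent):
    """Staff always get the new code; others get it if their bucket is below `percent`."""
    if user.get("staff"):
        return True
    return bucket(feature, user["id"]) < percent


def render(feature, user, percent, old_code, new_code):
    """Both code paths are live in production; the flag picks one per user."""
    return new_code() if enabled(feature, user, percent) else old_code()
```

Because each user's bucket is stable, raising `percent` from 1 to 10 to 25 and so on only ever adds users; nobody who already has the feature loses it partway through the rollout.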
34:56 People doing JavaScript and stuff, which I don't do, also do clever things where we, like, dark ship it.
35:01 So sometimes what we'll do is we'll actually, like, render the entire feature in the background of the page
35:06 and just not display it to make sure that we're not introducing, like, JavaScript errors and stuff like that.
35:11 So we can dark ship it such that that new feature is there hiding on the page.
35:16 You never see it, but that's our way of verifying that, like, you know, it may have rendering issues,
35:20 but it's not going to, like, bring down the site, for example.
35:22 And again, we can, like, roll out those dark ships gradually.
35:25 So by the time it actually comes to, like, roll out a big massive UI refresh, yeah,
35:29 it's just a matter of, like, clicking the button instead of having to merge a pull request.
35:34 Yeah.
35:36 What are the, sort of, typical bones of contention between developers and the infrastructure teams?
35:42 I guess the usual ones.
35:44 So it was bad, like, about two years ago, there was a bad-ish situation and a bunch of people, myself included,
35:52 who had got, kind of, drafted into the infrastructure team.
35:54 And there was a, the usual, kind of, contention you tend to have with ops people and application people
36:00 where, yeah, the ops people felt that because they were the ones on pager rotations
36:05 and the application people weren't, the application people would just throw stuff over the wall
36:09 and if everything blew up, that's fine, I don't need to worry about that, I'm not getting paged.
36:14 And obviously responsible application people would be, like, being responsible.
36:18 And irresponsible application people would be waking people up at three in the morning.
36:21 Because this is the, I guess, the blessing and curse of a geographically distributed workforce
36:26 and one where we don't mandate people working in offices and stuff like that is that, you know,
36:30 I'm working away quite happily, you know, nine to five, and then that's, you know, three in the morning
36:35 for someone else.
36:36 So those kind of balances and then when people are getting woken up again and again and again and again
36:42 and again by people being irresponsible, then, like, there's tension.
36:46 But I think that's something that we definitely resolved and a big part of resolving that has been
36:52 part of the infrastructure team, like, making some of our systems more kind of self-healing.
36:56 And part of that has also been, like, the application engineers, a lot of them are now taking pagers.
37:01 And when I say pagers, it's not literal pagers anymore.
37:05 We use, like, PagerDuty and it's, like, integrated with the keyboard and stuff like that.
37:09 Such that if you break stuff, you will get paged sometimes as well.
37:13 And generally having, like, some kind of microservices now have their own little pager rotations.
37:17 I'm on call for this little microservice that I manage, like, 50% of the time.
37:20 But, again, this is the classic thing that gets sold with phones like this.
37:24 But, surprise, surprise, that microservice never ever goes down.
37:27 Because the first time they woke me up at 3:00 in the morning, I'm like,
37:29 this is never going to go down and wake me up at 3:00 in the morning ever again.
37:33 So that's the kind of typical tensions.
37:36 I would imagine people in here in ops or engineering would kind of relate to that general relationship.
37:42 And I think, to be honest, I think that is a relatively healthy tension in that, you know,
37:46 most companies, if, you know, the ops people might want to make sure that no one ships code ever again
37:52 because everything's nice and stable and working and no one's getting paged and it's fine and nothing's on fire.
37:56 And the application engineers want to ship things a million times a day because, you know,
38:00 if I break it, I can just push that fix. It's fine. Don't worry about it.
38:03 And that tension, I think, is healthy and it results in better software.
38:08 And at the same time, software that's kind of iterated on constantly.
38:15 I think we're all good. Thank you very much for the questions.

Mike McQuaid
