It’s not been a wonderful few weeks for technology. My home Internet connection has re-developed a condition which is commonly referred to in technical circles as “slow as arse”. After a nice morning at the beach, I managed to drop my phone when I got home, which resulted in that joyous moment of picking it up to see a smashed screen; much like toast, when dropped, an iPhone is guaranteed to land buttered side down. I mean screen side down.
And to top it off, my laptop blew up.
Blew up is actually a bit of an overstatement – the discrete GPU failed (a Radeon HD 6770M); the remainder of the computer continues to work. It’s a 2011 MacBook Pro, and suffers from a well-known failure mode. The obvious symptom of this is getting stuck at a grey screen after the boot progress bar gets to the end. The Internet will tell you that it’s absolutely definitely rumoured to be a failure due to removal of lead solder from the manufacturing process, and that the replacement solder didn’t deal with heat so well, eventually causing the chip to become partially desoldered.
The issue is that I’ve already had that failure, and had the board replaced (for free) in August. The original board lasted about 3.5 years (although it stopped being able to drive an external screen some time before failing completely). The new board failed 5 months into its life. So, does this mean the free replacement program just threw in spare boards that had the same defect? Did they have workforce issues and have to hire unemployed clowns to run the factory?
The reason, of course, is that the new boards have exactly the same (potential) issue. They’re not going to redesign and build these components for an out-of-production machine. According to the tech I spoke to, a few laptops he’s worked on have already failed at least twice under this scenario. Getting 4 years out of a computer without spending a dollar in repair costs is pretty decent, especially given the amount of use I get out of it – I generally consider a laptop to be good for about 3-4 years. If my computer died 14 months in, that would be a different story.
On the upside, after some messing about, the Lappy rides again. The replacement program runs some time into February (if you bought your computer more than 3 years ago), so if you’re still considering whether it’s worth getting your broken laptop fixed for free, you’re running out of time. Apparently they won’t tell you this, but each board comes with a 12-month warranty – not the 90-day warranty you get told about. Which makes sense; I can’t see how under consumer law in Australia a 90-day warranty is acceptable. Sure, this part was free in this case. But it was replaced under the auspices of the normal part replacement, which is usually a paid program for out-of-warranty computers.
So that’s great, but it doesn’t solve the problem of how to get on with things over the few days before the lappy could be fixed.
Whilst the discrete GPU is dead, the rest of the computer continues to work, including the integrated Intel GPU. Which is pretty wild really – I might have had a component become disconnected in my portable computer, but can continue to compute by firing up the spare graphics subsystem. At least, in theory.
When both GPUs are working, the OS seamlessly switches between them on demand. Most of the time, you’re not aware that the change is happening. Older versions of f.lux used to get tripped up and the colour temperature would switch as the system flicked between GPUs. Newer versions seem to have this fixed. And there’s an excellent utility, gfxCardStatus, which gives you visibility and some control over when this changes.
Whilst I can think of a few improvements to the logic of when to switch GPUs, this really is a marvel that has been largely ignored. When these dual-GPU machines came out, you had to log out to switch between chips – effectively restarting the graphics subsystem. Quietly, some release of OS X a few years ago had this re-engineered such that the output device could be swapped on the fly. I have no idea what the inside of this code looks like, but when you consider the amount of state flying around (think of what’s going on in the OpenGL/CL world in particular), this is a major achievement. In this age of operating system marketing, this unfortunately was mentioned (I think) only as a brief footnote – if at all. Sadly, most of the marketing space was dedicated to things like ZOMG New EMOTICONS!!1!11!! ⛵️⛵️⛵️⛵️⛵️⛵️
Unfortunately, if the discrete GPU is dead, the OS can’t leave it alone and move on. Evidently the main graphics init process, which comes pretty late in the boot order, involves probing and initialising both GPUs. Under most normal use scenarios the integrated GPU is then used. There doesn’t appear to be a way to tell the system just to ignore the discrete GPU. Very possibly there’s some kind of switching gizmometry going on between both chips that means you can’t just not initialise one of them. Or, just maybe the code wasn’t written with this scenario in mind, and always assumed both are working. The primary purpose of these dual-GPU setups isn’t redundancy – it’s performance vs battery life. Handling a hardware failure is certainly a nice-to-have, but unfortunately, it seems we don’t have.
After some Googling, there really doesn’t appear to be a clean way to force the discrete GPU off. There’s been some discussion of EFI variables, but this didn’t work for me – they appear to be hardware specific. Some kind of nvram / EFI setting would have made the process a lot smoother.
You can still boot the OS to text mode (i.e. single user) and it works fine; presumably this uses the integrated GPU, so long as you don’t have an external screen plugged in. So we still have a working computer and console access. There’s a process to boot a Mac in this state with working graphics, and that’s to disable the AMD drivers. Basically, if you move /System/Library/Extensions/AMD* to somewhere else, and reboot, you should get a kinda working graphics system. This involves the system booting twice – I’d say the first time involves rebuilding the kext cache if you’ve not done it explicitly.
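For the record, the file shuffling itself is nothing fancy. Here’s a minimal Python sketch of the idea, not the exact procedure I used: the function name, the dry-run flag and the backup location are all my own inventions for illustration, and in single user mode you’d more likely do the same thing by hand with mv. It defaults to a dry run so you can see what would be moved before committing.

```python
import shutil
from pathlib import Path


def move_matching(src_dir, backup_dir, pattern="AMD*", dry_run=True):
    """Move every entry in src_dir matching pattern into backup_dir.

    Returns the sorted list of entry names that were (or would be) moved.
    The function name and arguments are hypothetical, for illustration only.
    """
    src = Path(src_dir)
    backup = Path(backup_dir)
    matches = sorted(src.glob(pattern))
    if not dry_run:
        backup.mkdir(parents=True, exist_ok=True)
        for entry in matches:
            # Kexts are bundles (directories); shutil.move copes with
            # both plain files and directories.
            shutil.move(str(entry), str(backup / entry.name))
    return [entry.name for entry in matches]
```

Calling move_matching("/System/Library/Extensions", some_backup_dir) just lists the AMD entries; passing dry_run=False actually moves them, which needs root, a writable filesystem and SIP out of the way. Keeping them in a backup directory rather than deleting them means you can put everything back once the board is replaced.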
The problem is that I’m guessing most people who have had this issue have now had their motherboards replaced, or gotten on with life. In other words, most of the discussion took place before upgrading to El Capitan, and System Integrity Protection. It’s no longer trivial to boot to single user mode and move some files out from under /System.
So, no problem, I’ll just disable SIP. That’s easy enough. Normal process is to boot to the recovery partition, fire up a terminal, and one command turns it off. So, just reboot, hold down command-R and …. oh. Recovery mode uses graphics… oh.
Not that I could find it documented anywhere, but at least on my laptop, it’s possible to boot the recovery partition into single user mode (i.e. text). I took a punt and it worked – hold down Command-R-S on startup. The Apple support doc listing startup keys doesn’t explicitly state that you can or can’t combine them. Choosing which partition to boot from is orthogonal to how far through the startup process to go, so there’s no obvious reason why you can’t combine key combos when it makes sense.
You then should be able to disable SIP like normal (csrutil disable). This, I assume, will only work with the local recovery partition. Booting into recovery mode from the internet will bring down the OS version that was originally loaded on the computer when it was shipped. As of the time of writing, most laptops will be older than El Capitan, and won’t know anything about SIP.
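Once you’re back in a working system, it’s worth confirming that SIP really is off before poking at /System. The sketch below is one hedged way to do that from Python: it assumes csrutil status prints a line along the lines of “System Integrity Protection status: enabled.”, which is what I’d expect on El Capitan but isn’t guaranteed on other releases. Both function names are made up for this example.

```python
import subprocess


def sip_enabled(status_text):
    """Interpret the text printed by `csrutil status`.

    Returns True if SIP looks enabled, False if it looks disabled, and
    None if the text is not in the assumed format.
    """
    lowered = status_text.lower()
    if "status: enabled" in lowered:
        return True
    if "status: disabled" in lowered:
        return False
    return None


def read_sip_status():
    """Run `csrutil status` and return its output (OS X 10.11 or later)."""
    result = subprocess.run(
        ["csrutil", "status"], capture_output=True, text=True
    )
    return result.stdout
```

On the Mac itself, sip_enabled(read_sip_status()) should come back False after a successful csrutil disable and a reboot. Getting None back means the output didn’t match what I assumed, so go and read it yourself rather than trusting the script.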
After you’ve got SIP off, you can fall back to the known methods, some of which apparently include intentionally overheating your computer to force the discrete GPU off and falling back to the integrated one.
If your experience with Unix is ‘the internet told me to paste this into terminal’, best off avoiding these procedures. It is only moving files around, but get it wrong and it gets very tricky to fix if you don’t know what you’re doing.
After messing about, it’s kinda working, again, but it certainly ain’t the normal integrated GPU graphics. gfxCardStatus still sees the discrete GPU as active. I’m guessing this is some kind of fallback, non-accelerated mode, where all the graphics are drawn in software. All the video hardware is really doing here is providing a buffer like in the days of yore when computers were steam powered. All Macs produced any time recently have had some kind of GPU, so this must be a fallback/reference mode not intended for regular use. It’s buggy – there are graphical artifacts everywhere. Unsurprisingly the frame rate when dragging a window around is low. But it still is vastly more useful than no computer at all.