25 Apr. 2013
The SoC GPU driver interview
A few weeks ago, I published a status article about a few SoC GPU drivers. SoC GPUs are a very hot topic that deserves a prominent exposition (even more than desktop GPU drivers in my humble opinion).
The current crop of SoCs already provides unbelievable power to game developers and GPU lovers - and the versions that are coming are even more powerful. In a few years they will be able to compete with traditional desktop GPUs on the laptop market. As they become more and more powerful, there is a good chance that they will be able to compete with them in the desktop world as well. At the same time, the traditional AMD or NVIDIA desktop GPUs will have a hard time displacing them from the mobile market, as these SoC GPUs are both powerful and energy efficient.
There is a good chance that SoC GPUs will take over the world in the coming years. That's part of the reason why I believe that the new series of open source GPU drivers is of the utmost importance. And that's why I contacted several driver/tool maintainers/contributors:
- Connor Abbott (CA) - Open GPU Tools (Connor works on lima support here)
- Erik Faye-Lund (EF) - grate (for Tegra GPUs)
- Herman H. Hermitage (HH) - Videocore (for Broadcom GPUs)
- Luc Verhaegen (LV) - lima (for MALI GPUs)
- Matthias Gottschlag (MG) - Videocore (for Broadcom GPUs)
- Rob Clark (RC) - freedreno (for Adreno GPUs)
- Thierry Reding (TR) - grate (for Tegra GPUs)
- Scott Mansell (SM) - Videocore (for Broadcom GPUs)
- Wladimir J. van der Laan (WL) - etna_viv (for Vivante GPUs)
The good news is that they were kind enough to answer my rude and pitiful requests over the course of several weeks (the interview process started on March the 14th and ended on April the 20th). I'd like to thank them all for allowing me to steal their valuable time.
Before we really begin, let me innovate a bit and do something no other interviewer has ever done. Let me ask this incredibly novel and dense question: who are you?
RC: I'm Rob Clark. I've been working on arm/SoC stuff for quite a long time, and at the time I started on freedreno I was working for Texas Instruments. (Now working at Red Hat.)
SM: I get the feeling this might be the hardest question in the interview.
Let's just say I'm a 24 year old Computer Science student from New Zealand. Previously I've been involved with projects like hacking cheap Chinese mp3 players and enabling homebrew software on the PlayStation 3.
WL: I'm just a guy from the Netherlands interested in GPUs, GPGPU, and reverse engineering. I did a CS PhD on processing volume data on GPUs.
I've been working on this kind of stuff for quite a while, for example back in ~2008 I figured out the NVidia G80 shader isa and wrote decuda, a (dis)assembler. My first experience with ARM reverse engineering was in 2004 with Nokia phone firmware.
MG: I am a 21-year-old student from Germany who enjoys hacking on low-level code. Currently I am writing my bachelor thesis on virtualization and NVidia GPUs; together with current exams, that unfortunately doesn't leave any time for reverse engineering at the moment.
TR: My name is Thierry Reding. I work as a software engineer at Avionic Design. I mostly do kernel and plumbing work and have lately been involved with the effort to provide better upstream support for NVIDIA Tegra SoCs.
EF: I am Erik Faye-Lund, a graphics developer from Norway. I used to work at ARM (on the Mali OpenGL ES team), so I'm not without experience from the field. I worked on the Mali software-team pretty much from its infancy, up until the Mali-400 started shipping. I'm currently having fun with poking into the Tegra 3D registers, with great help from Thierry Reding.
HH: I'm Herman Hermitage. I've enjoyed working and tinkering with software and hardware since the 80s across a range of areas, particularly graphics, reverse engineering and security. I've had less chance in recent years to focus on these things, so I picked up a RaspberryPi in 2011 to tinker.
LV: I am a 33-year-old Belgian who moved to Nuernberg, Germany, for SuSE Linux in 2007, and I liked the place so much that I stayed. I have been working on open source graphics drivers for close to a decade. I have pioneered some fundamental ideas for modesetting, I have played a rather crucial role in freeing ATI, have done a few other things left and right, and now I seem to have started this open ARM GPU thing. I usually find myself swimming against the stream and trailblazing something or another, and stepping on a lot of toes in the process, even if that is simply by stealing someone else's thunder.
CA: As you may have gathered from Luc, I am a 16 year old high school student with a rather... unusual hobby :). I started down this path at around 8, when my father gave me a book on the C language, and I've been learning things on my own ever since. I'm currently trying (and sometimes succeeding, sometimes not) at balancing this project, my schoolwork, summer internships, and the college admissions process (as I am a junior in high school, it usually starts kicking into gear around this time).
What is the current state of your respective projects?
RC: Basic adreno support for a220 is just recently merged to mesa git master. Some feature and performance work still to go. I believe that with not too much work this should work on other a2xx devices. I've just recently started on r/e for a3xx, which is somewhat different compared to a2xx (new shader ISA, bunch of register changes and reshuffling). At this point for a3xx I have a working shader disassembler but I've not had time yet to go too far on the cmdstream parsing and figuring out the new registers, etc.
Update: I actually have a shader assembler, and all the libfdre tests working now on a320/nexus4.
I'm just starting now on the gallium driver support.. or rather, starting on the pre-step of refactoring of the current gallium driver to split out the parts that I have discovered as common between a2xx and a3xx (ie. some of the higher level stuff, such as how tiling works) from the parts that are not (ie. compiler, and all the register level stuff)
a3xx seems quite fast so far, and the compiler will certainly be interesting to write :-)
SM: I'm not exactly up to date with our status, so this may be slightly out of date.
We have mostly reverse engineered the custom instruction set of the VPU (One of many cores on the chip; it's a standard scalar CPU optimised for multimedia (particularly video encoding/decoding at Standard Definition resolutions). It is responsible for booting the system, loading the linux kernel into memory before the arm core even executes a single instruction. Once linux (or another OS) boots it acts as a multimedia co-processor running the OpenGL driver and offering various video/audio/image decoding/encoding services to the main OS).
We can run custom code on the VPU, either by replacing the bootloader on the SD card, or from linux using a backdoor which the Raspberry Pi Foundation nicely added for us. Right now code has to be written in assembler, but there are a few efforts to port a compiler to the VPU arch.
As for the actual GPU part of the Videocore, the instruction set of the QPUs (universal shader cores) is known, but most of the hardware registers are unknown so we have no idea how to run them. From boot we can't even get a framebuffer or enable the arm core.
Currently we are working on documenting all those registers.
WL: I figured out the state bits and command stream format needed for rendering, as well as the shader ISA. I have a working and fairly complete Gallium-ish driver for GC600/800 (and sjhill is working on GC880 support); the biggest thing still missing is the TGSI->Shader compiler, which I'm currently working on. Next on my list is integrating the stuff into Mesa, to get OpenGL support, and working on GC2000 and OpenCL support.
MG: I don't think I have to add much to what Scott said. I have been working on figuring out more about hardware initialization, but I would say that it is considerably more difficult with the Pi compared to other GPUs. That's because it isn't just a GPU, so we cannot just "run the binary driver and capture its output", because there is no underlying operating system we could use. My solution to that has been to write an emulator for the firmware, but the emulation means that there are more than just a few timing related issues where the firmware will not work as expected.
But hey, we have SDRAM up and running - some months ago all we had was 256kB of L2 cache!
TR: The starting point for most of the current work is a manual playback of a simple OpenGL ES 3-color smooth shaded triangle. Erik Faye-Lund has done a fantastic job at figuring out a lot of the details.
When I started working on grate initially I kept looking at Erik's work a lot for inspiration. One piece that was still missing, though, was a way to see actual results from feeding a command stream to the 3D engine. Perhaps my biggest contribution yet was to figure out how to detile the framebuffer and write it to a PNG so that one could look at the results.
Right now we're still in a very early stage and the work cycle basically consists of modifying the command stream, running the program and looking at the generated PNG file. One of the things I'm currently working on is to get support into the mainline kernel for displaying render buffers on top of the tegra-drm driver (possibly using hardware overlays). Erik is certainly busy as well, but I'll let him speak for himself.
I should say that NVIDIA has so far been very helpful in providing a lot of code (like the host1x kernel driver) and the occasional hint along the way. They are very silent when it comes to the internals of the 3D engine but perhaps that will change as time goes by.
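The detile-and-dump step Thierry describes can be sketched roughly as follows. This is only an illustration: the interview does not give the actual Tegra tile layout, so the tile size (`TILE_W`, `TILE_H`), the RGBA8888 pixel format, and the row-major tile ordering are all assumptions, and the function names are hypothetical.

```python
import struct
import zlib

TILE_W = 4  # hypothetical tile width in pixels (assumption)
TILE_H = 4  # hypothetical tile height in pixels (assumption)
BPP = 4     # bytes per pixel, assuming RGBA8888


def detile(tiled: bytes, width: int, height: int) -> bytes:
    """Convert a tiled buffer into a linear, row-major one.

    Assumes tiles are stored one after another (left to right, top to
    bottom) and that pixels inside a tile are row-major.
    """
    if width % TILE_W or height % TILE_H:
        raise ValueError("dimensions must be multiples of the tile size")
    tiles_per_row = width // TILE_W
    linear = bytearray(width * height * BPP)
    for y in range(height):
        for x in range(width):
            tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
            in_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
            src = (tile_index * TILE_W * TILE_H + in_tile) * BPP
            dst = (y * width + x) * BPP
            linear[dst:dst + BPP] = tiled[src:src + BPP]
    return bytes(linear)


def png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def encode_png(rgba: bytes, width: int, height: int) -> bytes:
    """Encode linear RGBA8888 pixels as a PNG without scanline filtering."""
    stride = width * BPP
    # Each scanline is prefixed with filter type 0 (None).
    raw = b"".join(b"\x00" + rgba[y * stride:(y + 1) * stride]
                   for y in range(height))
    # 8 bits per channel, color type 6 (RGBA), no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", zlib.compress(raw))
            + png_chunk(b"IEND", b""))
```

Writing `encode_png(detile(buf, w, h), w, h)` to a file would then give an image to inspect after each command-stream change, which is the work cycle described above.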
EF: I'm currently working on trying to confirm the suspicions I've built up from looking at several megabytes of command streams.
So far we've managed to replay a rendering job, modify the viewport and scissor box as we want, perform clears of tiled and linear framebuffers, and modify the draw-call parameters.
I've also managed to figure out how primitive restart (ish) works, so draw-calls beyond the limitations of the hardware-register bits (more than 4096 vertex indices) can be split into multiple draws.
In addition to this, we have somewhat working disassemblers for the vertex and fragment shader units.
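The arithmetic behind splitting an oversized draw-call can be sketched as below. This is a minimal sketch, not the grate code: it assumes the 4096-index figure is an inclusive per-draw limit and that the topology is a list of independent primitives (such as a triangle list), so each chunk only has to end on a primitive boundary. It does not model the hardware's restart mechanism itself, and `split_draw` is a hypothetical name.

```python
MAX_INDICES = 4096  # per-draw limit mentioned in the interview (assumed inclusive)


def split_draw(first: int, count: int, verts_per_prim: int = 3,
               max_indices: int = MAX_INDICES):
    """Yield (first, count) pairs, each within the per-draw limit.

    Assumes an independent-primitive topology (e.g. a triangle list), so
    chunks only need to end on a primitive boundary.
    """
    if verts_per_prim <= 0 or max_indices < verts_per_prim:
        raise ValueError("limit too small for one primitive")
    # Largest chunk that holds a whole number of primitives.
    chunk = (max_indices // verts_per_prim) * verts_per_prim
    end = first + count
    while first < end:
        n = min(chunk, end - first)
        yield first, n
        first += n
```

For a triangle list, `list(split_draw(0, 9000))` produces three draws of 4095, 4095 and 810 indices. Strips and fans would additionally need overlapping vertices between chunks, which this sketch leaves out.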
HH: Early days but good steady progress has been made. We have a good understanding of user mode on the BCM2835 VideoCore cpu - this has very much been a group activity. Tiernan Hubble mastered the integer vector instruction set encodings and Eizo-san has used these in anger for exploring accelerating X11. Mark Marshall has added initial VideoCore IV support to binutils, and I believe he and Felipe Magno de Almeida are exploring GCC support whilst Scott is exploring LLVM support. In terms of exploiting floating point compute power, we have the encodings established for the unified shader (QPU) and a method for intercepting them as OpenGL ES shaders are compiled - however this work hasn't been made public yet (not in a fit state). Matthias is doing amazing work on emulating and capturing hardware register traces to get a better understanding of the functional blocks in the SoC. The firmware blob is running ThreadX and essentially it's an entire OS and set of services in its own right, with ARM + Linux a tiny parasite on the side. The blob is a statically linked thing on the RPi, so it's a little hard to replace, say, just the Khronos library (OpenGL ES, OpenVG etc) with an open version without getting in very deep - ie a complete rewrite of all services. My own personal focus is more on coming up with a way to exploit the compute power that exists in the silicon, and undertaking analysis on the blob.
LV: Mali-200/400 is a rather crazy architecture. All of the optimization is baked into the design, and when everything is implemented correctly, it simply is this blazingly fast. The craziest part of it is probably the vertex shader. Connor Abbott, the high school student who has spent his past years spare time on reverse engineering the mali shaders, has finally gotten to a stage where his compiler can produce correct results. Due to the architecture of the mali vertex shader, this was quite an amazing amount of work. And yes, i said compiler there... He had no other option but to implement a partial compiler for this architecture, as an assembler on its own just could not produce any useful results and needs the timing of instructions baked in as well.
With the output of Connor's shader compiler, we now have a port of Quake 3 Arena running on top of the prototype driver. We are measurably beating the mali binary driver with our proto driver, and our pre-compiled shaders, when running the Quake 3 Arena timedemo. And this simply by doing things right, and not by depending on some application or SoC specific hacks for those few extra fps. This is quite unique and should not be underestimated: we have the possibility with the mali-200/400 to be as fast as the binary driver, in the general case.
Now that we have proven this capability with our research code, i am satisfied that we know enough to be able to properly implement mesa support. And this is our next big task, for both me and Connor.
And once we have a decent mesa driver going, we will move on to the new mali generation, and with all the experience we gathered from mali-400, we will make minced meat of the t6xx :)
CA: There are two parts for what I'm trying to do: the Geometry Processor (GP) which runs vertex shaders and the Pixel Processor (PP) which runs fragment shaders - they both have completely different ISA's so unfortunately we can't share much code/effort between them. Eventually, I'd like to have a compiler hooked into Gallium/TGSI for both of them that works at least as well as the binary compiler, which is a somewhat far-fetched goal (at least without some more help :) ), but the most important bit is to get something working, even if it isn't very optimized yet. I've gotten carried away sometimes, but I hope I can start looking into Gallium very soon - although it might require some changes to the interface which will take some time.
The first stage was reverse-engineering the ISA for both processors, so that we could understand them well enough to produce our own shaders. This took Ben Brewer (a Codethink employee who worked on shaders with me, although he sadly isn't working on it currently) and me ~4-5 months to do, by which time we had almost everything nailed down (although new details seem to keep coming up now and then...). Next, we started writing our own compiler backends. Ben started work on ones for both the GP and PP, although sadly he had to leave when it was at a very early stage. We decided to write the compiler backends first, doing the parts we knew needed to be done and then looking at translating TGSI to our IR once we had it working. Since then, I've revamped/rewritten a lot of code that he wrote for the PP, adding a new backend based on a written-from-scratch IR (which I call pp_lir) to replace the half-baked backend in his lima_ir (which I've since renamed to pp_hir while significantly changing/expanding it). As for the GP, after some consideration I ended up scrapping the work he had done for it, writing my own... rather interesting IR to deal with the highly unique architecture.
I was working on this when Luc mentioned his Q3A demo to me, and pointed out that it would be best if we could make the entire thing open-source by using these experimental assemblers/IR's to generate the required shaders that were, as of then, being generated using the binary compiler that comes with the Mali drivers. I replied that the PP would be easy, since we can just manually convert the ESSL shaders into assembly and then use our assembler to produce the final binary (I could have used pp_hir and pp_lir, but the generated code would have been less optimal since some optimizations haven't been implemented yet). This method wouldn't work with the GP, however, because it's impossible to write any shaders that aren't incredibly simple in assembly. This is due to the fact that the architecture is scalar (i.e. each component of every vector operation must be scheduled independently, meaning that even the simplest shaders have a lot of operations to schedule) and there are many complicated constraints that must be met on each operation to "make it fit" in order to avoid register loads/stores and use up limited resources in an efficient manner. Therefore, we had to write our own scheduler with some kind of register allocation just in order to make this simple demo work, and this motivated me to get the code I was working on running as soon as possible. Unfortunately, I couldn't get it working before FOSDEM this year, but Luc and I coordinated to finally put it all together this past weekend, so now we can both look forward to the future - Gallium3D!
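To give a feel for why a scalar architecture multiplies the scheduling work, here is a small sketch of expanding vector operations into per-component scalar ones. It is not the Mali GP ISA or Connor's IR: the opcode names, register names and tuple representation are all hypothetical, and none of the real placement constraints are modeled.

```python
from typing import List, Tuple

# A scalar operation: (opcode, destination, sources). All names hypothetical.
ScalarOp = Tuple[str, str, Tuple[str, ...]]

COMPONENTS = "xyzw"


def scalarize(opcode: str, dst: str, srcs: List[str],
              width: int = 4) -> List[ScalarOp]:
    """Expand one component-wise vector operation into scalar operations."""
    if not 1 <= width <= len(COMPONENTS):
        raise ValueError("width must be between 1 and 4")
    ops = []
    for c in COMPONENTS[:width]:
        ops.append((opcode, f"{dst}.{c}", tuple(f"{s}.{c}" for s in srcs)))
    return ops


def scalarize_dot(dst: str, a: str, b: str, width: int = 4) -> List[ScalarOp]:
    """Expand a dot product into scalar multiplies followed by adds."""
    if not 2 <= width <= len(COMPONENTS):
        raise ValueError("width must be between 2 and 4")
    ops = scalarize("mul", "tmp", [a, b], width)
    acc = f"tmp.{COMPONENTS[0]}"
    for i, c in enumerate(COMPONENTS[1:width], start=1):
        # The final add writes the real destination; earlier ones use temporaries.
        out = dst if i == width - 1 else f"acc{i}"
        ops.append(("add", out, (acc, f"tmp.{c}")))
        acc = out
    return ops
```

Under these assumptions a single four-component dot product already becomes 7 scalar operations, so a 4x4 matrix times vector transform (four dot products) becomes 28 operations that each have to be placed individually.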
What motivates you to work on these projects?
RC: Why let the desktop crowd have all the fun with open source drivers! But seriously, I see that tablets and phones and these sorts of devices are somewhat replacing laptops and desktops. So I think that it is important that the free and community developed graphics stack, toolkits, etc, can play on these devices. The biggest or at least most immediate stumbling block so far is availability of graphics drivers. We see a growing trend of hacks to use android binary blobs, which really I think is a symptom of lack of open source graphics driver options.
SM: I like taking things apart to see how they work, not just hardware but software too.
As various details about the Raspberry Pi started coming out I became interested in things like the boot process, with the binary blob running on the GPU being in control. Various people on the forums claimed it would be impossible to reverse engineer, modify or replace the binary blob as the instruction set was unknown. Instead I stumbled into Herman Hermitage on IRC and we set upon proving them wrong.
Basically I do it for fun. But the fact the open source Raspberry Pi has this massive closed source binary blob does annoy me.
WL: General interest in GPUs and reverse engineering, and figuring out how hardware/software works. I really cannot stand mysterious binary blobs for some reason :-) It is completely unrelated to my job.
Also the GPU is the only part of the Freescale i.mx6 SoC that has closed-source drivers, and for Marvell Dove it is one of only two (the VPU is also closed-source). It would be nice to have it working with full open source, so that it is possible to upgrade kernels/ABIs, troubleshoot bugs and performance issues ourselves instead of having to wait ages for the vendor to contact Vivante and for them to release a new driver.
MG: Why do I do it? Well, mainly because reverse engineering is fun (even though I tend to take breaks from it if it gets too frustrating). There are other reasons why a free firmware for the Raspberry Pi would be nice to have, e.g. to be able to use it as a second realtime core - but for me, it is mostly because I enjoy playing with the hardware.
TR: Primarily I've been enjoying the learning experience. Computer graphics have always fascinated me and I like finding out how hardware works. On the other hand I've also experienced the usual problems when dealing with binary blobs, so I'd like to eventually see an open-source driver for the Tegra. So far it seems like the best way to do this is by reverse-engineering. I hope that will change, though, and that more vendors will eventually start contributing documentation and perhaps even code to the open-source drivers.
EF: Definitely the feeling of accomplishment; it's fun when you start to realize how a relatively complex system works. I also enjoy reading a bit up on details on the Linux graphics stack.
HH: I think it's the child in me - I just have to know how it all works, and of course we all enjoy exploring the limit of just what is possible.
It's also exciting to think of a future where we could put a team together and create an Open CPU and GPU. Looking at all the great work going on with these SoCs, one can't help but feel that day is getting closer.
I'm also a huge fan of Alan Kay's work. So from a philosophical point of view, I think it's an important activity to examine these SoCs and attempt to communicate the fundamental simplicities to the up and coming generations of engineers and scientists. It's all about opening the door of difficulty and showing the true simplicity behind that door.
LV: As said before, trailblazing is something that apparently gives me great satisfaction. And, of course, there is the thrill of finding out things through hard labour and deduction, especially if people do not really want you to know such things.
There were several concrete and personal reasons for me to do this though.
First off, after the RadeonHD project, it was clear that i did not want to waste further time on the x86 linux desktop. It's a political swamp, where correct insights, hard labour, and actual results are all irrelevant if you do not belong to the right political group. And the lengths this group would go to to affirm their own supposed awesomeness shocked even me. The consolidation that happened in the x86 desktop graphics market, with the few remaining players now sticking hard to their respective position towards open source software, gives little option to do real work and achieve results outside of what is dictated by corporate politics. It was time to move to something new, and the completely level and open playing field of ARM GPUs was exactly that.
Secondly, when looking at the way the world was turning, everything was becoming ARM, and (almost) everything ARM was running some form of linux. There was only one key reason why people were not running a proper full linux on their ARM devices. One key factor that made this nigh impossible: binary userspace drivers. Someone had to get stuck in, prove that this is not beyond reach, and change this stalemate for good. This was a perfect wall for running a very hard head into.
Finally, while I have covered most sides of graphics driver development in the meantime, I had not done any real 3D driver development before. When I was the only person who cared about boring modesetting, I often had to hear how difficult 3D graphics hardware is. Some part of me always knew that that was an excuse to not deal with the boring but highly important bits, but now I got to prove that this is not the case. If the hardware is sane, a 3D driver is sane as well, but even when the hardware is sane, the sheer number of combinations possible make display driver development impossible to get absolutely right for everyone. A few FPS less is not the end of the world, a display that does not show any image means that the user will turn and walk away. And it indeed was an excuse for not dealing with what people really needed at the time.
Looking back, I am really really happy with how this has turned out. I had hoped that we would have, in the meantime, convinced one hw vendor already (which then means that a few others will follow), but that still has not happened, although I am still hopeful. I am really pleased however with how lima was the motivator for several people to contribute either directly to lima, or for tackling a similarly insane task on their own. We now have a group of highly clued and dedicated people who devote their spare time to a task they believe is right, and which they enjoy doing. And people just do their stuff, there is no politics hampering anything, we are all too busy fighting our binary blobs and hardware to spend time doing anything else. And we are all hurtling rapidly towards being actually useful for users, I am sometimes shocked to see what progress some guys seem to make overnight. Two years ago, when I started down this path, I couldn't have dreamed that such a marvelous outcome would be possible.
CA: Sometimes, that's a question that I ask myself :). I think a large part of it is simply the desire to figure out what's going on - can I put this together? I think that motivation helps especially with reverse-engineering, since it feels like a game where you're solving some particularly hard puzzle - and it's a fun game too, although frustrating at times. As for writing the compiler, the motivation is more extrinsic. I'm taking the information I learned through reverse-engineering and applying it to a useful end, i.e. writing an open-source driver for the Mali. I want this driver through a frustration with the current situation in ARM graphics (although some vendors, especially Nvidia, seem to be getting a clue and things are starting to progress), but also it's a fun project to learn a lot about graphics drivers and graphics in general, a lot more and much faster than I could have done through working on a desktop, x86 driver. In my case, I didn't even need any actual hardware to start contributing, since ARM helpfully provides an offline compiler that can be (ab)used for analyzing its generated output. And even for projects that do require hardware, the stuff itself is really cheap to get compared to a desktop setup, where you typically need two full-fledged computers (one for running the experimental code, and one for debugging it) and several expensive, power-sucking graphics cards in order to test your code.
What was the starting point of your development? Were you able to find any documentation or code to help you through your reverse-engineering journey?
RC: Well, I actually started on 2d first. adreno/freedreno is actually about two distinct graphics accelerators, there is a 3d core that is derived from radeon, and a 2d core which really has nothing in common except sharing the MMU. But since they are using roughly the same ioctls for command submission, this more or less applies to both.
The first step was kind of boring.. Just writing some simple test programs (plus figuring out how to link against android bionic/libc with a normal glibc gcc toolchain, since I don't actually have linux/glibc versions of the blob drivers). Once I had some tests, then I fired up strace to get an idea of the interfaces between kernel and userspace, and started looking at the kernel driver code and corresponding ioctl structs. And with that understanding of the kernel<->userspace interfaces, I wrote my creatively named libwrap.so, which is a shim that can be LD_PRELOAD'd and log the cmdstream, keep track of gpu buffers, etc. And based on the log files generated by libwrap, I started figuring out how to parse the cmdstream dumps.
On the 3d side, I had some context save/restore code in the kernel driver from the msm android kernel tree. That gave me a pretty good starting point for parsing the 3d cmdstream dumps, but it didn't really help for 2d or figuring out the shader ISA. And also, due to the AMD/ATI heritage, I spent a lot of time reading the public r300/r600 docs. Some way into the 3d, after I already had the rotating cat demo and my own shader assembler, I found a freescale kernel which had the original amd-gpu kernel driver for adreno a200. (I have been mainly working on the a220, fwiw.) But this had definitions for the bitfields in most of the registers and shader ISA instruction encoding. So it would have been useful to have much earlier, but all the same I was able to correct a few mistakes in what I'd figured out about register bitfields and shader ISA... a few places where I was off by a bit or two on sizes for some bitfields, etc.
I guess it is kind of a detective game. You search for hints and clues everywhere.
WL: The starting point of my development was a rooted Rockchip 2918 tablet a friend gave me. I wanted to hack on it, and when I first started playing with android rendering internals, I hoped it'd have a Mali, Adreno, or similar GPU. But it turned out to be this (new to me) Vivante GPU. As there was no project making open source drivers for those yet I thought it'd be fun to see how far I'd get. I also wanted to do some experiments with binary static analysis and symbolic execution to automatically map GL state to HW state.
In contrast to the userspace, the kernel driver is GPLed, which is great, and I found various instances of the GPL drivers in SoC kernel trees. Unfortunately at first, none exactly matched the interface of my device. It took some puzzling with fields to get that part right, but it helped me along greatly in understanding the hw and kernel interface.
After a month or so I found out that there were some more things already out there. There is the (defunct) gcx project which was working on an accelerated 2D driver, and some TI OMAP documentation about the 2D hardware. Nothing about 3D, though (that's the same as with many other GPU vendors). This did help me fill in some gaps in the command stream format as the 2D and 3D pipe share the same DMA interface.
EF: I simply started by intercepting writes and ioctls to /dev/nvhost-gr3d, and looking at what happened. Luckily, NVIDIA had released the source code for their kernel module, so I could more or less figure out what the ioctls did. Also, most of the CDMA (the command-stream DMA) opcodes were somewhat documented through usage there, so I could figure out how hardware-registers were written. Then, I looked through a lot of dumps of carefully crafted GLES-programs to see how things worked.
At a later point, NVIDIA also released documentation for their 2D unit, and some source code using it. This has been quite valuable so far as well.
Another thing that was useful, was that NVIDIA has released an off-line compiler. I used this to begin with when disassembling the shaders, but now we are tapping into their on-line compiler instead.
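The kind of dump-reading Erik describes, turning a captured command stream into a list of register writes, can be sketched as follows. The interview does not spell out the CDMA opcode encoding, so the 32-bit header layout here is invented purely for illustration and should not be read as the real hardware format; `decode` and `words_from_bytes` are hypothetical helpers.

```python
import struct
from typing import Iterator, List, Tuple

# Hypothetical header layout (NOT the real hardware encoding):
#   bits 31..28  opcode (1 = incrementing write, 2 = non-incrementing write)
#   bits 27..16  first register offset
#   bits 15..0   number of data words that follow
OP_INCR = 1
OP_NONINCR = 2


def decode(words: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (register_offset, value) for every write in the stream."""
    i = 0
    while i < len(words):
        header = words[i]
        opcode = header >> 28
        offset = (header >> 16) & 0xFFF
        count = header & 0xFFFF
        payload = words[i + 1:i + 1 + count]
        if len(payload) != count:
            raise ValueError(f"truncated packet at word {i}")
        if opcode == OP_INCR:
            # Consecutive registers, one data word each.
            for n, value in enumerate(payload):
                yield offset + n, value
        elif opcode == OP_NONINCR:
            # Every data word goes to the same register.
            for value in payload:
                yield offset, value
        else:
            raise ValueError(f"unknown opcode {opcode} at word {i}")
        i += 1 + count


def words_from_bytes(dump: bytes) -> List[int]:
    """Interpret a captured buffer as little-endian 32-bit words."""
    if len(dump) % 4:
        raise ValueError("dump length must be a multiple of 4")
    return list(struct.unpack(f"<{len(dump) // 4}I", dump))
```

Decoding the dumps of two GLES programs that differ in one piece of state and comparing the resulting write lists is one way to narrow down which register carries that state.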
CA: I had done a little graphics stuff a while ago, but the memories were rather vague - so the first step was to read the ESSL 1.0 specification from front to back to familiarize myself with the operations, features, etc. that I would have to figure out while reverse-engineering. It was actually surprisingly readable, being rather simple compared to desktop GLSL and free of legacy features. Then I compiled some simple shaders using the offline compiler, trying to figure out the format of the MBS files that it produces. Luckily, it was pretty similar to the format of the online compiler, which Luc had already reverse-engineered, and I quickly wrote a simple utility to dump the shader binary itself (i.e. just the instructions themselves, without any of the metadata the compiler inserted for the driver to use). Combined with a hex editor to look at the resulting file, that was all I used to start out. Luc had already figured out a small portion of the vertex shader ISA for his linking code, so that helped us a lot. But perhaps the most help came from Erik, who had internal knowledge from working on the Mali team, and gave us hints whenever he could and stopped us from going down dead ends.
Unfortunately, there wasn't much code/documentation we could use when we started out other than what Luc had already done - so we pretty much started from scratch. What we did, though, was write a lot of our own code and documentation. I think that when reverse-engineering something, especially something as complex as the Mali shader ISA's, it's critical that you write all your guesses down. And then, when/if they turn out to be correct, you have to immediately turn around and write some code, something that abstracts away the details and keeps you from having to mentally parse all the stuff you've already figured out when you're trying to reason about something that's still unknown. I was lucky that Ben, who was paid to work on it, could spend more time doing the boring things, such as implementing a disassembler to help me while we were figuring out the ISA.
LV: My starting point was the goal of freeing ARM GPUs, and then figuring out a plan of attack from scratch. And it started with picking a good candidate, and that ended up being ARM's own Mali.
I first had to get mali hardware, but the only hardware available at the time was a Mali-200 on the cheap and horrible Telechips tablets of late 2010. Only a pretty awful kernel tree was made available, and all we had which was runnable on actual devices, and which included binary drivers, was android-2.1. This did not even have LD_PRELOAD and i wasted my spare time trying to hack ELF binaries directly to achieve an effect like LD_PRELOAD. This and the android bionic linker did not make that too successful an endeavour. Luckily, towards the summer of 2011, an android 2.3 image became available for a telechips device, and then things finally started moving along.
TR: My first steps were to write a DRM driver because the display driver that NVIDIA provided was something between FB and DRM, with a completely non-standard interface. Once the DRM driver was merged into the mainline kernel (3.8), NVIDIA actually started contributing patches to support the host1x infrastructure on top of that. By the looks of it the patches should be able to make it into 3.10.
Those patches can be used to implement 2D and 3D acceleration. NVIDIA has provided documentation about the host1x and gr2d engines, so those parts don't need to be reverse-engineered. While NVIDIA plans to port their binary drivers to the DRM interface, there are no indications so far that any of the code will be open-source.
Meanwhile I've been spending some time on making the grate code work on top of the DRM interface and I have a couple of patches to the kernel which allow this to work. The nice thing is that grate now has an abstraction layer which works on top of both the L4T (NVIDIA-provided) and upstream kernel interfaces. So our very basic 2D and 3D test programs run unmodified on both L4T and upstream kernels, where upstream kernels even support on-screen display.
MG: There was no interesting code available, but the documentation actually was helpful 1-2 times. Not because it documented anything interesting, but rather because, as it should be in a properly designed system, some registers looked similar to ones documented in the manual (mainly timer related registers). The most important help though were the hardware patents - they sometimes contained a fairly precise description of instruction encodings or other aspects of the programming model.
How does your project relate to the other projects in the same field? Were you able to reuse some code, design ideas...?
RC: I think I took a relatively similar approach to what Luc had done on mali.. and I think it is probably similar to grate/etnaviv. I guess it is pretty standard technique for r/e'ing a userspace blob. I think I probably stole a piece or two of util code (like bmp dumper), and some of my early gles tests from lima. Probably a lot more useful was the idea sharing and bouncing theories about how things work on #lima and #freedreno.
I've switched to using envytools for a3xx. I didn't use it for a2xx as I was pretty far through the process of writing my own hand-written parsing when I discovered envytools. But I guess since I'm starting again with a3xx, it is a good time to make the switch.
WL: I used tools from nouveau (envytools). As they already defined and documented an excellent format for mapping the state space, this allowed me to focus on doing experiments for finding out the bits instead of thinking how to structure things. This shows that, even though there is a wide difference in GPU architectures, there is also a lot in common between different endeavours and we can benefit from standardized tools.
Overall I followed the same approach as Lima, by first running GLES demos on an off-screen buffer and dumping the output to bmp files, intercepting and later replaying the command streams and memory deltas. These can then be analyzed to see what the blob does, and why. This allowed me to avoid the Android surfaceflinger labyrinth and gory framebuffer details at first.
While figuring out the command stream I wrote increasingly higher-level utility functions to generate it, and these converged on a gallium-like interface that will eventually become the driver.
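[editor note: to give a rough idea of what a "memory delta" is, here is a minimal sketch of my own. It is not code from etna_viv or lima, and the function names and the 4-byte word size are assumptions made for illustration. It compares two snapshots of the same buffer, keeps only the words that changed, and can replay those changes on top of the older snapshot.]

```python
def memory_delta(before, after, word_size=4):
    """Return (offset, old_word, new_word) for every word that changed.

    `before` and `after` are two snapshots (bytes) of the same buffer,
    taken around a call into the binary driver. Hypothetical helper.
    """
    if len(before) != len(after):
        raise ValueError("snapshots must have the same length")
    changes = []
    for off in range(0, len(before), word_size):
        old = before[off:off + word_size]
        new = after[off:off + word_size]
        if old != new:
            changes.append((off, old, new))
    return changes


def apply_delta(snapshot, changes):
    """Replay a recorded delta on top of an earlier snapshot."""
    buf = bytearray(snapshot)
    for off, _old, new in changes:
        buf[off:off + len(new)] = new
    return bytes(buf)
```

[A dump made of such deltas stays small and readable, since only the words the driver actually touched show up in it.]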
EF: I stole more or less my whole approach from the Lima-driver. I don't think we've used any of their code, though. On my part, most of the code has been relatively simple throw-away stuff anyway.
I also used the rnn/rnndb stuff from Nouveau's envytools for my earlier tegra-re code-base, but it's not working with the grate-code (yet?).
CA: For the compilers, I've stolen/used a lot of good ideas that I learned while doing my investigation into compiler theory; this is a well-established field, so even if it's complicated, at least I'm not exactly the first person to be implementing these algorithms. The register allocator I'm using, for example, is based on the one in Mesa that's used by the Intel and r300g drivers if I understand correctly.
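[editor note: as far as I know, the shared register allocator in Mesa is a graph-colouring allocator. The sketch below is a generic textbook illustration of that family of algorithms, written by me; it is neither Mesa's nor lima's code, and the function name and data layout are assumptions. Values that are live at the same time "interfere" and must not share a register; a value that finds no free register is marked for spilling.]

```python
def color_registers(interference, num_regs):
    """Greedy graph colouring of an interference graph.

    interference: dict mapping each value name to the set of values
                  that are live at the same time (symmetric).
    Returns a dict name -> register index, or None if the value
    could not be given a register and would have to be spilled.
    """
    # Handle the most constrained values (highest degree) first;
    # ties are broken by name to keep the result deterministic.
    order = sorted(interference, key=lambda n: (-len(interference[n]), n))
    assignment = {}
    for node in order:
        # Registers already taken by neighbours that were assigned one.
        taken = {assignment[n] for n in interference[node]
                 if assignment.get(n) is not None}
        free = [r for r in range(num_regs) if r not in taken]
        assignment[node] = free[0] if free else None
    return assignment
```

[Real allocators add register classes, coalescing and smarter spill heuristics on top of this basic idea.]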
LV: All of the lima code is written from scratch, there were no examples for me to follow.
I had to write the wrapping of relevant libc calls myself. This is all relatively straightforward, but there was nothing similar out there yet. I had to also write simple GLES/EGL tests myself, so i had to first read up on that as well. Then building all of this and running it on android, that was quite a pain as well, and no-one was using proper makefiles to do it, everyone just seemed to use the android build system.
SuSE's yast and snapper developer Arvin Schnell created some further tests for me, and wrote the Android demo app. This app goes fullscreen, stops rendering to fullscreen, and just runs the selected program. This meant that i could render to the FB without android interrupting, and that i could just select different tests and use the back-button to exit them. This was pretty important in the android only world.
TR: When I first started, Erik had already done a lot of work. I think the framework (if one can call it that) is in many ways similar to what lima and freedreno are using. I think there just aren't that many different ways to do it.
There are frameworks for the proper drivers (DRM, Gallium, DDX) so I've been looking at a lot of other drivers for inspiration. For the reverse-engineering parts there isn't really anything that I've reused. The envytools have already been mentioned a couple of times and I may still look at them eventually but I haven't had the time to do so yet.
The reverse-engineering projects have a lot of commonalities, but so far no common framework has materialized. I'm very new to this whole area, but I have a feeling that something common won't be easy (or might not even be possible at all) to write. But maybe we don't need any common framework. One of the most helpful things to have is a community of people to talk to. Everybody in the #lima channel on IRC has been giving very good advice.
What was the most difficult point you had to overcome since you began to work on your project?
RC: Hmm, I think r/e'ing a gpu (or possibly, even just writing a gpu driver) is a never-ending series of most difficult points to overcome ;-)
There were plenty of times where I was stuck on something or other for a few days or a week. I think the only one, really, where I had given up and then come back to it several times was on the 2d side.. when I was trying to enable batching of multiple blits in the exa driver, I was hitting cases where I'd either get too many or too few irq's (leaving me not knowing which blits had actually finished). Which turned out to be a simple issue.. a size field in hdr for the cmdstream which was not supposed to include the size of the header.
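[editor note: to illustrate why such a small detail can derail everything, here is a minimal sketch of my own. The packet layout is entirely hypothetical (one 32-bit header word with the payload size in the upper 16 bits and an opcode in the lower 16 bits); it is not the actual Adreno format. If the writer counts the header in the size field while the reader expects the payload size only, the reader swallows the next packet's header as data and loses track of the packet boundaries.]

```python
import struct

HEADER_DWORDS = 1  # assumption: one 32-bit header word per packet


def build_packet(opcode, payload, count_header=False):
    """Build one packet; count_header=True reproduces the bug."""
    size = len(payload) + (HEADER_DWORDS if count_header else 0)
    header = (size << 16) | (opcode & 0xFFFF)
    return struct.pack("<%dI" % (1 + len(payload)), header, *payload)


def parse_stream(stream):
    """Split a stream into (opcode, payload) pairs.

    The size field is interpreted as the payload size only,
    i.e. it does NOT include the header word.
    """
    words = struct.unpack("<%dI" % (len(stream) // 4), stream)
    packets = []
    i = 0
    while i < len(words):
        size = words[i] >> 16
        opcode = words[i] & 0xFFFF
        packets.append((opcode, list(words[i + 1:i + 1 + size])))
        i += HEADER_DWORDS + size
    return packets
```

[With correct sizes, two packets built by build_packet() come back as two packets with their own payloads; with count_header=True the second header ends up inside the first payload and the remaining words are misread as a bogus packet.]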
WL: The reverse engineering part was mostly smooth sailing. There were some more difficult derived bits, and cache flushing/synchronization is hardest to get right. I had quite some WTFs with that, and I'm still not sure it's entirely correct. There is also the "context" handling which tracks the entire state of the GPU in a big command buffer, to be restored at a later time. This is not handled yet and will certainly cause some issues when multiple processes are using the GPU at the same time. But I got to say that Vivante GPUs have a sane and well-designed hardware interface (contrary to their kernel interface,... but that's another story). Like Adreno, the shader ISA is straightforward, and shaders are unified.
Another recent difficult point was learning how Mesa and Gallium work. It's a huge (but well-structured) project, so it took me quite some time to get the hang of how and why CSOs are used, and the GLSL-to-TGSI-to-eventually-GPU shader compilation pipeline. I also had to revisit some bits of compiler theory along the way. Luckily the developers in their IRC channel have been really helpful.
The biggest difficulty is remaining focused in this modern age full of distractions and other interesting projects. Even though I'm experienced at reverse engineering it's a lot of work for (effectively) one person. A lot of people troll about open source projects "when is this driver finally ready?!?" without understanding how much work goes into it, and that it's all unpaid.
EF: I would say keeping up motivation. There's times where just nothing seems to make any sense, and it's hard to stay motivated then.
CA: Well, there were a lot of difficult points... I've never found an ISA as insane as what's in the Mali 200/400, and everyone I've ever talked to who's looked at it has told me the same thing. I think the lima project was lucky to have someone like me that could focus on the crazy shader stuff while Luc went ahead and got everything else done. It would have been impossible for anyone to do this much, this fast, while having to RE both the command-stream and the shaders; instead, we had a rather nice division of labor where Luc and I could focus on totally different stuff and then put our work together where it mattered (most recently, with the Q3A demo).
As for the compilers, I've definitely had the most difficulty with the GP. There were a lot of difficult parts, but the code that stands out to me as the hardest to write was the scheduler. I tried to structure it as cleanly as possible, so that I could try a different approach later while changing as little code as possible, but I think that plan was dashed as I began to comprehend the number of corner cases, complications, etc. Of course, it ended up being very difficult to write, and to this day I'm surprised that it works as well as it does - in all the (admittedly simple) shaders I've compiled, it seems to have performed as well as or better than the scheduler in the binary compiler. Hopefully, it'll continue to perform that well with more complex shaders and I won't have to dive into that code again :).
LV: Android and the lack of LD_PRELOAD was a pretty big hurdle. Time, and not staying focused is probably the other. Plus, the Mali-200/400 is not the easiest chip, but once it runs, it's fast and you do not have to play around much to optimize things further.
MG: As all we have right now to trace and reverse engineer hardware initialization is a small and buggy emulator, we keep hitting emulator limitations and annoying timing issues - up to a point where the software just is not able to initialize the hardware anymore because everything is running so slow. We don't have a solution for that either...
Luc, it's not the first time you mention or use Quake 3 to show your result. Is there a technical reason for this choice?
WL: I think because Doom3 is still too heavy for ARM-based devices and embedded GPUs :-)
LV: For several very good reasons.
First off, the process of reverse engineering a graphics driver is understanding job submission and wrapping things fully, so that you can see things changing in the command stream, submission order, and the rendered result. You then take very simple tests, and re-implement them in open source software, and then gradually bump functionality until you have figured out most things of the hardware. Only then, in my view, can you move to writing a real driver.
Q3A was just the next step after supporting multiple textures and multiple programs and multiple frames. It finally was a real world program where the tests were not written by an openGLES newbie with very specific functionality in mind. It's actually a program people use.
Q3A is a 1999 game which has been open sourced by idtech. It has built in benchmarking, and its timedemo has been a well known benchmark for 14 years. It also makes for a nice little demo and a nice crowdpleaser.
Q3A was ported to openGLESv1 and therefore supported by the binary driver. There is no point working with applications which are not running on the binary driver first, because then you are no longer reverse engineering a graphics driver, you are porting an application, or worse, a whole game engine.
By having to depend on the binary shader still, i rather liked the fact that Q3A was written for fixed function engines, and only needed a few very simple shaders to run with.
Q3A was also doing many things that i never had to deal with before. All sorts of parameters influencing rendering had to be set all of a sudden. I had to fix up job scheduling to get performance up, as i finally had a benchmark to go measure some things (in the first approximation). Then, and this was quite lucky, since Q3A is a GLESv1 program, made for fixed function engines, the mali binaries were doing a few things which i would've never discovered when using GLESv2 applications.
All in all, Q3A, while it might seem old and boring to some, was the perfect target, and the absolutely correct next step for the lima reverse engineering. And this is part of what reverse engineering is about: pick a target, and figure out the steps in between. Pick your target too ambitious, and you'll get lost in the swamp.
While Doom3 has been open sourced in the meantime, no-one has managed to make a correct openGLESv2 port of it. Oliver sadly left us before he could finish it, a great loss for us all. Doom3 would run a lot slower on our hardware, but since Q3A runs fully fragment shader bound on mali, it barely uses any vertex shader time. If Doom3 has 5x the vertex shader load, and 2x fragment shader load, it will run acceptably on the mali-400. It would've been a nice next target, but we would learn a lot less from it, but the amount of work needed would be less as well.
So those people who roll their eyes at Q3A and who wonder why i would spend time on porting and supporting such an old game, they simply do not know any better. They have not approached this from a reverse engineer's point of view, nor faced the facts and realities of existing GLES support. Q3A was the perfect mistress for me and the lima project.
Rob, freedreno is the first of this new crop of SoC GPU drivers to make its way into mesa/gallium. Yet development went public only one year ago (first commit on March the 24th, 2012). How did you achieve so much, so fast?
RC: Well, I guess I got a bit lucky on a few counts.
- context save/restore code in kernel, and similarities to r600 in some places helped early on to get an understanding
- unified shader ISA, so didn't have to figure out two different instruction sets for vertex and frag shaders.
- the a2xx shader ISA has a very simple instruction encoding, with a single format for all ALU instructions. (Plus two FETCH instruction formats, and a handful of CF instructions.) And nothing like the crazy of mali.
fwiw, a3xx has a new, slightly more crazy ISA, which I now have a disassembler for.. and I'm also in the early stages of figuring out the a3xx registers, since they shuffled everything around compared to a2xx.
LV: Rob is a highly clued and amazingly dedicated and focused graphics driver developer. His fast progress is mostly explained by that.
On the other hand, now that there are 3 other projects working with sane graphics drivers/hardware, it seems like we lima guys picked the most complicated target. For what Rob managed to do in about 2000 lines of fdre, i needed about 5000 on limare. Since i did not want to be too encumbered by Mesa while figuring things out properly, i have implemented things a lot further in the prototype driver before thinking about going into mesa. Now i topped 10kloc in limare, but i am still not doing any superfluous work for supporting Q3A.
I know that Wladimir is actively working on a gallium driver. Is there any other on their way?
WL: In my latest commits, the shader compiler is generating code from TGSI. I'm pretty close now. All the parts are there*. (some assembly required*)
Just kidding though, after the GL binding the real work only starts. Debugging, optimizing, profiling etc.
EF: No code yet, but we're looking into it. So far it looks quite possible to do, so... Let's hope it's not so far away.
[editor note: update] In the time since I wrote this, the situation has changed a bit (as Thierry said in his response). The Gallium driver is also slightly more capable than what Thierry said; it can currently clear the screen. That's not a very capable driver, but at least it's a first step.
CA: Be patient! It's been on my todo list to start working on Gallium for a long time, and recently I've been talking to some of the Mesa/Gallium developers about it. The thing is, I think we could write a simple, direct translation pretty easily, but taking advantage of some more advanced features of the hardware (especially varying/uniform packing), as well as supporting the binary compiler as a backend for comparison/optimization purposes, might require some more changes to the Gallium interface that are going to be somewhat harder to upstream. So I've been starting to look at the Gallium code, and trying to get my hands on some Mali hardware that can run full-fledged Linux (I may have to end up buying it for myself, if it doesn't show up...); right now, my time is more limited anyways, but hopefully I'll be able to hack on it more once it arrives.
LV: I am now cleaning up my Q3A support code, and Connor just got his OGT to generate the shaders for Q3A. We can now start looking into supporting Mesa.
I will not allow getting fully assimilated into gallium infrastructure. We have an easily accessible binary compiler, and we have just proven that we can have performance matching the binary driver. We have to keep the possibility of using the binary compiler, so that we can verify command stream and shaders separately, and make sure that we can deliver performance matching the binary stack, or at least figure out where mesa or gallium, or our implementation for it, is slowing us down.
We, very deliberately, delayed the start of our Mesa work, and we are fully aware that this will mean more work for us, as Mesa is still a 1990s monolith, with developers thinking like the borg. People have to wait longer for us to deliver something directly useful for them. But we will get them performance which generally matches the binary driver, instead of being forced to settle on 50% or so.
TR: I've recently started to look into adding a Gallium driver for Tegra. So far I have something that can run Weston, only with no output whatsoever because the driver is only stubbed-out. However it gave me some more insight into what needs to be done to make it work and what we still need to work on.
The next step will be to make clearing work, because we know how to do that pretty well. So I use a simple OpenGL ES test program that uses GBM and just clears the render buffer and displays it. This can be used to fill in the missing pieces of the Gallium driver. After that we can start thinking about adding 3D support, but that will require generating shader instructions from TGSI, which we can't do yet. Erik has been making a lot of progress in that direction, so I'm pretty sure it will eventually be made to work.
MG: Even if we knew how to enable the HDMI output (which we don't) or knew how to enable the ARM core (which we don't) or enable the 3D hardware (which we don't), we still wouldn't have any C compiler which could be used for writing the actual backend software...
Let's be a bit prospective. ARM-based devices took over the world (ARM says that 16 million ARM processors are sold each day). Yet they still represent a small fraction of the desktop world. [A] Do you think this will change? [B] What impact would your work have on such a change? [C] Finally, how do you envision the future of the SoC GPU market?
RC: Well, in years gone by, I used to assume we'd see arm growing up into desktop/laptop, by way of the netbook. Although the netbook thing never really seemed to take off. (Which makes me a bit sad, I really just can't get into typing that much on a touchscreen or using android for anything serious, but then again I probably don't represent the average consumer.) Maybe there is another shot as the 64bit/armv8 stuff starts materializing. On the other hand, I think the market shifted, so maybe that doesn't matter anymore. For light-duty content-consumption there has been a big shift towards tablets.
However this plays out, in the end I think we will see more SoC devices. And I think it is important that open source community developed graphics and "desktop" stack is able to run on these devices if we ever want to bring proper linux to the mainstream. It will be an interesting journey for folks used to the desktop world where everything is pretty well standardized, compared to tablets/phones where still for the most part every device (or at least every vendor) has their own kernel branch. Although hopefully this will improve over time with the arm cleanup and kernel consolidation work going on, and with more of the SoC vendors caring about upstream. What nvidia is doing w/ tegra drm/kms driver is a great example... I hope more vendors follow their lead.
As far as the future.. well, I have no crystal ball. What I can say is that over the last few generations of SoC, the GPU has been taking up a larger and larger percentage of the area of the chip. I'm not sure where it will peak out, but if the display resolutions keep going up, then that is more pixels to push. And because the GPU is becoming worth more of the total cost of the SoC, it is going to be competitive between the independent IP vendors (IMG/vivante/arm).
On the SoC side, it costs a great deal of money to produce a SoC, and time to market is everything in the mobile space... If you are late to market with your generation N+1 SoC, then you can kiss that investment goodbye. This is why we've already seen some consolidation in the mobile market, and I expect we will see more. Which will leave IMG/vivante/arm fighting for a smaller # of customers. If Apple ever decides to develop their own GPU, then IMG (at least the GPU end of their business) is in some real trouble. Vivante seems to have a nice little niche with freescale and marvell who seem to focus more on the slower paced industrial and automotive markets.
WL: [editor note: Wladimir broke his answer in three parts - each part below relates to the corresponding sub-question.]
[A] People are very bad at prediction. It's often the black swan events that eventually determine the big outcomes. So here is my try, warning: handwaving ahead.
The attraction of ARM is mostly low power usage and low prices. My feeling is a lot of buyers of desktops don't care about that much, at least at the moment. As long as you still see adverts for gaming PCs with 800W power supplies, and expensive, noisy, GPU cards, I think x86 is a good fit for that. Additionally, Intel and AMD have excellent single-threaded performance, which is important for the still mostly non-parallelized, inefficient software that usually runs on it :-)
For laptops I do see an ARM market growing, especially for people that care about long battery usage, those that mostly use the laptop in transit, instead of as a desktop replacement. Also with the continuing economic troubles consumers have less to spend and there's a big low-end segment where ARM will likely get a foothold. On price, Intel cannot compete with SoCs such as i.MX6.
[B] That's very hard to say. I hope my work will help people that are tinkering with SoCs, using them to build useful or creative things. At least one person is building a laptop based on i.MX6 SoC, and it's already possible to use a Cubox or GK802 HDMI stick as a sort-of desktop (and my driver could help there). But impact on the desktop market? nah...
[C] GPUs become general-purpose parallel coprocessors. This has already been happening for a long time, this trend will continue in the embedded space. It will happen because there are many reasons for fast parallel computing on modern embedded devices (for example computer vision tasks in robotics) and they only sometimes involve rendering.
Additionally there is a growing demand by developers for more control and programmability to implement radically different custom rendering. As an example, to do 'race the beam' rendering for low latency applications in augmented reality, see Latency – the sine qua non of AR and VR.
TR: Most ARM SoCs target the mobile, automotive or embedded market, so it's not surprising that not many desktops come with the technology. However more and more SoCs are starting to provide interfaces that would make the chips more useful for desktop usage. There are quite a few vendors that add PCIe or SATA controllers, which makes for some nice use-cases. One such project is CARMA. While this isn't strictly a desktop type of machine, it certainly provides a lot of the peripherals that you'd expect from one. HPC is a market where ARM is very interesting and some interesting projects use these low-power SoCs to build large clusters. There are also companies like Calxeda, which provide servers based on ARM technology.
On the other hand the traditional makers of desktop CPUs are starting to invest into the embedded market as well. Intel has been doing this for a long time and they are almost reaching competitive performance-per-watt now. Both camps have broken out of their traditional markets and are now competing on both fronts.
Both embedded and desktop have also been converging for some time. SoCs now come with high-performance GPUs, while desktop CPUs and GPUs at the same time need less and less power for the same performance. I don't think either one will eventually die out, but the differences certainly become smaller. The embedded and mobile markets will no doubt grow over the next years, but the need for high-performance desktop PCs will not go away. There are some things you just can't do on a mobile device. Reverse-engineering GPUs for instance.
I don't think my work will have much of an influence on those future developments at all, though. My hope is that eventually more vendors will start to provide open-source drivers from the start, because I think that everybody wins that way. If I can contribute in some small way to make this happen, that would be great.
EF: I'm more of a techie than a visionary, so I'll keep my answer short.
The desktop market seems to shrink while the mobile/tablet market grows, both in size and in complexity. I don't think there's any reason this shouldn't continue, so I expect to see ARM-based laptops and similar devices more and more in the future.
As for the other questions, I don't think I'd like to make a prediction. Time will tell ;)
MG: Seriously, I have no idea. Until recently I always thought that the amount of legacy software would make any move away from x86 impossible, but things might have changed due to massive development in the Android world. However, by far most android software isn't usable on desktop computers either, so this certainly will take quite some time.
LV: I think that by now, the others have covered all parts of this question, and whatever i reply would only end up repeating what was said earlier.
Maybe i can provide a soundtrack for this question instead. Google for "The Crystal Method" "Wide open" from their 2004 release "Legion of boom": "I have been informed, that it's totally wide open"
... And we made the future that open :)
[editor note: I googled it for you, so here is a video]
 for those who don't know: doing plumbing in the Linux world refers to the job of setting up the necessary tools to correctly run a full Linux-based system.
 Instruction Set Architecture
 i.e. the status of the whole Videocore project.
 of the Vivante GPUs
 Originally, this was an acronym for Tungsten Graphics Shader Infrastructure; at that time, Tungsten was one of the companies that helped support both mesa and gallium (TGSI is an important part of gallium). Tungsten was then bought by VMware and no longer exists (though some high profile alumni of Tungsten still work for VMware). The acronym now stands for Tokenized Gallium Shader Instructions -- thanks to Connor for the tip
 Ben Brewer is also the creator of the Open GPU Tools project
 Intermediate Representation; the IR is what a compiler uses to represent the program it's compiling. To make things a lot simpler than they are, you can view this as the result of the compilation that is used by the code generator (the compiler backend) to generate the assembly code. What one calls IR may depend on how one uses it. Mesa might consider TGSI as an IR, while the gallium drivers consider it as their source language. Incidentally, the same distinction applies with LLVM: its IR is viewed as a source code language by the compiler backend. Here is what Connor says on the subject: "In case it wasn't clear, TGSI is part of the Gallium3D interface; it allows various state trackers (such as Mesa for OpenGL support, the vega state tracker for OpenVG support, etc.) to pass shaders to the various Gallium3D drivers (radeon, nouveau, softpipe, llvmpipe, freedreno, etc.) in a standard format. Calling it an IR is therefore a little misleading, since it isn't designed to be operated upon internally as part of a compiler; all drivers must translate TGSI into their own internal IR (for lima, we have 3 IRs, called pp_hir, pp_lir, and gp_ir)." For further information on TGSI and IR, you can read this blog post by Zack Rusin
 since you might read this article long after its publication: we're talking about year 2013.
 memory management unit
 the msm android kernel targets Qualcomm-based android devices, including the Nexus One and Nexus 4
 GPU driver for the "classical" NVIDIA GPU - i.e. not the GeForce ULP found on the Tegra SoC
 Quake 3 Arena
 as said above, the interview took a few weeks from its start to its completion.
 see below
 packing helps to reduce register use on the GPU; you can refer to this document and this book sample (pdf) for further information. The book sample is taken from OpenGL ES 2.0 Programming Guide, by Munshi et al.
 the question has been broken in three sub-questions by some of the respondents.
 link courtesy of Wladimir
 link courtesy of Wladimir
 link courtesy of Thierry.