Imagine some product team inside Google wants 100,000 CPU cores + RAM + flash + accelerators + disk in a couple of months. We need to decide where to put them, when; whether to deploy new machines, or re-purpose/reconfigure old ones; ensure we have enough power, cooling, networking, physical racks, data centers and (over a longer time-frame) wind power; cope with variances in delivery times from supply logistics hiccups; do multi-year cost-optimal placement+decisions in the face of literally thousands of different machine configurations; keep track of parts; schedule repairs, upgrades, and installations; and generally make all this happen behind the scenes at minimum cost. And then after breakfast, we get to dynamically allocate resources (on the small-minutes timescale) to the product groups that need them most urgently, accurately reflecting the cost (opex/capex) of all the machines and infrastructure we just deployed, and monitoring and controlling the datacenter power and cooling systems to achieve minimum overheads - even as we replace all of these on the fly. This talk will highlight some of the exciting problems we're working on inside Google to ensure we can supply the needs of an organization that is experiencing (literally) exponential growth in computing capacity.
John Wilkes has been at Google since 2008, where he is working on automation for building warehouse scale computers. Before that, he spent a long time at HP Labs, becoming an HP and ACM Fellow in 2002. He is interested in far too many aspects of distributed systems, but a recurring theme has been technologies that allow systems to manage themselves. In his spare time he continues, stubbornly, trying to learn how to blow glass.