Skyline Panel
The SKYLINE workshop at ISCA 2026 assembled a panel of researchers: Amir from Google DeepMind, Joseph Torellas from UIUC, Jovan Stojkovic from UT Austin and Meta, Chaojie Zhang from Microsoft Azure Research, and Benjamin C. Lee from UPenn and Google. They discussed the evolving landscape of cloud-native architectures, agentic AI, and the fundamental redesign of systems from the ground up. Below is my transcript of the panel.
Amir: Self-refinement is something we really need to talk about. And just to put it out there, Bayesian inference on top of LLM-generated templates gives us real performance gains. But let me close my opening with a reflection on life: I think we sometimes forget the human side of all this technical progress.
Joseph: Can we circle back to something practical? The cost-to-score ratio for feedback is a hot topic, but I don't think we're paying enough attention to it.
Amir: Agreed. And while we’re at it, programmability is another hot topic. Do we even need human programmability anymore? That’s the real question.
Amir: Here's something I've been turning over in my mind: what's the Move 37 in systems and architecture? In other words, what's the counterintuitive, paradigm-shifting move that no human would think of, like AlphaGo's Move 37 against Lee Sedol?
Chaojie: I want to tie this back to user experience and expectations. Those directly shape how we design services, and we can’t lose sight of that in our architectural decisions.
Benjamin: Let me jump in on the hardware side. CPU microarchitecture is going to be interesting again. I think SPEC benchmarks are no longer necessarily the right target. Data movement, the way data moves back and forth, is going to be the really interesting problem.
Joseph: And speaking of data movement, edge AI and data center capacity need to go up dramatically. Conventional 40 MW facilities won't cut it if we're stuffing GPUs in there.
Amir: That ties into what I'm seeing with agentic workloads. They are not static; they build their own tools on the fly. How do we even analyze workloads that are constantly evolving?
Joseph: Characterization work is absolutely necessary to understand agentic workloads. In a very dynamic environment, there's a lot of inter-core communication. My view is that we need to push for simpler GPUs, hide GPU execution behind CPU work, and integrate GPUs and CPUs more deeply. As for data center design, I'd put less emphasis on raw GPU power and more on power-efficient GPUs, with a lot of work on scheduling and dynamic data movement.
Amir: But that brings up a huge problem. Over-provisioning is much harder now. Once you start running agents, they can spawn arbitrary sub-agents. How do you even provision for this kind of unpredictable demand?
Jovan: That raises a fundamental question: how do we characterize these workloads empirically? We need to account for different models, different tools, all the variables.
Joseph: Benchmarks will appear in a few years as people begin to understand things better. We’re still early.
Amir: I actually disagree about benchmark availability. There may not be many performance benchmarks, but we have SWE-bench, HumanEval, and others. We should start with those, characterize their performance, and begin there.
Jovan: I have to push back on that. We need standardized and reproducible benchmarks. How do you characterize performance with model non-determinism? It’s a huge confounding factor.
Benjamin: This is where we have to go back to basics: confidence intervals. We need to repeat each experiment many times and measure properly.
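As an illustration of the confidence-interval point (a sketch of my own, not something shown by the panel), the snippet below summarizes repeated runs of a non-deterministic workload as a mean plus a two-sided interval. It assumes independent runs and uses a normal approximation via Python's standard `statistics` module; the latency numbers are hypothetical, and `mean_ci` is a hypothetical helper name.

```python
import statistics
from statistics import NormalDist


def mean_ci(samples: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval.

    Assumption: runs are independent. With only a handful of runs, a
    Student-t critical value would give a wider, more honest interval.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(samples)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    # Two-sided critical value (about 1.96 for 95 percent confidence).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical end-to-end latencies (seconds) from repeated runs of the same
# agentic task; illustrative numbers, not measurements.
runs = [41.2, 38.7, 45.9, 40.1, 52.3, 39.8, 43.5, 47.0]
mean, lo, hi = mean_ci(runs)
print(f"mean {mean:.1f}s, 95% CI [{lo:.1f}, {hi:.1f}]")
```

Reporting the interval rather than a single run makes it visible when two configurations differ by less than the run-to-run noise introduced by model non-determinism. For small run counts, swapping in a Student-t critical value (for example from `scipy.stats.t.ppf`) is the more conservative choice.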
Amir: And we need to profile tools like Claude Code: their I/O patterns, their data movement, everything.
Benjamin: One thing I'm certain about: time to first token is going to be a first-class requirement in the coming years. It's not optional anymore.
Audience member: What about hybrid workflows? How do you see them influencing data patterns? How will local and cloud models cooperate?
Benjamin: We're experimenting with local models. People are worried about token costs, so they use open models for most tasks and SOTA models occasionally. Your local models are fast, but the extra 50 to 100 milliseconds of calling remote SOTA models will be very noticeable. That latency gap is going to shape everything.
Amir: Edge AI and data centers are both critically important. I see data centers going the way of enterprise infrastructure. Edge will become essential for vision-language-action (VLA) robotics; that's where the real action will be.
Benjamin: Historically, we’ve been really bad at getting people to think about what resources they actually need. We don’t know how to reason about model capability for specific tasks. The big question is, how do we route relevant requests to the right models?
Joseph: This is exactly why larger data centers with many users make sense, because with enough users, everyone ends up over-provisioned anyway. This problem is growing by an order of magnitude.
Amir: We simply need more data centers. There's no way around it.
Audience member: Are we heading toward a WALL-E world? What are your thoughts on that?
Benjamin: I’ll offer a contrarian view. There’s a lot of hype around gigascale data centers, but they are primarily useful for training. I don’t know how much more training versus reasoning versus inference we actually need. The goalpost is going to keep shifting. I think we’ll have a handful of gigascale data centers and everything else will be smaller and more manageable.
Amir: You just described an apocalypse scenario! But I think humans are going to be the limiting factor. I don’t think it’s going to be a disaster. There is definitely a lot of demand, but I’m a very positive person. There will be a shift in how humans work and interact, but I don’t think it will be that dark.
Amir: On a different note, tools definitely need to change. The tools that exist today were created for human cognition and understanding. We need to redesign tools for LLMs, which requires a completely different interface and interaction model. How should tools be designed, and how should they present their results to LLMs? We've added a lot of overhead to make things easier for humans to understand, but I don't think LLMs are very good at validating their solutions. We don't need strict abstractions geared toward human design.
Joseph: That really is the Google view, isn’t it?
Amir: But the old infrastructure will still be useful for other things. In five years everything will be different, and yet it won't be. The reality is that most of what we do is heuristics for local optimization. Humans can't do global optimization because we can't think about everything at once. LLMs may not have that limitation. Can we break down the barriers between different optimizations and co-optimize multiple things together?
Benjamin: Look at Claude in Chrome: it's so inefficient. It has to use hacks and debugging tools just to do what it needs to do. Maybe we need to go back and redesign applications from the ground up for that reality.
Jovan: What should we watch out for with agentic hardware co-design?
Amir: Simulation takes a long time, and we're dealing with much more open-ended problems. We're not optimizing one specific metric; we're looking at tradeoffs. Long evaluation times are incredibly challenging. For systems problems, IPC isn't the only thing that matters. There are so many different signals to consider.
Joseph: Accuracy is hard. We need hardware that is 100 percent correct, and using LLMs for hardware design introduces a lot of bugs.
Amir: Verification is really, really challenging.
Jovan: Let me ask the panel for some advice for junior researchers.
Benjamin: The bar for acceptance at research conferences needs to go higher. We need more fundamental, interesting, and groundbreaking work. What questions can we ask that an AI can't answer given enough time? We need to get people thinking about system organization rather than just building another widget. Avoid questions that a huge crowd is already trying to answer.
Joseph: Every ten years brings different challenges. That’s just the nature of this field.
Chaojie: And delegate appropriately. Know what to hand off and what to own.
Amir: Should we know the fundamentals or not? This is the paradox we face. I think this is the best time to know the fundamentals. Microarchitecture is a very old field, but defining research questions is getting much harder. How do we teach students to use LLMs productively? That's the real challenge we need to solve.