UPenn ESE 3700 · Spring 2026

22nm CMOS Datapath & Memory

Transistor-level design of an 8-bit adder and a 16×4 SRAM array in a 22nm process, covering the datapath and storage halves of a digital system. The work runs from Boolean derivation through Elmore delay modeling to SPICE validation. Both projects are from ESE 3700 at the University of Pennsylvania.

Process: 22nm HP PTM
Supply: VDD = 0.8 V
Tools: Electric VLSI + SPICE

Project 1: 8-bit Ripple-Carry Adder

An 8-bit ripple-carry adder, built and simulated in Electric VLSI. The full adder's outputs are:

S = A ⊕ B ⊕ Cin
Cout = AB + Cin(A ⊕ B)

Both depend on A ⊕ B, which made XOR2 the natural primitive to build around. Compute it once, feed it into both the sum and carry paths. Several alternative sum and carry topologies were explored and rejected along the way; that analysis is in the full report. Once the topology was settled, the interesting question became which XOR2 implementation to use.

Two candidates were carried forward and compared head-to-head at the full 8-bit system level:

Baseline XOR2 cell built from four NAND2 gates
Baseline: four-NAND2 XOR2 (16 T). Every node fully restored, easy to analyze, conservative.
Transistor-level schematic of the optimized transmission-gate XOR2 cell with output buffer
Optimization: transmission-gate XOR2 with two-inverter output buffer (10 T). Same logic, about 38% fewer transistors (10 T vs. 16 T), and a shorter signal path. The buffer restores the TG output to a full rail-to-rail logic level before the next stage.
Full adder bit-slice schematic: NAND2 generates AB-bar, two XOR2 stages compute sum, two NAND2 cells compute carry
Unfortunately the only screenshot of the full adder I had saved is cropped a little.

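The full adder bit-slice combines the two primitives. One NAND2 computes the complement of AB in parallel with the first XOR2 (P = A ⊕ B). That intermediate P feeds both the sum path (a second XOR2 producing S = P ⊕ Cin) and the carry path, where the two right-hand NAND2 cells realize Cout = AB + Cin(A ⊕ B): one forms the complement of P·Cin, and the other NANDs the two complements together, which by De Morgan's law is the OR of AB and P·Cin.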
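The bit-slice wiring can be checked exhaustively with a small gate-level model. This is a minimal sketch, not the Electric schematic itself: the function names are made up for illustration, and the carry wiring (a NAND of two NAND outputs) is my reading of the three-NAND2 arrangement described above.

```python
from itertools import product


def nand2(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def xor2(a, b):
    """Two-input XOR on 0/1 values (stands in for either XOR2 cell)."""
    return a ^ b


def full_adder_slice(a, b, cin):
    """Gate-level model of one bit-slice; returns (S, Cout)."""
    ab_bar = nand2(a, b)          # left NAND2: complement of A*B
    p = xor2(a, b)                # first XOR2: P = A xor B, shared by both paths
    s = xor2(p, cin)              # second XOR2: S = P xor Cin
    pc_bar = nand2(p, cin)        # carry NAND2 #1: complement of P*Cin
    cout = nand2(ab_bar, pc_bar)  # carry NAND2 #2: A*B + P*Cin by De Morgan
    return s, cout


# Compare against integer addition for all eight input combinations.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder_slice(a, b, cin)
    assert 2 * cout + s == a + b + cin
```

Because P is computed once and reused, the model needs only two XOR2 evaluations per slice, which is the same sharing the schematic relies on.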

Sharing P between sum and carry is what makes this topology efficient: the most expensive intermediate is computed once and reused. Seven of these bit-slices, plus a half adder at bit 0, make up the full 8-bit ripple-carry adder.
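A behavioral sketch of the ripple chain shows why 255 + 1 is the worst case: the carry generated at bit 0 has to travel through every later stage. The sketch assumes bit 0 is the half adder and bits 1 through 7 are full-adder slices; the function names are illustrative only.

```python
def half_adder(a, b):
    """Bit 0: no carry-in, so S = A xor B and Cout = A*B."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Bits 1..7: P = A xor B is shared by the sum and carry paths."""
    p = a ^ b
    return p ^ cin, (a & b) | (cin & p)


def ripple_add8(x, y):
    """Add two 8-bit operands bit by bit; returns (8-bit sum, carry out)."""
    s, carry = half_adder(x & 1, y & 1)
    bits = [s]
    for i in range(1, 8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        bits.append(s)
    total = sum(bit << i for i, bit in enumerate(bits))
    return total, carry


# Full carry propagation: every sum bit clears and the carry exits bit 7.
assert ripple_add8(255, 1) == (0, 1)
```

Each stage has to wait for the carry from the stage below it, so the delay of this input pattern grows with the number of slices.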

SPICE Results

Side-by-side SPICE propagation delay comparison: TG-XOR (optimized) vs NAND4-XOR (baseline) across three test cases
Side-by-side SPICE waveforms. Left column: optimized TG-XOR design. Right column: NAND4-XOR baseline. The bottom row is full carry propagation (255 + 1), the worst case. Optimized settles 44 ps earlier.

Baseline Delay: 160 ps
Optimized Delay: 116 ps
Carry-Prop Delay Change: −27%
Baseline Area: 330 T
Optimized Area: 240 T
Leakage Tradeoff: ~10× higher at the worst-case input
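The headline numbers can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions: a static CMOS NAND2 is 4 T and an inverter is 2 T, the adder is seven full-adder slices plus one half adder, and the half adder is built as one XOR2 plus an AND made from a NAND2 and an inverter. That half-adder structure is my guess, not taken from the report. The optimized delay uses the 116 ps worst-case figure quoted with the leakage table.

```python
NAND2_T = 4   # static CMOS NAND2
INV_T = 2     # static CMOS inverter
XOR2_T = {"baseline": 16, "optimized": 10}


def adder_transistors(xor_t, full_slices=7):
    """Transistor count for full_slices full adders plus one half adder."""
    full_adder = 3 * NAND2_T + 2 * xor_t   # three NAND2 cells and two XOR2 cells
    half_adder = xor_t + NAND2_T + INV_T   # assumed: XOR2 + (NAND2 + inverter)
    return full_slices * full_adder + half_adder


counts = {name: adder_transistors(t) for name, t in XOR2_T.items()}
area_saving = 1 - counts["optimized"] / counts["baseline"]
delay_change = (116 - 160) / 160

print(counts)                  # {'baseline': 330, 'optimized': 240}
print(f"{area_saving:.1%}")    # 27.3%
print(f"{delay_change:.1%}")   # -27.5%
```

Under these assumptions the counts land exactly on the reported 330 T and 240 T, and the delay and area reductions both come out near 27%.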

The Lesson: When Elmore Lies

The most useful thing I learned on this project was not the topology exploration; it was what happens when your analytical model disagrees with simulation. Elmore delay predicted that widening the NAND2 PMOS to 2×Wmin would cut pull-up delay roughly in half when driving the TG-XOR2 input. SPICE said the opposite: sizing up actually made carry propagation worse. The first-order RC model captures the resistance drop from a wider PMOS but not the extra capacitance added to internal nodes. I reverted to minimum-sized transistors everywhere. The numbers above come from that version.
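The effect can be reproduced with a toy RC model. This is an illustrative sketch only: every resistance and capacitance is a normalized, hypothetical value chosen to show the trend, not a number extracted from the 22nm PTM models or from the report's Elmore derivation.

```python
def elmore_delays(w_p, r_prev=1.0, r_p=1.0, c_gate=1.0, c_drain=1.0, c_load=1.0):
    """Normalized RC delays for a NAND2 whose PMOS width is w_p (in units of Wmin).

    Returns (naive, with_self_load, path):
      naive          - pull-up resistance scales as 1/w_p, load held fixed
      with_self_load - also counts the wider PMOS drain on the output node
      path           - adds the previous stage driving the larger gate capacitance
    """
    r_pullup = r_p / w_p
    naive = r_pullup * c_load
    with_self_load = r_pullup * (c_load + c_drain * w_p)
    upstream = r_prev * c_gate * (w_p + 1.0)   # PMOS gate (w_p) plus minimum NMOS gate
    return naive, with_self_load, upstream + with_self_load


for w in (1.0, 2.0):
    naive, stage, path = elmore_delays(w)
    print(f"w_p={w}: naive={naive}, stage={stage}, path={path}")
# w_p=1.0: naive=1.0, stage=2.0, path=4.0
# w_p=2.0: naive=0.5, stage=1.5, path=4.5
```

With these made-up values the naive estimate halves, the stage itself improves by only 25%, and the path through the previous stage gets slower. That matches the direction SPICE reported, though the real magnitudes depend on the extracted parasitics.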

The Leakage Tradeoff

The optimized design wins on delay and area, but transmission gates create a partially-conducting path from supply to ground at the worst-case input (A = B = 1). That's a ~10× leakage penalty vs. the fully static NAND-only baseline. At minimum-leakage inputs the two designs are comparable. The choice between them depends on duty cycle: high-throughput, mostly-switching paths want the optimized cell; always-on, low-activity datapaths want the baseline.

Design | Input State | Leakage per Delay Period
Baseline (NAND4-XOR) | A = B = 1 (max) | 0.025 fJ
Baseline (NAND4-XOR) | A = B = 0 (min) | 0.016 fJ
Optimized (TG-XOR) | A = B = 1 (max) | 0.21 fJ
Optimized (TG-XOR) | A = B = 0 (min) | 0.019 fJ

Values scaled to each design's worst-case delay period (160 ps baseline, 116 ps optimized) rather than the raw 100 ps simulation window, so both designs are compared over one full clock cycle of their own.
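The roughly 10× figure can be read two ways from the table: as a ratio of energy per delay period, or as a ratio of average leakage power once each design's own period is divided back out. The sketch below uses only the table values and the two delay periods; the variable and function names are illustrative.

```python
leakage_fj = {
    ("baseline", "max"): 0.025,
    ("baseline", "min"): 0.016,
    ("optimized", "max"): 0.21,
    ("optimized", "min"): 0.019,
}
period_ps = {"baseline": 160.0, "optimized": 116.0}


def avg_leakage_power_uw(design, state):
    """Average leakage power in microwatts (1 fJ per ps equals 1 mW)."""
    return leakage_fj[(design, state)] / period_ps[design] * 1000.0


# Worst-case input: energy per delay period, then power with the period removed.
energy_ratio = leakage_fj[("optimized", "max")] / leakage_fj[("baseline", "max")]
power_ratio = (avg_leakage_power_uw("optimized", "max")
               / avg_leakage_power_uw("baseline", "max"))

# Minimum-leakage input, for comparison.
min_energy_ratio = leakage_fj[("optimized", "min")] / leakage_fj[("baseline", "min")]

print(f"energy per period: {energy_ratio:.1f}x")       # 8.4x
print(f"average power:     {power_ratio:.1f}x")        # 11.6x
print(f"min-state energy:  {min_energy_ratio:.2f}x")   # 1.19x
```

The per-period ratio is smaller than the power ratio because the optimized design's period is shorter, so it integrates its leakage over less time.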

The full write-up covers the design-space exploration (sum options S1 through S4 and carry options C1 through C3), Elmore derivations with extracted capacitance and resistance values, all schematics, and the complete SPICE validation methodology.

Read the full report (PDF, 25 pages) →

Project 2: 16×4 SRAM

A synchronous 16×4 SRAM (16 words of 4 bits) in the same 22nm process, targeting at least 500 MHz operation on a single clock input. The design spans the full memory system: a 6T bit cell, non-overlapping clock generation, row and column decoders, sense amplifiers, column drivers, and a voltage midpoint reference. It is graded on a figure of merit that combines bitcell area, power, and delay squared, so every sub-block has to be sized for both correctness and score.

Top-Level Architecture

The full memory fits on a single schematic. Address inputs A0–A3 are latched and fed through a 4-to-16 row decoder gated by φ2, driving sixteen word-lines into the 16×4 bitcell array. Four column drivers precharge BL/BLb during φ1 and drive writes during φ2; four isolated latch-type sense amps are armed by SAE = delayed(φ2·Wr) and resolve reads onto a per-column bus shared with the write path. A two-phase non-overlapping clock generator takes the single external Clk and produces the two phases that keep precharge and access strictly separated.

Top-level schematic of the 16x4 SRAM: address latches, 4-to-16 decoder, 16x4 bitcell array, four column drivers, four sense amplifiers, two-phase clock generator, and per-column read/write blocks
Top-level schematic. All control signals (PCHb, SAE, WrEn, the inverted Wr) come from the two-phase clock and the Wr fanout network along the bottom of the diagram.

Two design decisions drive the rest of the implementation. First, PCHb is wired directly to φ1, which forces precharge to finish before φ2 can rise. The non-overlap gap is what guarantees the column drivers and word-lines never fight each other. Second, the sense amp is an isolated latch-type pair armed by a φ2-derived SAE, so it only fires once the bitline differential has developed past its offset. Every other block exists to feed these two.
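The non-overlap guarantee can be illustrated with a discrete-time model of a cross-coupled NOR clock generator. This is a sketch under assumptions: the page does not say which generator topology was used, so the cross-coupled NOR structure, the delay of two time steps, and the mapping of Clk low to φ1 are all hypothetical.

```python
from collections import deque


def two_phase(clk, delay=2):
    """Return a list of (phi1, phi2) samples for a sampled clock waveform.

    Each phase is a NOR of the clock (or its inverse) with a delayed copy of
    the other phase, so a phase can rise only after the other has been low
    for `delay` samples.
    """
    d1 = deque([0] * delay, maxlen=delay)   # history of phi1
    d2 = deque([0] * delay, maxlen=delay)   # history of phi2
    out = []
    for c in clk:
        phi1_d, phi2_d = d1[0], d2[0]       # values from `delay` samples ago
        phi1 = int(not (c or phi2_d))       # NOR(Clk, delayed phi2)
        phi2 = int(not ((1 - c) or phi1_d)) # NOR(not Clk, delayed phi1)
        d1.append(phi1)
        d2.append(phi2)
        out.append((phi1, phi2))
    return out


clk = ([1] * 6 + [0] * 6) * 2
phases = two_phase(clk)

# The two phases are never high together.
assert all(not (p1 and p2) for p1, p2 in phases)
# After Clk falls at sample 6, both phases stay low for two samples.
assert phases[6:9] == [(0, 0), (0, 0), (1, 0)]
```

The dead time equals the feedback delay, so in a real circuit the gap is set by the delay elements in the cross-coupling path.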

One Full Clock Cycle

The easiest way to see the whole design work at once is to overlay every relevant signal on one clock period. I ran characterization at 320 ps (3.125 GHz), which is a deliberately safe choice: about 7% above the ~298 ps absolute minimum found by binary search further down, so no waveform is sitting right on its own failure boundary.

One 320 ps clock cycle with phi1, phi2, PCHb, WL1, BL0/BLb0, SAE, and the sense-amp output Q/D0 overlaid, showing the four phases: precharge, non-overlap, access pre-SAE, and access post-SAE
One 320 ps cycle, split into four phases. During precharge (roughly 0–150 ps), PCHb is low and BL/BLb are pulled back up to VDD. A short non-overlap gap separates precharge from access. When φ2 rises, WL fires, the bitlines start to develop a differential, and a delayed SAE arms the sense amp only after that differential is past 50 mV. Q/D resolves cleanly well before the next precharge begins.

Functional Verification

Correctness was validated bottom-up, from the bitcell up to the full array. The most interesting test is mainTest2: two different 4-bit words written to two different addresses, then read back. This checks that the decoder routes the correct row and that storing one word does not disturb another.

mainTest2 SPICE waveforms: 1001 written to address 0000, 0110 written to address 1111, then both read back across eight nanoseconds
mainTest2: 1001 → address 0000, then 0110 → address 1111, then read both back. Each read matches the written word exactly; no cross-talk between rows.
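The same test can be written against a purely behavioral model, which is a handy golden reference when checking SPICE read-back values. This is a minimal sketch with hypothetical class and method names; it models only the one-hot row decode and the stored words, not timing, bitlines, or sense amps.

```python
class Sram16x4:
    """Behavioral 16-word by 4-bit memory with a one-hot row decoder."""

    def __init__(self):
        self.rows = [0] * 16

    @staticmethod
    def decode(addr):
        """4-to-16 decoder: exactly one word-line is high."""
        if not 0 <= addr < 16:
            raise ValueError("address must be in 0..15")
        return [int(i == addr) for i in range(16)]

    def write(self, addr, data):
        """Store the low four bits of data in the selected row only."""
        for row, selected in enumerate(self.decode(addr)):
            if selected:
                self.rows[row] = data & 0xF

    def read(self, addr):
        """Return the word in the selected row."""
        return next(self.rows[row]
                    for row, selected in enumerate(self.decode(addr))
                    if selected)


mem = Sram16x4()
mem.write(0b0000, 0b1001)
mem.write(0b1111, 0b0110)
assert mem.read(0b0000) == 0b1001   # first word survives the second write
assert mem.read(0b1111) == 0b0110
```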

How Fast Can It Go?

Minimum operating period was found with a binary search: shrink the clock period until reads stop matching the values that were written, then bracket the break point.
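The search loop itself is short. In the sketch below, `passes` is a hypothetical stand-in for "run the SPICE testbench at this period and compare read-back values to what was written". The bounds and tolerance are illustrative, and the search assumes that every period longer than the break point passes.

```python
def find_tmin(passes, lo_ps, hi_ps, tol_ps=1.0):
    """Bracket the smallest passing clock period by bisection.

    passes: callable taking a period in ps and returning True if reads match.
    lo_ps:  a period known to fail.
    hi_ps:  a period known to pass.
    Returns the passing end of the final bracket, at most tol_ps above the
    true break point.
    """
    if passes(lo_ps):
        raise ValueError("lo_ps must be a failing period")
    if not passes(hi_ps):
        raise ValueError("hi_ps must be a passing period")
    while hi_ps - lo_ps > tol_ps:
        mid = (lo_ps + hi_ps) / 2
        if passes(mid):
            hi_ps = mid   # still works: tighten from above
        else:
            lo_ps = mid   # broken: tighten from below
    return hi_ps


# Stand-in predicate with a made-up break point, used only to exercise the loop.
def fake_passes(period_ps):
    return period_ps >= 298.0


t_min = find_tmin(fake_passes, lo_ps=100.0, hi_ps=1000.0)
assert 298.0 <= t_min <= 299.0
```

Each iteration costs one full simulation, so starting from a 900 ps bracket it takes about ten runs to reach 1 ps resolution.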

Clock-period binary search: left panel plots read-back voltage vs clock period and shows it transitioning through VDD/2 near 298 ps; right panel shows pass/fail bars across frequencies
Binary search for Tmin. The read-back voltage on a known-0 cycle sits cleanly at 0 V for long periods and snaps up to VDD for short ones; the transition crosses the VDD/2 decision threshold near 298 ps, setting fmax ≈ 3.36 GHz, well above the project's 500 MHz requirement. The 320 ps operating point used for the earlier waveforms and for the power measurement sits about 7% above this break point, so the reported numbers are not riding the edge of correctness.
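The period, frequency, and margin figures quoted here follow directly from the two periods. A short sketch of the arithmetic (variable names are illustrative):

```python
def freq_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


t_min_ps = 298.0     # break point from the binary search
t_op_ps = 320.0      # characterization operating point
f_target_ghz = 0.5   # project requirement (500 MHz)

f_max = freq_ghz(t_min_ps)
f_op = freq_ghz(t_op_ps)
margin = t_op_ps / t_min_ps - 1
headroom = f_max / f_target_ghz

print(f"fmax     = {f_max:.2f} GHz")   # fmax     = 3.36 GHz
print(f"f_op     = {f_op:.3f} GHz")    # f_op     = 3.125 GHz
print(f"margin   = {margin:.1%}")      # margin   = 7.4%
print(f"headroom = {headroom:.1f}x")   # headroom = 6.7x
```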

Where the Limit Is, and How I'd Push Past It

At Tmin the read path still finishes in roughly 92 ps. The write path is what gives out first. Looking back at the one-cycle plot, roughly half of every period is spent pulling BL and BLb back up to VDD during precharge. That long φ1 is what squeezes the access budget and pins Tmin where it is.

The immediate next step would be to upsize the precharge PMOS pair in the column driver (currently W = 2). A wider device pulls the bitlines up faster, shortens φ1, and redistributes the freed-up time to the access phase. That translates into a direct drop in Tmin without touching any of the more sensitive blocks. I didn't have time to sweep the PMOS width and re-run the search, but it's the one lever I'd pull first.

Summary of Metrics

Process: 22nm PTM HP
Supply VDD: 0.8 V
Array Size: 16 × 4
Minimum Clock Period Tmin: 298 ps
Maximum Frequency fmax: 3.36 GHz
Worst-case Access Delay: 91.9 ps
Average Power: 54.74 μW
Bitcell Area (Σ widths): 0.308 μm

This page is the highlight reel. The full report goes through every sub-block in detail: the 6T bitcell sizing, the two-phase clock circuit, the 4-to-16 decoder, the column driver, the isolated latch-type sense amp, the RdWr I/O block, the six timing invariants that guarantee correct operation, per-stage critical-path tables for both read and write, and the full figure-of-merit derivation.

Read the full report (PDF, 18 pages) →

Both of these projects are finished for now, but it would be interesting to combine them. The 8-bit ripple-carry adder from Project 1 and the 16×4 SRAM from Project 2 already cover the datapath and storage halves of a digital system, and wrapping them together with a small controller and a few logical operations would turn them into a simple ALU. That is a possible follow-on if I come back to this.