UPenn ESE 3700 · Spring 2026

22nm CMOS Datapath & Memory

Transistor-level design of an 8-bit adder and a 16×4 SRAM (static random-access memory) array in the 22nm process provided by the course, covering the datapath and storage halves of a digital system. The work runs from Boolean derivation through Elmore delay modeling to SPICE (Simulation Program with Integrated Circuit Emphasis) validation. Both projects are from ESE 3700 at the University of Pennsylvania.

Process

22nm HP PTM

Supply

V_DD = 0.8 V

Tools

Electric VLSI + SPICE

Project 1: 8-bit Ripple-Carry Adder

An 8-bit ripple-carry adder, built and simulated in Electric VLSI (very-large-scale integration). The complete datapath is eight bit-slices chained carry-to-carry: a half adder at bit 0 feeding seven full adders, each stage's C_out rippling up into the next stage's C_in.

The complete 8-bit ripple-carry adder. A half adder at bit 0 cascaded with seven full-adder bit-slices; each C_out feeds the next stage's C_in. On the right, the assembled cell packaged as a single 8-bit adder symbol.

Full Adder Bit-Slice

Every bit-slice above bit 0 is a full adder. Its two outputs are:

S = A ⊕ B ⊕ Cin
Cout = AB + Cin(A ⊕ B)

Both depend on A ⊕ B, which made XOR2 (a two-input exclusive-OR gate) the natural primitive to build around: compute it once, then feed it into both the sum and carry paths.

Full adder bit-slice schematic: NAND2 generates AB-bar, two XOR2 stages compute sum, two NAND2 cells compute carry

Unfortunately the only screenshot of the full adder I had saved is cropped a little.

The full adder bit-slice combines the two primitives. One NAND2 (a two-input NAND gate) computes the complement of AB in parallel with the first XOR2 (P = A ⊕ B). That intermediate P feeds both the sum path (a second XOR2 producing S = P ⊕ C_in) and the carry path, where the two right-hand NAND2 cells combine P, C_in, and the complemented AB to realize C_out = AB + C_in(A ⊕ B).

Sharing P between sum and carry is what makes this topology efficient: the most expensive intermediate is computed once and reused. Seven of these bit-slices, chained above the half adder at bit 0, make up the adder above.
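The bit-slice logic is easy to sanity-check in software before trusting it at transistor level. The following is a minimal behavioral sketch in Python, not the Electric schematic: the function names are mine, and the exact wiring of the two carry NAND2 cells (NAND of P and C_in, then NAND with the complemented AB) is my reading of the description above rather than something taken from the report.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def full_adder_slice(a: int, b: int, cin: int) -> tuple[int, int]:
    """One bit-slice: P = A xor B is computed once and shared by sum and carry."""
    p = a ^ b                      # first XOR2
    s = p ^ cin                    # second XOR2: S = P xor Cin
    ab_bar = nand(a, b)            # NAND2 in parallel with the first XOR2
    pc_bar = nand(p, cin)          # first carry NAND2 (assumed wiring)
    cout = nand(ab_bar, pc_bar)    # second carry NAND2: AB + Cin*P by De Morgan
    return s, cout


def half_adder(a: int, b: int) -> tuple[int, int]:
    """Bit 0 has no carry-in, so a half adder is enough."""
    return a ^ b, a & b


def ripple_add8(x: int, y: int) -> tuple[int, int]:
    """8-bit ripple-carry add: half adder at bit 0, seven full-adder slices above it."""
    total, carry = half_adder(x & 1, y & 1)
    for i in range(1, 8):
        s, carry = full_adder_slice((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


# Exhaustive check against integer addition for all 65,536 input pairs.
all_ok = all(
    ripple_add8(x, y) == ((x + y) & 0xFF, (x + y) >> 8)
    for x in range(256)
    for y in range(256)
)
print(all_ok, ripple_add8(255, 1))
```

The 255 + 1 case returns a zero sum with the carry-out set, which is the input pair that forces the carry through every slice.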

XOR2 Implementation

That bit-slice leans on an XOR2, so the last decision was which XOR2 to actually build it from. (Several alternative sum and carry topologies were explored and rejected along the way; that analysis is in the full report.) Two candidates were carried forward and compared head-to-head at the full 8-bit system level:

Baseline XOR2 cell built from four NAND2 gates

Baseline: four-NAND2 XOR2, sixteen transistors (16 T). Every node fully restored, easy to analyze, conservative.

Transistor-level schematic of the optimized transmission-gate XOR2 cell with output buffer

Optimization: transmission-gate (TG) XOR2 with two-inverter output buffer (10 T). Same logic, six fewer transistors per cell (about 38%), and a shorter signal path. The buffer restores the TG's degraded output voltage to a full rail-to-rail logic level before the next stage.

SPICE Results

Both designs were simulated across three transitions: an all-zeros to all-ones rising edge, the reverse falling edge, and a full carry ripple (255 + 1) that forces a carry to propagate through all eight bit-slices. That last case is the worst case for any ripple-carry adder, and it is the one that sets the maximum operating frequency.

Side-by-side SPICE propagation delay comparison: TG-XOR (optimized) vs NAND4-XOR (baseline) across three test cases

Side-by-side SPICE waveforms. Left column: optimized TG-XOR design. Right column: NAND4-XOR baseline. The bottom row is full carry propagation (255 + 1), the worst case, where the optimized design settles 44 ps earlier.

The two cells trade wins by transition type. On the rising edge the TG-XOR is actually slower, 60.8 ps against the baseline's 38.0 ps, because the transmission gate passes a weak high that the output buffer then has to restore. On the falling edge the two are within 1.4 ps of each other. The result that matters is the bottom row: on full carry propagation the TG-XOR settles in 116.3 ps versus the baseline's 160.4 ps, a 44.1 ps (27%) reduction. Because carry propagation sets the adder's maximum operating frequency, winning that worst case is what makes the optimized cell the better choice, even though it gives up ground on the non-critical rising transition.
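The arithmetic behind those comparisons is short enough to spell out. This sketch only reuses the delays quoted above; the frequency ceiling is an illustrative upper bound that assumes the clock period equals the carry-propagation delay, with no register setup time, clock-to-Q delay, or margin, which a real datapath would have to add.

```python
# SPICE delays quoted above, in picoseconds.
BASELINE_CARRY_PS = 160.4
OPTIMIZED_CARRY_PS = 116.3
BASELINE_RISE_PS = 38.0
OPTIMIZED_RISE_PS = 60.8

saved_ps = BASELINE_CARRY_PS - OPTIMIZED_CARRY_PS
reduction_pct = 100.0 * saved_ps / BASELINE_CARRY_PS
rise_penalty_ps = OPTIMIZED_RISE_PS - BASELINE_RISE_PS


def ceiling_ghz(delay_ps: float) -> float:
    """Upper bound on clock rate if the period were exactly the worst-case delay."""
    return 1000.0 / delay_ps  # 1000 ps per ns, so ps -> GHz


print(f"carry propagation: {saved_ps:.1f} ps saved ({reduction_pct:.1f}% lower)")
print(f"rising edge: optimized is {rise_penalty_ps:.1f} ps slower")
print(f"ceiling: {ceiling_ghz(BASELINE_CARRY_PS):.2f} GHz baseline, "
      f"{ceiling_ghz(OPTIMIZED_CARRY_PS):.2f} GHz optimized")
```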

Baseline Delay

160 ps

Optimized Delay

116 ps

Carry-Prop Delay Change

−27%

Baseline Area

330 T

Optimized Area

240 T

Leakage Tradeoff

10×
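The 330 T and 240 T area figures above can be reproduced with a simple transistor count. This is a sketch under stated assumptions: a NAND2 is four transistors, an inverter is two, each full adder is two XOR2 cells plus three NAND2 cells (as described in the bit-slice section), and the half adder is one XOR2 plus an AND built from a NAND2 and an inverter. The half-adder composition is my assumption, not something stated on this page; it is the composition that makes the totals match.

```python
NAND2_T = 4   # transistors in a static CMOS NAND2
INV_T = 2     # transistors in a static CMOS inverter


def adder_transistors(xor2_t: int) -> int:
    """Total transistor count of the 8-bit adder for a given XOR2 cell size."""
    full_adder = 2 * xor2_t + 3 * NAND2_T        # two XOR2 + three NAND2
    half_adder = xor2_t + NAND2_T + INV_T        # XOR2 + AND (assumed composition)
    return half_adder + 7 * full_adder


baseline_t = adder_transistors(16)    # four-NAND2 XOR2
optimized_t = adder_transistors(10)   # TG XOR2 with output buffer
saving_pct = 100.0 * (baseline_t - optimized_t) / baseline_t

print(baseline_t, optimized_t, f"{saving_pct:.1f}% fewer transistors")
```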

The Lesson: When Elmore Lies

The most useful thing I learned on this project was not the topology exploration but what happens when an analytical model disagrees with simulation. Elmore delay predicted that widening the NAND2 PMOS (p-channel transistor) to 2×W_min would cut pull-up delay roughly in half when driving the TG-XOR2 input. SPICE said the opposite: sizing up actually made carry propagation worse. The first-order RC (resistance-capacitance) model captures the resistance drop from a wider PMOS but not the extra capacitance added to internal nodes. I reverted to minimum-sized transistors everywhere. The numbers above come from that version.
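A toy calculation shows one plausible way this can happen. Everything numeric below is hypothetical and chosen only to illustrate the effect; none of it is extracted from the report or from the 22nm models. The sketch treats one bit of the carry chain as a falling NAND2 stage (through the NMOS stack) followed by a rising NAND2 stage (through the PMOS), and compares a naive estimate that holds the load fixed against one where the wider PMOS also adds gate and drain capacitance.

```python
# All values are hypothetical. kOhm * fF = ps, so products below are in ps.
R_N_STACK_KOHM = 20.0   # two series NMOS pulling a NAND2 output low
R_P_MIN_KOHM = 20.0     # one minimum-width PMOS pulling a NAND2 output high
C_GATE_FF = 0.1         # gate capacitance per unit of transistor width
C_DRAIN_FF = 0.1        # drain capacitance per unit of transistor width
C_XOR_IN_FF = 0.2       # fixed XOR2 input load on the carry node


def nand2_caps(wp: float) -> tuple[float, float]:
    """Input and self-load capacitance of a NAND2 with PMOS width wp and NMOS width 1."""
    c_in = C_GATE_FF * (1.0 + wp)             # one NMOS gate + one PMOS gate per input
    c_self = C_DRAIN_FF * (1.0 + 2.0 * wp)    # one NMOS drain + two PMOS drains at the output
    return c_in, c_self


def naive_pullup_ps(wp: float) -> float:
    """Pull-up RC with the load frozen at its minimum-size value."""
    c_in, c_self = nand2_caps(1.0)
    return (R_P_MIN_KOHM / wp) * (c_self + c_in + C_XOR_IN_FF)


def per_bit_carry_ps(wp: float) -> float:
    """Falling stage plus rising stage, with capacitance that grows with wp."""
    c_in, c_self = nand2_caps(wp)
    falling = R_N_STACK_KOHM * (c_self + c_in)                     # NMOS unchanged, load heavier
    rising = (R_P_MIN_KOHM / wp) * (c_self + c_in + C_XOR_IN_FF)   # PMOS stronger, load heavier
    return falling + rising


for wp in (1.0, 2.0):
    print(f"wp={wp}: naive pull-up {naive_pullup_ps(wp):.1f} ps, "
          f"per-bit carry {per_bit_carry_ps(wp):.1f} ps")
```

With these made-up numbers the naive pull-up estimate halves, while the per-bit carry delay goes from about 24 ps to about 26 ps, because the falling stage sees more capacitance without getting any stronger.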

The Leakage Tradeoff

The optimized design wins on delay and area, but transmission gates create a partially conducting path from supply to ground at the worst-case input (A = B = 1). That's a ~10× leakage penalty vs. the fully static NAND-only baseline. At minimum-leakage inputs the two designs are comparable. The choice between them depends on duty cycle: high-throughput, mostly-switching paths want the optimized cell; always-on, low-activity datapaths want the baseline.

Design	Input State	Leakage Energy per Delay Period
Baseline (NAND4-XOR)	A = B = 1 (max)	0.025 fJ
Baseline (NAND4-XOR)	A = B = 0 (min)	0.016 fJ
Optimized (TG-XOR)	A = B = 1 (max)	0.21 fJ
Optimized (TG-XOR)	A = B = 0 (min)	0.019 fJ

Values scaled to each design's worst-case delay period (160 ps baseline, 116 ps optimized) rather than the raw 100 ps simulation window, so both designs are compared over one full clock cycle of their own.
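Because each row is an energy over that design's own delay period, the size of the leakage penalty depends on whether it is read as energy per cycle or as average power. The sketch below computes both from the table values and the two delay periods in the note; the variable and function names are mine.

```python
# (leakage energy per delay period in fJ, delay period in ps), from the table and note above.
BASELINE_MAX = (0.025, 160.0)
BASELINE_MIN = (0.016, 160.0)
OPTIMIZED_MAX = (0.21, 116.0)
OPTIMIZED_MIN = (0.019, 116.0)


def avg_power_uw(energy_fj: float, period_ps: float) -> float:
    """Average leakage power: 1 fJ per ps is 1 mW, i.e. 1000 uW."""
    return 1000.0 * energy_fj / period_ps


energy_ratio_max = OPTIMIZED_MAX[0] / BASELINE_MAX[0]
power_ratio_max = avg_power_uw(*OPTIMIZED_MAX) / avg_power_uw(*BASELINE_MAX)
energy_ratio_min = OPTIMIZED_MIN[0] / BASELINE_MIN[0]

print(f"worst-case input: {energy_ratio_max:.1f}x energy per period, "
      f"{power_ratio_max:.1f}x average power")
print(f"best-case input: {energy_ratio_min:.2f}x energy per period")
```

At the worst-case input the ratio is 8.4× in energy per period and about 11.6× in average power, so the "~10×" headline sits between the two readings.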

The full write-up covers the design-space exploration (sum options S1 through S4 and carry options C1 through C3), Elmore derivations with extracted capacitance and resistance values, all schematics, and the complete SPICE validation methodology.

Read the full report (PDF, 25 pages) →

Project 2: 16×4 SRAM

A synchronous 16×4 SRAM (16 words of 4 bits) in the same 22nm process, targeting at least 500 MHz operation on a single clock input. The design spans the full memory system: a 6T (six-transistor) bit cell, non-overlapping clock generation, row and column decoders, sense amplifiers, column drivers, and a voltage midpoint reference. It is graded on a figure of merit that combines bitcell area, power, and delay squared, so every sub-block has to be sized for both correctness and score.

Top-Level Architecture

The full memory fits on a single schematic. Address inputs A₀–A₃ are latched and fed through a 4-to-16 row decoder gated by φ₂, driving sixteen word-lines (WL) into the 16×4 bitcell array. Four column drivers precharge BL/BLb (the bit-line pair) during φ₁ and drive writes during φ₂; four isolated latch-type sense amps are armed by SAE (sense-amp enable), a delayed combination of φ₂ and the write signal Wr, and resolve reads onto a per-column bus shared with the write path. A two-phase non-overlapping clock generator takes the single external clock (Clk) and produces the two phases that keep precharge and access strictly separated.

Top-level schematic of the 16x4 SRAM: address latches, 4-to-16 decoder, 16x4 bitcell array, four column drivers, four sense amplifiers, two-phase clock generator, and per-column read/write blocks

Top-level schematic. All control signals (PCHb, the precharge line; SAE; WrEn, the write enable; and the inverted Wr) come from the two-phase clock and the Wr fanout network along the bottom of the diagram.

Two design decisions drive the rest of the implementation. First, PCHb is wired directly to φ₁, which forces precharge to finish before φ₂ can rise. The non-overlap gap is what guarantees the column drivers and word-lines never fight each other. Second, the sense amp is an isolated latch-type pair armed by a φ₂-derived SAE, so it only fires once the bitline differential has developed past its offset. Every other block exists to feed these two.
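To make the non-overlap idea concrete, here is a small unit-step model of a textbook two-phase generator: two cross-coupled NOR gates, each seeing a delayed copy of the other's output. This is an assumption for illustration only. The actual clock circuit is documented in the report, and this sketch uses active-high phases without trying to match the polarity of PCHb or the real gate delays.

```python
from collections import deque


def two_phase(clk_wave: list[int], delay: int) -> list[tuple[int, int]]:
    """Unit-step model of a cross-coupled NOR non-overlapping clock generator."""
    phi1, phi2 = 1, 0                            # steady state for clk = 0
    fb1 = deque([phi1] * delay, maxlen=delay)    # delay line carrying phi1
    fb2 = deque([phi2] * delay, maxlen=delay)    # delay line carrying phi2
    phases = []
    for clk in clk_wave:
        old1, old2 = fb1[0], fb2[0]              # outputs from `delay` steps ago
        phi1 = int(not (clk or old2))            # phi1 = NOR(clk, delayed phi2)
        phi2 = int(not ((not clk) or old1))      # phi2 = NOR(not clk, delayed phi1)
        fb1.append(phi1)
        fb2.append(phi2)
        phases.append((phi1, phi2))
    return phases


clk = ([0] * 6 + [1] * 6) * 3                    # three clock periods, 12 steps each
phases = two_phase(clk, delay=2)
overlap = any(p1 and p2 for p1, p2 in phases)
gap_steps = sum(1 for p1, p2 in phases if not p1 and not p2)
print("overlap:", overlap, "| steps with both phases low:", gap_steps)
```

Each clock edge turns one phase off immediately and holds the other off until the feedback delay has elapsed, so the length of the gap is set by the delay line rather than by the external clock.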

6T Bitcell

The whole array is a sixteen-by-four tiling of one cell. Two cross-coupled inverters hold the bit; two word-line-gated access transistors connect it to BL/BLb. The 4:2:1 PD:AX:PU (pull-down : access : pull-up) width ratio is what lets a read happen without flipping the stored value while a write can still overpower it. The report sizes and walks through every block built around it.

6T SRAM bitcell schematic: two cross-coupled inverters storing the bit, two word-line-gated access transistors connecting to BL and BLb, with the packaged Bitcell symbol on the right

The 6T bitcell: a cross-coupled inverter pair for storage plus two word-line-gated access transistors onto BL/BLb. Width ratio PD:AX:PU = 4:2:1 (88 nm : 44 nm : 22 nm). On the right, the cell packaged as the symbol tiled across the array.
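The sizing numbers can be checked with a few lines of arithmetic. This sketch uses the three widths from the caption above; "cell ratio" (pull-down over access) and "pull-up ratio" (pull-up over access) are the usual textbook names for the two quantities that govern read stability and writability, and the summed width is the bitcell area metric reported in the summary table.

```python
# Transistor widths from the caption, in nanometres.
PD_NM = 88   # pull-down NMOS
AX_NM = 44   # access NMOS
PU_NM = 22   # pull-up PMOS

# A 6T cell has two of each device.
sum_widths_nm = 2 * (PD_NM + AX_NM + PU_NM)
sum_widths_um = sum_widths_nm / 1000.0

cell_ratio = PD_NM / AX_NM      # > 1: pull-down beats access, so a read does not flip the cell
pullup_ratio = PU_NM / AX_NM    # < 1: access beats pull-up, so a write can overpower the cell

array_width_um = 16 * 4 * sum_widths_um   # summed width across all 64 cells

print(f"sum of widths per cell: {sum_widths_um} um")
print(f"cell ratio {cell_ratio}, pull-up ratio {pullup_ratio}")
print(f"sum of widths across the array: {array_width_um:.3f} um")
```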

One Full Clock Cycle

The easiest way to see the whole design work at once is to overlay every relevant signal on one clock period. I ran characterization at 320 ps (3.125 GHz), which is a deliberately safe choice: about 7% above the ~298 ps absolute minimum found by binary search further down, so no waveform is sitting right on its own failure boundary.

One 320 ps clock cycle with phi1, phi2, PCHb, WL1, BL0/BLb0, SAE, and the sense-amp output Q/D0 overlaid, showing the four phases: precharge, non-overlap, access pre-SAE, and access post-SAE

One 320 ps cycle, split into four phases. During precharge (roughly 0–150 ps), PCHb is low and BL/BLb are pulled back up to V_DD. A short non-overlap gap separates precharge from access. When φ₂ rises, WL fires, the bitlines start to develop a differential, and a delayed SAE arms the sense amp only after that differential is past 50 mV. Q/D resolves cleanly well before the next precharge begins.

Functional Verification

Correctness was validated bottom-up, from the bitcell up to the full array. The most interesting test is mainTest2: two different 4-bit words written to two different addresses, then read back. This checks that the decoder routes the correct row and that storing one word does not disturb another.

mainTest2 SPICE waveforms: 1001 written to address 0000, 0110 written to address 1111, then both read back across eight nanoseconds

mainTest2: 1001 → address 0000, then 0110 → address 1111, then read both back. Each read matches the written word exactly; no cross-talk between rows.
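The intent of mainTest2 can be expressed as a behavioral reference model. The sketch below is not the SPICE testbench: the class and method names are hypothetical, timing is ignored entirely, and it only captures the expected outcome, namely a one-hot row decode and no disturbance between rows.

```python
class Sram16x4:
    """Behavioral 16-word by 4-bit memory with a one-hot row decoder (no timing)."""

    def __init__(self) -> None:
        self.rows = [0] * 16

    @staticmethod
    def word_lines(addr: int) -> list[int]:
        """4-to-16 decode: exactly one word-line is high for a valid address."""
        if not 0 <= addr < 16:
            raise ValueError("address must fit in 4 bits")
        return [int(row == addr) for row in range(16)]

    def write(self, addr: int, word: int) -> None:
        for row, wl in enumerate(self.word_lines(addr)):
            if wl:
                self.rows[row] = word & 0xF   # only the selected row is overwritten

    def read(self, addr: int) -> int:
        selected = [self.rows[row] for row, wl in enumerate(self.word_lines(addr)) if wl]
        return selected[0]


mem = Sram16x4()
mem.write(0b0000, 0b1001)
mem.write(0b1111, 0b0110)
first, second = mem.read(0b0000), mem.read(0b1111)
print(f"{first:04b} {second:04b}")
```

A model like this gives the expected read-back words that the simulated Q/D outputs can be compared against.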

How Fast Can It Go?

Minimum operating period was found with a binary search: shrink the clock period until reads stop matching the values that were written, then bracket the break point.

Clock-period binary search: left panel plots read-back voltage vs clock period and shows it transitioning through VDD/2 near 298 ps; right panel shows pass/fail bars across frequencies

Binary search for T_min. The read-back voltage on a known-0 cycle sits cleanly at 0 V for long periods and snaps up to V_DD for short ones; the transition crosses the V_DD/2 decision threshold near 298 ps, setting f_max ≈ 3.36 GHz, well above the project's 500 MHz requirement. The 320 ps operating point used for the earlier waveforms and for the power measurement sits about 7% above this break point, so the reported numbers are not riding the edge of correctness.
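The search itself is an ordinary bisection over the clock period. In the sketch below, reads_match is a hypothetical stand-in for one full SPICE run plus read-back comparison; here it is stubbed with the reported 298 ps break point so the example runs on its own. It also assumes pass/fail is monotonic in the period, which is what makes bisection valid.

```python
from typing import Callable


def reads_match(period_ps: float) -> bool:
    """Hypothetical oracle: in practice, run SPICE at this period and compare read-back words."""
    return period_ps >= 298.0   # stub using the reported break point


def find_t_min(passes: Callable[[float], bool],
               fail_ps: float, pass_ps: float, tol_ps: float = 1.0) -> float:
    """Bisect between a known-failing and a known-passing period; return the passing bound."""
    if passes(fail_ps) or not passes(pass_ps):
        raise ValueError("need a failing lower bound and a passing upper bound")
    while pass_ps - fail_ps > tol_ps:
        mid = 0.5 * (fail_ps + pass_ps)
        if passes(mid):
            pass_ps = mid       # still correct: try a shorter period
        else:
            fail_ps = mid       # reads broke: back off
    return pass_ps


# Bracket: 100 ps (assumed to fail) up to 2000 ps (the 500 MHz requirement).
t_min_ps = find_t_min(reads_match, fail_ps=100.0, pass_ps=2000.0)
print(f"T_min is within 1 ps below {t_min_ps:.1f} ps")
```

Each iteration costs one simulation, so bracketing 100 ps to 2000 ps down to 1 ps resolution takes about eleven runs.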

Where the Limit Is, and How I'd Push Past It

At T_min the read path still finishes in roughly 92 ps. The write path is what gives out first. Looking back at the one-cycle plot, roughly half of every period is spent pulling BL and BLb back up to V_DD during precharge. That long φ₁ is what squeezes the access budget and pins T_min where it is.

The immediate next step would be to upsize the precharge PMOS pair in the column driver (currently W = 2). A wider device pulls the bitlines up faster, shortens φ₁, and redistributes the freed-up time to the access phase. That translates into a direct drop in T_min without touching any of the more sensitive blocks. I didn't have time to sweep the PMOS width and re-run the search, but it's the one lever I'd pull first.

Summary of Metrics

Process	22nm PTM HP
Supply V_DD	0.8 V
Array Size	16 × 4
Minimum Clock Period T_min	298 ps
Maximum Frequency f_max	3.36 GHz
Worst-case Access Delay	91.9 ps
Average Power	54.74 μW
Bitcell Area (Σ widths)	0.308 μm
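A few derived quantities follow directly from the table. This sketch recomputes f_max and the operating margin, and estimates energy per cycle on the assumption that the average power was measured at the 320 ps operating point, as the binary-search caption states. The names are mine and the results are plain arithmetic, not additional measurements.

```python
T_MIN_PS = 298.0        # minimum clock period from the binary search
T_OP_PS = 320.0         # operating point used for waveforms and power
P_AVG_UW = 54.74        # average power at the operating point
SPEC_MHZ = 500.0        # project requirement

f_max_ghz = 1000.0 / T_MIN_PS                      # ps -> GHz
f_op_ghz = 1000.0 / T_OP_PS
margin_pct = 100.0 * (T_OP_PS / T_MIN_PS - 1.0)    # how far the operating point sits above T_min
headroom = f_max_ghz * 1000.0 / SPEC_MHZ           # f_max relative to the 500 MHz spec
energy_per_cycle_fj = P_AVG_UW * T_OP_PS / 1000.0  # uW * ps = 1e-18 J = 1e-3 fJ

print(f"f_max = {f_max_ghz:.2f} GHz, operating at {f_op_ghz:.3f} GHz")
print(f"margin over T_min = {margin_pct:.1f}%, headroom over spec = {headroom:.1f}x")
print(f"energy per cycle = {energy_per_cycle_fj:.1f} fJ")
```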

This page is the highlight reel. The full report goes through every sub-block in detail: the 6T bitcell sizing, the two-phase clock circuit, the 4-to-16 decoder, the column driver, the isolated latch-type sense amp, the read/write input/output (I/O) block, the six timing invariants that guarantee correct operation, per-stage critical-path tables for both read and write, and the full figure-of-merit derivation.

Read the full report (PDF, 18 pages) →

Both of these projects are finished for now, but it would be interesting to combine them: the 8-bit ripple-carry adder from Project 1 and the 16×4 SRAM from Project 2 already cover the datapath and storage halves of a digital system, and wrapping them together with a small controller and a few logical operations would turn them into a simple arithmetic logic unit (ALU). A possible follow-on if I come back to this.