
TCP Protocol Internals: A Deep Research

Every time you load a webpage, send an email, transfer a file, or SSH into a remote server, the Transmission Control Protocol (TCP) is silently doing the heavy lifting beneath the surface. It is the backbone of reliable communication on the Internet — yet most developers and engineers interact with it only through high-level socket APIs, never peeling back the layers to understand what actually happens on the wire and inside the kernel, or just how elegant that machinery is.

TCP is deceptively simple on the surface: it takes a stream of bytes from an application, delivers them reliably and in order to the other end, and handles all the messy realities of packet loss, network congestion, reordering, and flow control in between. But behind that clean abstraction lies over four decades of engineering — a state machine with 11 states, multiple retransmission strategies, a family of congestion control algorithms that are still actively researched today, and a header design that has been carefully extended through options while maintaining backward compatibility with implementations from the 1980s.

Whether you’re a backend engineer debugging latency spikes in production, a network engineer tuning kernel parameters for high-throughput servers, a security researcher analyzing TCP-based attacks, or a student preparing for a systems interview — understanding TCP at the protocol level gives you a powerful mental model for reasoning about networked systems.

In this post, we go far beyond the textbook three-way handshake. We’ll walk through the byte-level segment structure, dissect how sequence numbers, acknowledgments, and sliding windows actually work, explore the evolution of congestion control from Tahoe to BBR, examine the TCP state machine transition by transition, cover modern extensions like MPTCP, TCP Fast Open, RACK, and TLP, and look at how all of this is implemented inside the Linux kernel. By the end, you’ll have a comprehensive, reference-grade understanding of TCP internals.

Table of Contents

  1. Introduction & Historical Context
  2. TCP in the Protocol Stack
  3. TCP Segment Structure
  4. Connection Lifecycle
  5. Reliable Data Transfer Mechanisms
  6. Flow Control
  7. Congestion Control
  8. TCP Timers
  9. TCP Options
  10. TCP State Machine
  11. Advanced Topics & Modern Extensions
  12. Security Considerations
  13. TCP vs. Other Transport Protocols
  14. Implementation Details (Kernel Perspective)
  15. Conclusion

1. Introduction & Historical Context

What is TCP?

The Transmission Control Protocol (TCP) is a connection-oriented, reliable, byte-stream transport-layer protocol defined primarily in RFC 793 (1981), with numerous subsequent RFCs refining and extending it. It is one of the two original core protocols of the Internet Protocol Suite (the other being UDP), and it provides the reliability layer upon which the vast majority of Internet applications — HTTP, SMTP, FTP, SSH, TLS — are built.

Historical Evolution

Year   Milestone
1974   Vint Cerf & Bob Kahn publish “A Protocol for Packet Network Intercommunication” — TCP and IP were originally a single protocol
1978   TCP split into TCP (transport) and IP (network) — TCP/IP model born
1981   RFC 793 — the foundational TCP specification
1983   ARPANET switches from NCP to TCP/IP (“Flag Day”, January 1)
1988   Van Jacobson introduces congestion control algorithms (Tahoe) after “congestion collapse” events
1990   TCP Reno — Fast Recovery added
1996   RFC 2018 — Selective Acknowledgments (SACK)
2004   RFC 3782 — NewReno refinements
2006   CUBIC congestion control (default in Linux)
2013   RFC 6824 — Multipath TCP (MPTCP)
2016   BBR congestion control by Google
2022   RFC 9293 — consolidates and obsoletes RFC 793 and multiple related RFCs

Design Philosophy

TCP was designed around several key principles:

  • End-to-end principle: Intelligence at the endpoints, dumb network core
  • Robustness principle (Postel’s Law): “Be conservative in what you send, be liberal in what you accept”
  • Layered architecture: TCP is agnostic to what IP does below and what applications do above

2. TCP in the Protocol Stack

TCP sits between the application layer and the network layer. It takes a byte stream from the application, segments it, hands segments to IP, and on the receiving end, reassembles them back into an ordered byte stream.

Key Abstractions TCP Provides Over IP

3. TCP Segment Structure

3.1 Header Format

Minimum header size: 20 bytes (no options)
Maximum header size: 60 bytes (with options)

3.2 Field-by-Field Deep Dive

Source Port & Destination Port (16 bits each)

  • Together with source/destination IP addresses, they form a 4-tuple that uniquely identifies a TCP connection (socket pair).
  • Well-known ports: 0–1023; Registered: 1024–49151; Dynamic/Ephemeral: 49152–65535.
  • The ephemeral port range is OS-dependent (Linux default: 32768–60999).

Sequence Number (32 bits)

  • Identifies the byte position within the stream of the first data byte in this segment.
  • If SYN is set, this is the Initial Sequence Number (ISN), and the first data byte is ISN+1.
  • 32-bit space = 4,294,967,296 bytes before wrapping — at 10 Gbps, this wraps in ~3.4 seconds, creating challenges addressed by PAWS (Protection Against Wrapped Sequences).

Acknowledgment Number (32 bits)

  • The next sequence number the sender of the ACK expects to receive.
  • This is a cumulative acknowledgment: it implicitly acknowledges all bytes up to (but not including) this number.
  • Only meaningful when the ACK flag is set (which is virtually always after the SYN).

Data Offset (4 bits)

  • Specifies the size of the TCP header in 32-bit words.
  • Minimum value: 5 (= 20 bytes); Maximum value: 15 (= 60 bytes).
  • Tells the receiver where the data begins.

Reserved (4 bits)

  • Originally 6 bits (reduced as CWR and ECE flags were added).
  • Must be set to zero.

Flags (Control Bits) — 8 bits

Detailed Flag Semantics:

  • SYN: Consumes one sequence number. Used in the three-way handshake. Carries initial parameters (MSS, window scale, SACK-permitted, timestamps, etc.) as options.
  • FIN: Also consumes one sequence number. Indicates the sender’s byte stream has ended. The connection is half-closed — the other side can still send data.
  • RST: Immediately terminates the connection. No ACK is expected or required. Common triggers: connection to a closed port, aborting a connection, responding to invalid segments.
  • PSH: An advisory flag telling the receiving TCP stack to deliver data to the application without waiting for buffer fill. Many implementations set this on every segment containing data.
  • URG: Historically used for “out-of-band” data (e.g., Telnet interrupt). Largely deprecated; RFC 6093 discourages its use.
  • ACK: Set on virtually every segment after the initial SYN. A segment with only ACK (no data) is sometimes called a “bare ACK” or “pure ACK.”

Window Size (16 bits)

  • The number of bytes the sender of this segment is willing to accept (the receive window, rwnd).
  • With the Window Scale option, this is a base value that gets left-shifted, allowing windows up to 2^30 = 1 GiB.

Checksum (16 bits)

  • Covers the pseudo-header (source IP, destination IP, protocol number, TCP length), the TCP header, and the data.
  • Computed as the 16-bit one’s complement of the one’s complement sum of all 16-bit words.
  • Mandatory in TCP (unlike UDP where it’s optional in IPv4).
  • Known to be weak — does not detect all error patterns. Modern NICs often perform checksum offloading.

Urgent Pointer (16 bits)

  • An offset from the sequence number indicating the last byte of urgent data.
  • Only meaningful when URG is set.
  • Largely obsolete.

Options (variable)

4. Connection Lifecycle

4.1 Three-Way Handshake (Connection Establishment)

Why Three Ways?

The three-way handshake solves three problems:

  1. Both sides agree to communicate (mutual consent)
  2. Both sides synchronize sequence numbers (x and y are both communicated and acknowledged)
  3. Prevents old duplicate connection initiations from being accepted (the ISN validation)

Initial Sequence Number (ISN) Selection:

  • ISNs must be unpredictable to prevent TCP sequence prediction attacks (RFC 6528).
  • Modern implementations use a combination of:
    • A secret key
    • Source/destination IP and port
    • A clock-based component
    • Cryptographic hash (e.g., MD5, SipHash)
  • Linux uses SipHash over the 4-tuple plus a secret, plus a time-based component.

SYN Queue and Accept Queue:

On the server side, the kernel maintains two queues:

  • SYN Queue: Stores connections in SYN_RCVD state. Size governed by tcp_max_syn_backlog.
  • Accept Queue: Stores fully established connections waiting for accept(). Size governed by min(backlog, somaxconn).

SYN Cookies (RFC 4987):
When the SYN queue is full, Linux can use SYN cookies — the server encodes connection state into the ISN of the SYN-ACK, avoiding the need to store any state. The ISN encodes:

  • A timestamp (5 bits, granularity ~64 seconds)
  • MSS index (3 bits)
  • A cryptographic hash of the 4-tuple and the timestamp

Trade-off: TCP options from the SYN (like window scaling, SACK, timestamps) are lost since no state is stored.

4.2 Simultaneous Open

Both sides can simultaneously send SYN to each other, resulting in a four-way handshake:

This is rare but fully supported by the TCP specification.

4.3 Connection Termination

Normal Close: Four-Way Handshake

Three-Way Close (Piggyback)

If the server has no more data, it can combine its ACK and FIN:

Half-Close

TCP supports half-duplex close: one side sends FIN (indicating it’s done sending), but the other side can continue sending data. This is used by applications like HTTP/1.0 where the server sends FIN after the response body, but the client’s request was already complete.

4.4 TIME_WAIT State

Duration: 2 × MSL (Maximum Segment Lifetime). RFC 793 defines MSL as 2 minutes, so TIME_WAIT = 4 minutes. Linux uses 60 seconds (hardcoded as TCP_TIMEWAIT_LEN).

Why TIME_WAIT exists:

  1. Reliable termination: If the final ACK is lost, the peer will retransmit its FIN. The TIME_WAIT state ensures the endpoint can still respond.
  2. Prevent old duplicates: Ensures that delayed segments from this connection don’t get misinterpreted as belonging to a new connection using the same 4-tuple.

TIME_WAIT Implications:

  • On busy servers, thousands of sockets in TIME_WAIT can exhaust ephemeral ports or consume memory.
  • Linux mitigations:
    • tcp_tw_reuse: Allows reusing TIME_WAIT sockets for outgoing connections if the timestamp is newer.
    • SO_REUSEADDR / SO_REUSEPORT: Allow binding to addresses/ports already in use.
    • tcp_max_tw_buckets: Limits total TIME_WAIT sockets.
    • tcp_tw_recycle (removed in Linux 4.12 — it was broken behind NATs).

4.5 Reset (RST)

RST immediately tears down a connection. Common scenarios:

Scenario                        Description
Connection to closed port       SYN arrives at a port with no listener
Aborting a connection           Application calls close() with SO_LINGER set to 0
Half-open connection detection  One side crashed and rebooted; the other side sends data and gets RST
Firewall intervention           Middlebox injects RST to terminate connections

RST attacks: An attacker who can guess the sequence number in the receive window can inject an RST and terminate a connection. Mitigated by:

  • Randomized ISNs
  • RFC 5961: Requires RST sequence number to match exactly rcv.nxt, or else send a “challenge ACK”

5. Reliable Data Transfer Mechanisms

5.1 Sequence Numbers and Acknowledgments

TCP provides reliability through a combination of:

  • Sequence numbers: Every byte of data is numbered.
  • Cumulative ACKs: The ACK number indicates “I’ve received everything up to this byte.”
  • Retransmission: Lost segments are retransmitted.

5.2 Retransmission Strategies

Timeout-Based Retransmission (RTO)

If an ACK is not received within the Retransmission Timeout (RTO), the segment is retransmitted.

RTO Calculation (RFC 6298):

Exponential Backoff: After each timeout, RTO is doubled (capped at an upper bound). This is Karn’s algorithm — retransmitted segments are not used to update RTT estimates (ambiguity problem).

Fast Retransmit (RFC 5681)

Instead of waiting for a timeout, the sender retransmits upon receiving 3 duplicate ACKs (4 ACKs for the same sequence number total):

This is much faster than waiting for RTO, which can be hundreds of milliseconds to seconds.

Selective Acknowledgment (SACK) — RFC 2018

Cumulative ACKs are wasteful when there are multiple losses in a window. SACK allows the receiver to inform the sender about non-contiguous blocks that have been received:

This tells the sender: “I’m missing 1500-1999 and 2500-2999, but I have the rest.”

SACK internals:

  • Negotiated during the handshake via the SACK-Permitted option.
  • SACK blocks are carried in the SACK option (kind=5) in ACK segments.
  • Up to 4 SACK blocks per segment (limited by option space, especially if timestamps are used — then only 3 blocks).
  • The sender maintains a scoreboard tracking which bytes have been SACKed, allowing it to retransmit only truly lost segments.

Duplicate SACK (D-SACK) — RFC 2883

Extends SACK to indicate that a segment was received more than once. This helps the sender distinguish between:

  • Genuine packet loss
  • Packet reordering
  • ACK loss
  • Spurious retransmissions

5.3 Retransmission Ambiguity

When a retransmitted segment is ACKed, the sender can’t tell if the ACK is for the original or the retransmission. Solutions:

  • Karn’s Algorithm: Don’t update RTT estimates on retransmitted segments.
  • TCP Timestamps (RFC 7323): Each segment carries a timestamp; the ACK echoes it back, disambiguating RTT measurement.

6. Flow Control

6.1 Sliding Window Mechanism

TCP uses a sliding window protocol for flow control. The receiver advertises a receive window (rwnd) — the number of bytes it’s willing to accept.

Key variables (sender):

  • SND.UNA — oldest unacknowledged byte
  • SND.NXT — next byte to send
  • SND.WND — send window (= receiver’s advertised rwnd)
  • Usable window = SND.UNA + SND.WND - SND.NXT

Key variables (receiver):

  • RCV.NXT — next expected byte
  • RCV.WND — receive window (advertised to sender)

6.2 Zero Window and Window Probes

When the receiver’s buffer is full, it advertises rwnd = 0. The sender must stop sending data. To recover:

  1. The sender starts a persist timer.
  2. When it fires, the sender sends a window probe — a segment with 1 byte of data (or zero-length).
  3. The receiver responds with an ACK containing the current window size.
  4. If still zero, the sender backs off exponentially and probes again.

This prevents deadlock: without probes, a window update from the receiver could be lost, and both sides would wait forever.

6.3 Silly Window Syndrome (SWS)

Problem: If the receiver opens the window by tiny amounts, and the sender sends tiny segments, efficiency plummets (high overhead-to-data ratio).

Solutions:

Receiver side (Clark’s algorithm / RFC 1122):

  • Don’t advertise a window increase until the window is at least min(MSS, buffer_size/2).

Sender side (Nagle’s algorithm — RFC 896):

Nagle’s algorithm reduces the number of small segments (“tinygrams”) on the network. However, it can introduce latency for interactive applications (e.g., SSH, gaming), so it can be disabled with TCP_NODELAY.

Interaction with Delayed ACKs: Nagle + delayed ACKs can cause pathological 200ms delays. When an application does two small writes followed by a read (common in request-response protocols), Nagle holds the second write until the first is ACKed, and delayed ACK holds the ACK for up to 200ms. Solutions: TCP_NODELAY, TCP_CORK, or using vectored I/O (writev).

6.4 Window Scaling — RFC 7323

The 16-bit window field limits the advertised window to 65,535 bytes. This is insufficient for high-bandwidth, high-latency paths, where the bandwidth-delay product far exceeds 64 KB.

For example, on a path with a 100 ms RTT, a 64 KB window would only allow ~5.2 Mbps of throughput (65,535 bytes / 0.1 s), regardless of link capacity.

Window Scale option: Negotiated in the SYN/SYN-ACK, specifies a shift count (0–14):

7. Congestion Control

Congestion control is arguably the most complex and researched aspect of TCP. It controls the sending rate to avoid overwhelming the network.

7.1 Core Concept: The Congestion Window

The effective sending window is:

Where:

  • cwnd = congestion window (sender’s estimate of network capacity)
  • rwnd = receive window (receiver’s buffer capacity)

7.2 Classic Algorithms (TCP Reno family)

Slow Start (RFC 5681)

Despite the name, slow start is exponential growth:

Congestion Avoidance

Linear growth (additive increase):

This is the AIMD (Additive Increase, Multiplicative Decrease) phase.

Loss Detection and Response

TCP Tahoe (1988):

  • On any loss (timeout or 3 dup ACKs):
    • ssthresh = cwnd / 2
    • cwnd = 1 MSS
    • Enter Slow Start

TCP Reno (1990):

  • On timeout:
    • Same as Tahoe
  • On 3 duplicate ACKs (Fast Retransmit + Fast Recovery):
    • ssthresh = cwnd / 2
    • cwnd = ssthresh + 3 MSS (inflate for the 3 dup ACKs)
    • For each additional dup ACK: cwnd += MSS
    • When new ACK arrives: cwnd = ssthresh (deflate), enter Congestion Avoidance

TCP NewReno (RFC 3782):

  • Fixes Reno’s behavior with multiple losses in a single window.
  • Stays in Fast Recovery until all data outstanding at the time of loss detection is ACKed (tracks the “recovery point”).
  • Handles partial ACKs (ACKs that advance SND.UNA but don’t cover the recovery point) by retransmitting the next suspected lost segment.

7.3 SACK-based Loss Recovery

With SACK, the sender maintains a scoreboard and can precisely retransmit only lost segments:

This is far more efficient than Reno/NewReno for multiple losses.

7.4 Proportional Rate Reduction (PRR) — RFC 6937

Modern Linux uses PRR instead of classic Fast Recovery. PRR smoothly reduces cwnd during recovery rather than the sharp halving and re-inflation of Reno:

7.5 CUBIC — RFC 8312

Default congestion control in Linux since 2.6.19 (2006).

CUBIC uses a cubic function of time since the last congestion event to determine cwnd:

Key properties:

  • Window-based, not rate-based
  • RTT-fairness: The cubic function is based on elapsed time, not RTTs, making it fairer across connections with different RTTs (unlike Reno where high-RTT connections grow slower)
  • Aggressive growth far from W_max, conservative near it (the flat part of the cubic curve provides stability)
  • TCP-friendly region: Falls back to Reno-like behavior when CUBIC would be less aggressive

7.6 BBR (Bottleneck Bandwidth and RTT) — Google, 2016

BBR represents a paradigm shift from loss-based to model-based congestion control.

Core philosophy: Loss is not a reliable signal of congestion. In networks with deep buffers, loss-based algorithms fill buffers, causing bufferbloat (high latency). BBR instead estimates:

  1. BtlBw — bottleneck bandwidth (maximum delivery rate)
  2. RTprop — round-trip propagation delay (minimum RTT)

The optimal operating point is to keep exactly one bandwidth-delay product in flight:

    inflight = BtlBw × RTprop

This is the Kleinrock optimal operating point: maximum throughput with minimum delay.

BBR State Machine:

BBR pacing: Unlike window-based algorithms, BBR controls the rate at which packets are sent using pacing (spacing packets evenly), reducing burstiness.

BBR versions:

  • BBRv1: Initial version, known for unfairness to loss-based flows and intra-protocol unfairness
  • BBRv2: Addresses fairness and excessive loss issues; uses ECN signals, pacing improvements
  • BBRv3: Further refinements (still evolving as of 2024)

7.7 ECN (Explicit Congestion Notification) — RFC 3168

Instead of dropping packets to signal congestion, ECN-capable routers can mark packets:

Flow:

  1. Sender sets ECT(0) or ECT(1) in IP packets
  2. Congested router changes ECT → CE (instead of dropping)
  3. Receiver sees CE, sets ECE flag in subsequent ACKs
  4. Sender sees ECE, reduces cwnd, sets CWR flag
  5. Receiver sees CWR, stops sending ECE

Benefits: Avoids packet loss entirely; enables faster congestion response (especially with algorithms like DCTCP that react proportionally to the fraction of marked packets).

7.8 Summary of Congestion Control Algorithms

8. TCP Timers

TCP maintains several timers per connection:

8.1 Retransmission Timer (RTO Timer)

  • Set when a segment is sent and no ACK is pending.
  • Fires when an ACK hasn’t arrived within the RTO.
  • On expiry: retransmit the oldest unacknowledged segment, double the RTO (exponential backoff).
  • Cleared when all outstanding data is acknowledged.

8.2 Persist Timer

  • Set when the receiver advertises rwnd = 0.
  • Fires to trigger a window probe.
  • Exponential backoff, but never gives up (connections can survive zero-window indefinitely).

8.3 Keepalive Timer

  • Optional mechanism to detect dead connections (RFC 1122).
  • Default: After 2 hours of inactivity, send a keepalive probe.
  • If no response arrives after tcp_keepalive_probes (default 9) probes, sent tcp_keepalive_intvl (default 75 seconds) apart, the connection is considered dead.
  • Linux parameters: tcp_keepalive_time (7200), tcp_keepalive_intvl (75), tcp_keepalive_probes (9)
  • Total timeout = 7200 + 75 × 9 = 7875 seconds ≈ 2.2 hours

8.4 TIME_WAIT Timer (2MSL Timer)

  • Duration: 2 × MSL (60 seconds on Linux).
  • Ensures the connection 4-tuple is not reused too soon.

8.5 Delayed ACK Timer

  • TCP delays ACKs by up to 40ms (Linux) or 200ms (RFC recommendation) to piggyback ACKs on data going the other direction.
  • An ACK is sent immediately if:
    • Two full-size segments received (every-other-segment ACK)
    • An out-of-order segment arrives
    • The delayed ACK timer expires

8.6 FIN_WAIT_2 Timer

  • In FIN_WAIT_2 state, if the connection is orphaned (closed by application), Linux sets a timer (tcp_fin_timeout, default 60 seconds) to prevent indefinite wait.

9. TCP Options

TCP options provide extensibility. They are carried in the TCP header between the fixed 20-byte header and the data.

Key Options

Kind     Length  Name            Description                    RFC
0        1       End of Options  Marks end of options list      793
1        1       NOP             Padding/alignment              793
2        4       MSS             Maximum Segment Size           793
3        3       Window Scale    Window scaling factor          7323
4        2       SACK Permitted  Enables SACK                   2018
5        var     SACK            Selective ACK blocks           2018
8        10      Timestamps      TSval and TSecr                7323
28       4       User Timeout    UTO                            5482
29       var     TCP-AO          Authentication Option          5925
30       var     Multipath TCP   MPTCP signaling                6824
34       var     TCP Fast Open   TFO cookie                     7413
253–254  var     Experimental    Reserved for experimental use  6994

9.1 MSS (Maximum Segment Size) — Kind 2

  • Sent only in SYN segments.
  • Declares the largest segment the sender is willing to receive.
  • Does NOT include IP or TCP headers.
  • Typical values:
    • Ethernet: 1460 bytes (1500 MTU – 20 IP – 20 TCP)
    • IPv6: 1440 bytes (1500 – 40 IPv6 – 20 TCP)
    • Loopback: 65495 bytes
  • If not present, defaults to 536 bytes.
  • Path MTU Discovery (RFC 1191) can further constrain the effective MSS.

9.2 Timestamps — Kind 8

  • TSval (Timestamp Value): Sender’s current timestamp clock.
  • TSecr (Timestamp Echo Reply): Echoes the most recent TSval received from the peer.

Uses:

  1. RTTM (Round-Trip Time Measurement): More accurate than ACK-based RTT estimation, works with SACK.
  2. PAWS (Protection Against Wrapped Sequences): Detects old duplicate segments even after sequence numbers wrap. Uses timestamps as a 32-bit extension of the sequence space.

9.3 TCP Fast Open (TFO) — RFC 7413

Allows data to be carried in the SYN packet of a connection, saving one RTT:

Security: The cookie (generated by the server using a secret key) prevents blind SYN+data flooding from spoofed IPs.

Limitations: Application must be idempotent for the SYN data (since it might be replayed). Not universally deployed due to middlebox interference.

10. TCP State Machine

The TCP state machine has 11 states:

State Descriptions

State        Description
CLOSED       No connection exists
LISTEN       Server waiting for incoming SYN
SYN_SENT     Client has sent SYN, awaiting SYN-ACK
SYN_RCVD     Server has received SYN, sent SYN-ACK, awaiting ACK
ESTABLISHED  Connection is open; data transfer in progress
FIN_WAIT_1   Application has closed; FIN sent, awaiting ACK
FIN_WAIT_2   FIN has been ACKed; awaiting peer’s FIN
CLOSE_WAIT   Received FIN from peer; waiting for application to close
CLOSING      Both sides sent FIN simultaneously; awaiting ACK
LAST_ACK     Sent FIN after receiving peer’s FIN; awaiting final ACK
TIME_WAIT    Waiting 2×MSL before fully closing

Monitoring states: use netstat -ant or ss -ant on Linux.

11. Advanced Topics & Modern Extensions

11.1 Multipath TCP (MPTCP) — RFC 6824 / RFC 8684

MPTCP allows a single TCP connection to use multiple network paths simultaneously:

Key Features:

  • Backward compatible: Falls back to regular TCP if middleboxes interfere
  • Subflow management: Add/remove paths dynamically
  • Coupled congestion control: Ensures MPTCP doesn’t take unfair share at shared bottlenecks
  • Used by: Apple (iOS Siri since iOS 7, all iOS/macOS apps can use it), Linux kernel (upstream since 5.6)

11.2 TCP in Data Centers

Data center TCP has unique requirements: very low latency, high bandwidth, shallow buffers.

DCTCP (Data Center TCP) — RFC 8257:

  • Uses ECN marks proportionally: Instead of halving cwnd on any ECN mark (like classic ECN), DCTCP reduces cwnd proportionally to the fraction of marked packets.
  • Maintains very low queue occupancy.
  • cwnd = cwnd × (1 - α/2) where α ∈ [0,1] is the moving average of the fraction of marked packets.

Other data center innovations:

  • NDP (SIGCOMM 2017): Receiver-driven flow control
  • HPCC (High Precision Congestion Control): Uses in-network telemetry (INT)
  • Swift (Google): Delay-based, fabric-aware congestion control

11.3 TCP over Wireless/Lossy Links

TCP’s assumption that packet loss = congestion is wrong for wireless links, where random bit errors cause loss.

Approaches:

  • Freeze-TCP: Receiver proactively advertises zero window before handoff
  • Westwood+: Uses ACK rate to estimate bandwidth, avoids aggressive cwnd reduction on random loss
  • Link-layer ARQ: Retransmit at the link layer (802.11 retransmissions) before TCP notices
  • Split TCP: Performance-enhancing proxies (PEPs) terminate TCP at the wireless boundary

11.4 TCP Offloading

Modern NICs can offload TCP processing:

Offload Type                    Description
Checksum Offload                NIC computes/verifies TCP checksum
TSO (TCP Segmentation Offload)  Kernel sends large (up to 64KB) segments; NIC splits into MSS-sized segments
GRO (Generic Receive Offload)   NIC/driver aggregates multiple segments into one large segment for the kernel
LRO (Large Receive Offload)     Hardware-based aggregation (less flexible than GRO)
TOE (TCP Offload Engine)        Full TCP stack on the NIC — controversial; limited adoption

11.5 Tail Loss Probe (TLP) — RFC 8985

TLP addresses the problem of tail losses — losses at the end of a transaction that can only be recovered via RTO (since no further data triggers dup ACKs):

11.6 RACK (Recent ACKnowledgment) — RFC 8985

RACK uses time-based loss detection instead of counting duplicate ACKs:

Advantages over dup-ACK counting:

  • Works with SACK and non-SACK
  • Not confused by reordering
  • Not limited by the “3 dup ACK” threshold
  • Better for connections with small windows (fewer than 4 packets in flight)

11.7 TCP_NOTSENT_LOWAT

Controls how much unsent data can be buffered in the kernel socket buffer. This is crucial for latency-sensitive applications (e.g., video streaming, gaming):

int val = 16384; // 16 KB
setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &val, sizeof(val));

When the unsent data drops below this threshold, epoll/select reports the socket as writable, allowing the application to generate fresh data instead of buffering stale data.

12. Security Considerations

12.1 SYN Flood Attack

Attack: Attacker sends massive numbers of SYN packets with spoofed source IPs. The server allocates resources for each half-open connection, exhausting memory/CPU.

Mitigations:

  • SYN cookies (eliminate server-side state for half-open connections)
  • SYN proxies (middlebox completes handshake before forwarding)
  • Rate limiting SYN packets
  • Increasing backlog size
  • Firewalls with SYN flood protection

12.2 TCP Reset Attack

Attack: Attacker injects RST segments with guessed sequence numbers to terminate connections.

Mitigations:

  • RFC 5961: RST must have sequence number exactly equal to RCV.NXT to be accepted immediately; otherwise, a “challenge ACK” is sent.
  • Randomized ISNs
  • TCP-AO (Authentication Option)
  • IPsec

12.3 TCP Hijacking / Injection

Attack: Attacker injects data into an established connection by guessing sequence and acknowledgment numbers.

Mitigations:

  • Randomized ISNs
  • Encrypted transport (TLS)
  • TCP-AO (RFC 5925) — cryptographic authentication of segments
  • IPsec (AH or ESP)

12.4 TCP-AO (Authentication Option) — RFC 5925

Replaces the older TCP MD5 Signature Option (RFC 2385, used heavily for BGP):

  • Uses HMAC with configurable algorithms
  • Supports key rollover
  • Provides per-segment integrity and authentication

12.5 Side-Channel Attacks

  • CVE-2016-5696: Linux vulnerability where the global challenge_ack_limit rate limiter could be used as a side channel to infer sequence numbers of connections between two other hosts (off-path attack).
  • Mitigations: Randomized challenge ACK limits, noise injection.

12.6 TCP and Firewalls / NATs

TCP’s stateful nature means firewalls and NATs maintain connection tracking tables:

Issues:

  • NAT timeout can silently drop idle connections (TCP keepalive helps)
  • Stateful firewalls can be overwhelmed by many connections
  • Middlebox interference with TCP options (window scale, timestamps, SACK, TFO)
  • MPTCP designed to be middlebox-friendly (falls back gracefully)

13. TCP vs. Other Transport Protocols

TCP vs. UDP

Feature                TCP                       UDP
Connection             Connection-oriented       Connectionless
Reliability            Guaranteed delivery       Best effort
Ordering               Ordered                   Unordered
Flow control           Yes (sliding window)      No
Congestion control     Yes                       No
Head-of-line blocking  Yes (in-order delivery)   No
Header size            20–60 bytes               8 bytes
Use cases              HTTP, SMTP, SSH, FTP      DNS, VoIP, gaming, video streaming

TCP vs. SCTP (Stream Control Transmission Protocol)

Feature             TCP                Nbsp              SCTP
Streams             Single byte stream                   Multiple independent streams
HOL blocking        Yes                                  No (per-stream ordering)
Multi-homing        No                                   Yes (multiple IP addresses per endpoint)
Message boundaries  No (byte stream)                     Yes (message-oriented)
Connection setup    3-way handshake                      4-way handshake (with cookie for anti-DoS)
Adoption            Universal                            Limited (WebRTC data channels, telecom)

TCP vs. QUIC

QUIC (RFC 9000) was designed to fix TCP’s limitations:

Feature                 TCP                            QUIC
Transport               Kernel-space                   User-space (over UDP)
Encryption              Optional (TLS layered on top)  Mandatory (TLS 1.3 integrated)
Connection setup        1–3 RTTs (TCP + TLS)           0–1 RTT
HOL blocking            Yes                            No (independent streams)
Connection migration    No (tied to 4-tuple)           Yes (connection ID)
Middlebox ossification  Severe                         Minimal (encrypted headers)
Congestion control      Kernel-managed                 Application-managed
Loss recovery           Per-connection                 Per-stream

14. Implementation Details (Kernel Perspective)

14.1 Linux TCP Implementation

Linux’s TCP implementation is one of the most sophisticated and widely deployed. Key source files:

14.2 Socket Buffers (sk_buff)

The sk_buff (socket buffer) is the fundamental data structure for network packets in Linux:

struct sk_buff {
    struct sk_buff      *next, *prev;     // linked list
    struct sock         *sk;              // owning socket
    struct net_device   *dev;             // network device
    
    unsigned char       *head;            // start of buffer
    unsigned char       *data;            // start of data
    unsigned char       *tail;            // end of data
    unsigned char       *end;             // end of buffer
    
    unsigned int        len;              // data length
    __u32               priority;
    
    // ... many more fields
    
    unsigned char       cb[48];           // control buffer (TCP uses for tcp_skb_cb)
};
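
The four pointers partition one allocation into headroom, data, and tailroom, which is what lets each layer prepend its header without copying the packet. As a hedged user-space sketch — a toy struct with just the pointer fields, not the real sk_buff, mimicking the kernel's skb_headroom()/skb_push() helpers — the arithmetic looks like:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the sk_buff pointer layout (illustrative, not the kernel struct). */
struct toy_skb {
    unsigned char *head;  /* start of allocated buffer */
    unsigned char *data;  /* start of packet data      */
    unsigned char *tail;  /* end of packet data        */
    unsigned char *end;   /* end of allocated buffer   */
};

/* Space before the data: room to prepend lower-layer headers. */
static size_t toy_headroom(const struct toy_skb *skb) { return skb->data - skb->head; }

/* Space after the data: room to append payload or trailers. */
static size_t toy_tailroom(const struct toy_skb *skb) { return skb->end - skb->tail; }

/* "Push" a header in front of the data, in the spirit of skb_push():
 * headers grow downward into the reserved headroom. */
static unsigned char *toy_push(struct toy_skb *skb, size_t hdrlen) {
    skb->data -= hdrlen;
    return skb->data;
}
```

This is why the kernel reserves headroom up front: TCP fills the payload once, then TCP, IP, and Ethernet each push their header into the headroom with pure pointer arithmetic.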

TCP stores per-segment metadata in the cb field via struct tcp_skb_cb:

struct tcp_skb_cb {
    __u32       seq;                // Starting sequence number
    __u32       end_seq;            // SEQ + FIN + SYN + datalen
    __u32       tcp_tw_isn;         // ISN in TIME_WAIT
    struct {
        __u16   tcp_gso_segs;
        __u16   tcp_gso_size;
    };
    __u8        tcp_flags;          // TCP header flags
    __u8        sacked;             // SACK/FACK state bits
    __u32       ack_seq;            // ACK sequence number
    // ...
};
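
The comment on end_seq encodes a subtle rule: SYN and FIN each consume one sequence number even though they carry no payload. A hedged sketch of that computation (a hypothetical helper, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FLAG_FIN 0x01  /* same bit positions as the TCP header flags */
#define TOY_FLAG_SYN 0x02

/* end_seq = seq + datalen, plus one for SYN and one for FIN,
 * since each of those flags occupies one slot in sequence space. */
static uint32_t toy_end_seq(uint32_t seq, uint32_t datalen, uint8_t flags) {
    return seq + datalen
         + ((flags & TOY_FLAG_SYN) ? 1 : 0)
         + ((flags & TOY_FLAG_FIN) ? 1 : 0);
}
```

This is the reason a bare SYN or FIN can be acknowledged at all: it occupies sequence space, so the cumulative ACK can cover it.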

14.3 TCP Connection Lookup

When a packet arrives, the kernel must find the matching socket. Linux uses a hash table with multiple levels:

  • Established hash (ehash): keyed by the full 4-tuple (source IP, source port, destination IP, destination port); also holds TIME_WAIT sockets
  • Listening hash: keyed by local port, consulted only when no established entry matches (e.g., for an incoming SYN)

The hash table is RCU-protected for lock-free reads, critical for performance on multi-core systems.
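
A hedged user-space sketch of the two-level search order — exact 4-tuple match first, then fallback to a listener on the destination port. The toy direct-mapped tables and XOR hash here are illustrative only; the kernel uses chained buckets and a jhash keyed with a boot-time secret:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 4-tuple key (no padding: 4+4+2+2 bytes, so memcmp is safe). */
struct toy_tuple { uint32_t saddr, daddr; uint16_t sport, dport; };

#define TOY_BUCKETS 16

struct toy_sock { struct toy_tuple t; int listening; int in_use; };
static struct toy_sock ehash[TOY_BUCKETS];  /* established connections */
static struct toy_sock lhash[TOY_BUCKETS];  /* listeners, keyed by local port */

static unsigned toy_ehash(const struct toy_tuple *t) {
    return (t->saddr ^ t->daddr ^ t->sport ^ t->dport) % TOY_BUCKETS;
}

/* Lookup order for an incoming segment: established socket by full
 * 4-tuple, then a listening socket on the destination port (for SYNs). */
static struct toy_sock *toy_lookup(const struct toy_tuple *t) {
    struct toy_sock *s = &ehash[toy_ehash(t)];
    if (s->in_use && memcmp(&s->t, t, sizeof *t) == 0)
        return s;
    s = &lhash[t->dport % TOY_BUCKETS];
    return (s->in_use && s->t.dport == t->dport) ? s : NULL;
}
```

The ordering matters: a data segment for an existing connection must never be delivered to the listener, so the established table always wins.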

14.4 TCP Memory Management

Autotuning: Linux dynamically adjusts socket buffer sizes based on connection characteristics:

  • Receive buffer grows up to tcp_rmem[2] based on observed BDP
  • Send buffer grows up to tcp_wmem[2]
  • Controlled by tcp_moderate_rcvbuf

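The buffer ceiling matters because a connection can only keep the pipe full if the buffer covers the bandwidth-delay product (BDP). A quick worked example in plain integer arithmetic (not kernel code): at 1 Gbit/s and 50 ms RTT, about 6.25 MB must be in flight:

```c
#include <assert.h>
#include <stdint.h>

/* BDP in bytes = (bandwidth in bits/s / 8) * RTT in seconds.
 * RTT is taken in microseconds to stay in integer math. */
static uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec) {
    return bits_per_sec / 8 * rtt_usec / 1000000;
}
```

If tcp_rmem[2] sits below the path's BDP, autotuning hits its cap and throughput is limited to roughly buffer_size / RTT, no matter how fast the link is.
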
14.5 Key Sysctls for TCP Tuning

Sysctl | Default | Description
tcp_window_scaling | 1 | Enable window scaling
tcp_sack | 1 | Enable SACK
tcp_timestamps | 1 | Enable timestamps
tcp_ecn | 2 | ECN: 0=off, 1=request+accept, 2=accept only
tcp_fastopen | 1 | TFO bitmask: 1=client, 2=server
tcp_congestion_control | cubic | Default congestion control algorithm
tcp_slow_start_after_idle | 1 | Reset cwnd after an idle period
tcp_no_metrics_save | 0 | 1 = don’t cache route metrics on close
tcp_max_syn_backlog | 128–1024 | SYN queue size (scales with memory)
somaxconn | 4096 | Accept queue size cap
tcp_synack_retries | 5 | SYN-ACK retransmissions
tcp_syn_retries | 6 | SYN retransmissions
tcp_fin_timeout | 60 | FIN_WAIT_2 timeout (seconds)
tcp_tw_reuse | 2 | TIME_WAIT reuse for outgoing connections
tcp_max_tw_buckets | 262144 | Max TIME_WAIT sockets
tcp_abort_on_overflow | 0 | Send RST on accept queue overflow
tcp_mtu_probing | 0 | Packetization Layer Path MTU Discovery (PLPMTUD)

14.6 Pluggable Congestion Control

Linux allows runtime selection of congestion control algorithms:

# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# Set default
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Per-socket (programmatic)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);

The tcp_congestion_ops structure:

struct tcp_congestion_ops {
    void (*init)(struct sock *sk);
    void (*release)(struct sock *sk);
    u32  (*ssthresh)(struct sock *sk);
    void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
    void (*set_state)(struct sock *sk, u8 new_state);
    void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
    void (*in_ack_event)(struct sock *sk, u32 flags);
    void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
    u32  (*undo_cwnd)(struct sock *sk);
    u32  (*sndbuf_expand)(struct sock *sk);
    // ...
};
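
To make the division of labor concrete, here is a hedged user-space sketch of Reno-style behavior expressed through the two central callbacks, ssthresh and cong_avoid. The toy state counts windows in segments and is nothing like the kernel's implementation; it only shows what each hook is responsible for:

```c
#include <assert.h>
#include <stdint.h>

struct toy_ca {
    uint32_t cwnd;      /* congestion window, in segments */
    uint32_t ssthresh;  /* slow-start threshold, in segments */
    uint32_t cnt;       /* ACK counter for additive increase */
};

/* ssthresh callback: on loss, halve the window (multiplicative decrease),
 * never dropping below 2 segments. */
static uint32_t toy_ssthresh(struct toy_ca *ca) {
    ca->ssthresh = ca->cwnd / 2 > 2 ? ca->cwnd / 2 : 2;
    return ca->ssthresh;
}

/* cong_avoid callback: slow start grows cwnd by 1 per ACK (doubling per
 * RTT); congestion avoidance grows it by 1 per full window of ACKs. */
static void toy_cong_avoid(struct toy_ca *ca, uint32_t acked) {
    while (acked--) {
        if (ca->cwnd < ca->ssthresh) {
            ca->cwnd++;                 /* slow start */
        } else if (++ca->cnt >= ca->cwnd) {
            ca->cwnd++;                 /* additive increase */
            ca->cnt = 0;
        }
    }
}
```

A real module fills a tcp_congestion_ops with pointers to functions like these and registers it; the core stack decides *when* to call them, the module decides *how* cwnd moves.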

14.7 GRO/GSO Pipeline

GSO (Generic Segmentation Offload): The kernel builds a “super-segment” of up to 64KB and passes it down the stack as a single unit. If the NIC supports TSO, it’s segmented in hardware; if not, GSO segments it in software just before transmission — after routing and netfilter decisions, so the per-packet processing cost through the stack is paid only once.

GRO (Generic Receive Offload): The mirror image on the receive path — the kernel coalesces consecutive in-order segments of a flow into one large segment before handing it up the stack, again amortizing per-packet overhead.
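
The win is arithmetic: a 64KB super-segment at a typical 1448-byte MSS replaces roughly 45 trips through the stack with one. A hedged sketch of the split (toy math only, not the kernel's tcp_gso_segment()):

```c
#include <assert.h>
#include <stdint.h>

/* Number of wire segments a GSO super-segment becomes: ceil(len / mss).
 * Each resulting segment gets a copy of the TCP/IP headers, with the
 * sequence number advanced by mss per segment. */
static uint32_t gso_nsegs(uint32_t payload_len, uint32_t mss) {
    return (payload_len + mss - 1) / mss;
}
```
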

15. Conclusion

TCP is a remarkably sophisticated protocol that has evolved over four decades while maintaining backward compatibility. Its key architectural decisions — reliable byte-stream abstraction, end-to-end principle, multiplicative-decrease congestion control, three-way handshake — have proven durable.

Key Takeaways

  1. TCP is a living protocol: From RFC 793 (1981) to RFC 9293 (2022), with continuous evolution via new congestion control algorithms (CUBIC, BBR), loss recovery mechanisms (RACK, TLP, PRR), and extensions (MPTCP, TFO).
  2. Congestion control is the heart of TCP: The shift from loss-based (Tahoe/Reno) to model-based (BBR) represents a fundamental rethinking, though loss-based algorithms (CUBIC) remain dominant.
  3. The tension between TCP and modern requirements: TCP’s in-order delivery causes head-of-line blocking. Its kernel-space implementation makes iteration slow. Its header ossification (middlebox interference) limits extensibility. These motivate QUIC.
  4. TCP still dominates: Despite QUIC’s growth, TCP carries the majority of Internet traffic and will continue to do so for decades given its deep integration into every operating system, network device, and application.
  5. Tuning matters: Understanding TCP internals — buffer sizing, congestion control selection, option negotiation, timer configuration — is essential for achieving optimal performance in specific environments (data centers, WANs, lossy links).

Further Reading

  • RFC 9293 — TCP specification (consolidation of RFC 793 and others)
  • RFC 5681 — TCP Congestion Control
  • RFC 7323 — TCP Extensions for High Performance
  • RFC 8312 — CUBIC
  • RFC 8985 — RACK-TLP
  • RFC 9000 — QUIC (for comparison)
  • “TCP/IP Illustrated, Volume 1” by W. Richard Stevens (updated by Kevin Fall)
  • “Computer Networking: A Top-Down Approach” by Kurose & Ross
  • Linux kernel source: net/ipv4/tcp*.c
  • Neal Cardwell et al., “BBR: Congestion-Based Congestion Control” (ACM Queue, 2016)