Lesson 2: Hash-Based & Sorted Data Structures
Master non-linear data structures, hashing fundamentals, and their Big O tradeoffs.
How Hashing Works in .NET
Before diving into Dictionary and HashSet, it's essential to understand the hashing mechanism that powers them. Every object in .NET has a GetHashCode() method (inherited from System.Object). When you store a key in a hash-based collection, the runtime calls GetHashCode() to compute a numeric hash, which is then mapped to a "bucket" — a slot in an internal array.
Retrieving an element means: compute the hash, find the bucket, then compare with Equals() to locate the exact item. Because this lookup doesn't depend on how many elements are in the collection (it goes directly to the right bucket), it runs in O(1) average time.
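The bucket mapping can be sketched in a few lines. This is an illustration of the idea only, not the actual `Dictionary<TKey, TValue>` implementation (which uses prime-sized bucket arrays, stored hash codes, and collision chains internally); `bucketCount` is a made-up value:

```csharp
using System;

string key = "Alice";
int bucketCount = 16; // hypothetical internal array size

// 1. Compute the hash code for the key.
int hash = key.GetHashCode();

// 2. Map the hash to a bucket index (clear the sign bit so the index is non-negative).
int bucket = (hash & 0x7FFFFFFF) % bucketCount;

// The index is always within the bucket array, no matter how many items are
// stored — which is why lookup cost does not grow with the element count.
Console.WriteLine(bucket >= 0 && bucket < bucketCount); // True
```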
The Contract: GetHashCode() and Equals()
If you use custom objects as dictionary keys or in hash sets, you must correctly implement both GetHashCode() and Equals(). The rules are:
- If `a.Equals(b)` is `true`, then `a.GetHashCode()` must equal `b.GetHashCode()`.
- The reverse is not required — two unequal objects may share the same hash code (collisions are allowed).
- `GetHashCode()` must be consistent: calling it multiple times on the same object should return the same value (as long as the object hasn't changed).
- Mutable objects used as keys are dangerous — if you change a field that affects the hash code after inserting the object, the dictionary can never find it again.
```csharp
public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        return obj is Employee other && Id == other.Id;
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode(); // Consistent with Equals
    }
}
```
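The mutable-key danger mentioned above is easy to demonstrate. `MutableKey` below is a deliberately bad, hypothetical type whose hash code depends on a settable property:

```csharp
using System;
using System.Collections.Generic;

var key = new MutableKey { Id = 1 };
var map = new Dictionary<MutableKey, string> { [key] = "value" };

key.Id = 2; // hash code changes — the stored entry now sits in the "wrong" bucket

Console.WriteLine(map.ContainsKey(key));                       // False
Console.WriteLine(map.ContainsKey(new MutableKey { Id = 1 })); // False — entry is lost

// Hypothetical type: hash code depends on mutable state (don't do this)
public class MutableKey
{
    public int Id { get; set; }
    public override bool Equals(object obj) => obj is MutableKey o && Id == o.Id;
    public override int GetHashCode() => Id.GetHashCode();
}
```

The entry still occupies memory, but no key can ever reach it again: the mutated key hashes to a different bucket, and a fresh key with the old value fails the `Equals` check against the mutated stored key.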
`record` types automatically generate correct `Equals()` and `GetHashCode()` implementations based on all their properties. Using a `record` for dictionary keys is a clean, safe approach.
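A minimal sketch of a `record` used as a key (`Point` is an illustrative name, not a framework type):

```csharp
using System;
using System.Collections.Generic;

var visited = new Dictionary<Point, string>();
visited[new Point(1, 2)] = "start";

// A structurally equal record instance finds the same entry —
// no hand-written Equals/GetHashCode required.
Console.WriteLine(visited[new Point(1, 2)]); // start

// Positional record: value equality and hash code are compiler-generated
public record Point(int X, int Y);
```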
Dictionary<TKey, TValue>
The workhorse of key-value storage in C#. It maps a unique key to a value using hashing. It is unordered — you cannot rely on the order of enumeration.
Core Operations and Complexity
| Operation | Complexity | Notes |
|---|---|---|
| `Add(key, value)` | O(1) | Amortized; O(n) on resize |
| `[key]` (get/set) | O(1) | Get throws `KeyNotFoundException` on a missing key; set adds or overwrites |
| `ContainsKey(key)` | O(1) | Checks existence without retrieving the value |
| `ContainsValue(value)` | O(n) | Must scan all values (no hash on values) |
| `Remove(key)` | O(1) | Removes by key |
| `TryGetValue(key, out value)` | O(1) | Safe lookup — no exception on missing key |
Best Practices
```csharp
// ALWAYS prefer TryGetValue over ContainsKey + indexer

// BAD: Two hash lookups
if (userMap.ContainsKey(userId))
{
    var user = userMap[userId]; // Second lookup!
}

// GOOD: Single hash lookup
if (userMap.TryGetValue(userId, out var user))
{
    // Use 'user' directly
}

// Pre-define capacity when you know the size
var lookup = new Dictionary<string, int>(50000);
```
Why prefer `TryGetValue` instead of checking `ContainsKey` and then accessing the indexer? Because `ContainsKey` + `[key]` performs two hash lookups, while `TryGetValue` does it in one. In hot loops, this doubles your hash computation cost unnecessarily.
Initializer Syntax and GetValueOrDefault
```csharp
// Index initializer syntax (C# 6+)
var capitals = new Dictionary<string, string>
{
    ["France"] = "Paris",
    ["Japan"] = "Tokyo",
    ["Brazil"] = "Brasília"
};

// GetValueOrDefault (since .NET Core 2.0 / .NET Standard 2.1)
string capital = capitals.GetValueOrDefault("Italy", "Unknown");
```
HashSet<T>
HashSet<T> is a collection of unique elements. Think of it as a Dictionary where you only care about the keys. It uses the same hashing mechanism, giving it O(1) for Add, Contains, and Remove.
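A quick illustration of the uniqueness guarantee, using `Add`'s boolean return value:

```csharp
using System;
using System.Collections.Generic;

var seen = new HashSet<int>();

Console.WriteLine(seen.Add(42));      // True — newly added
Console.WriteLine(seen.Add(42));      // False — duplicate rejected, O(1)
Console.WriteLine(seen.Contains(42)); // True
Console.WriteLine(seen.Count);        // 1

// Common use: O(n) deduplication of a sequence
int[] data = { 1, 2, 2, 3, 3, 3 };
var unique = new HashSet<int>(data); // {1, 2, 3}
```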
Set Operations
What makes HashSet<T> truly powerful are its mathematical set operations:
| Method | Description | Math Equivalent |
|---|---|---|
| `UnionWith(other)` | Adds all elements from `other` | A ∪ B |
| `IntersectWith(other)` | Keeps only elements also in `other` | A ∩ B |
| `ExceptWith(other)` | Removes elements that are in `other` | A \ B |
| `SymmetricExceptWith(other)` | Keeps elements in one set but not both | A △ B |
| `IsSubsetOf(other)` | Checks if all elements are in `other` | A ⊆ B |
| `IsSupersetOf(other)` | Checks if all elements of `other` are here | A ⊇ B |
| `Overlaps(other)` | Checks if any element is shared | A ∩ B ≠ ∅ |
```csharp
HashSet<string> teamA = new HashSet<string> { "Alice", "Bob", "Charlie" };
HashSet<string> teamB = new HashSet<string> { "Bob", "Diana", "Eve" };

// Who is on BOTH teams?
var both = new HashSet<string>(teamA);
both.IntersectWith(teamB); // {"Bob"}

// Who is on team A but NOT team B?
var onlyA = new HashSet<string>(teamA);
onlyA.ExceptWith(teamB); // {"Alice", "Charlie"}

// Case-insensitive set
var uniqueNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
uniqueNames.Add("Alice");
uniqueNames.Add("ALICE"); // Returns false — already exists
```
Sorted Collections
When you need your data to stay in sorted order at all times, .NET provides several options. They give up the O(1) performance of hash-based structures, accepting O(log n) operations in exchange for ordering.
SortedList<TKey, TValue> vs. SortedDictionary<TKey, TValue>
Both keep keys sorted, but their internal implementations differ significantly:
| Feature | SortedList | SortedDictionary |
|---|---|---|
| Internal structure | Two parallel arrays (keys + values) | Red-black tree |
| Memory usage | Lower (contiguous arrays) | Higher (tree node overhead) |
| Insert / Remove | O(n) (array shifting) | O(log n) |
| Lookup by key | O(log n) (binary search) | O(log n) |
| Access by index | O(1) via .Keys[i] | Not supported |
| Best for | Data loaded once, queried often | Frequent insertions/removals |
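To contrast with the `SortedList` usage below, here is a short `SortedDictionary` sketch (the keys and values are illustrative):

```csharp
using System;
using System.Collections.Generic;

// SortedDictionary: O(log n) insert/remove, always enumerates in key order
var prices = new SortedDictionary<string, decimal>();
prices["banana"] = 0.5m;
prices["apple"] = 1.2m;
prices["cherry"] = 3.0m;

// Enumeration order: apple, banana, cherry — regardless of insertion order
foreach (var kvp in prices)
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");

// Unlike SortedList, there is no indexed access (no prices.Keys[0]),
// but each insert or removal stays O(log n) instead of O(n).
```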
```csharp
// SortedList — keeps entries sorted by key
var scores = new SortedList<string, int>();
scores.Add("Charlie", 92);
scores.Add("Alice", 88);
scores.Add("Bob", 95);

// Iterating produces: Alice=88, Bob=95, Charlie=92
foreach (var kvp in scores)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");
}

// Access by index (unique to SortedList)
string firstKey = scores.Keys[0]; // "Alice"
```
SortedSet<T>
The sorted counterpart of HashSet<T>. It keeps elements in sorted order using a red-black tree. It supports the same set operations (UnionWith, IntersectWith, etc.) and also offers GetViewBetween(lower, upper) to get a live view of a range of elements.
```csharp
var sortedNums = new SortedSet<int> { 50, 10, 30, 20, 40 };
// Iteration order: 10, 20, 30, 40, 50

// Get a live range view (20 to 40 inclusive)
var range = sortedNums.GetViewBetween(20, 40);
// range: {20, 30, 40}
// If you add 25 to sortedNums, range automatically includes it
```
Decision Guide: Which Structure to Use?
- Need key → value mapping? → Dictionary or SortedDictionary.
- Just unique values? → HashSet or SortedSet.
- Need ordering? → Use the "Sorted" variant.
- Need max speed? → Use the hash-based variant.
- Need indexed access to sorted data? → SortedList.
Bonus: ConcurrentDictionary<TKey, TValue>
In multi-threaded scenarios, a regular Dictionary is not thread-safe. ConcurrentDictionary (in System.Collections.Concurrent) uses fine-grained locking (striped locks across buckets) to allow safe concurrent reads and writes without locking the entire collection.
Key methods:
- `TryAdd(key, value)` — Adds if the key doesn't exist.
- `TryRemove(key, out value)` — Atomically removes and returns.
- `GetOrAdd(key, valueFactory)` — Returns the existing value or computes and adds a new one. The dictionary state is thread-safe, but the factory delegate may execute multiple times under contention (only one result is stored). If the factory has side effects (e.g., opening a connection), use `Lazy<T>` as the value type.
- `AddOrUpdate(key, addValue, updateFactory)` — Adds or updates depending on key existence.
```csharp
var cache = new ConcurrentDictionary<string, int>();

// Thread-safe "get or compute" pattern
int count = cache.GetOrAdd("pageViews", key => LoadFromDatabase(key));

// Atomic increment
cache.AddOrUpdate("pageViews", 1, (key, old) => old + 1);

// Safe pattern: Lazy<T> ensures the factory runs only once
var safeCache = new ConcurrentDictionary<string, Lazy<ExpensiveObject>>();
var value = safeCache.GetOrAdd(
    "key",
    k => new Lazy<ExpensiveObject>(() => new ExpensiveObject(k))
).Value;
```
GetOrAdd and AddOrUpdate factory delegates may execute more than once under thread contention. Only one result is stored, but if your factory allocates resources (database connections, file handles), you can leak the duplicates. The standard fix is to wrap the value in Lazy<T> so the expensive work only happens once, when .Value is accessed.
Coding Challenge
Given an array of integers with many duplicates, write a function that returns a Dictionary<int, int> where the key is the number and the value is its frequency. Do this in O(n) time.
Bonus: Make the function generic so it works with any IEnumerable<T>.
View Solution
```csharp
using System.Runtime.InteropServices; // CollectionsMarshal, for the bonus

// Basic solution — int-specific
public static Dictionary<int, int> CountFrequencies(int[] numbers)
{
    var freq = new Dictionary<int, int>(numbers.Length);
    foreach (int num in numbers)
    {
        // TryGetValue: single lookup instead of ContainsKey + indexer
        if (freq.TryGetValue(num, out int count))
        {
            freq[num] = count + 1;
        }
        else
        {
            freq[num] = 1;
        }
    }
    return freq;
}

// BONUS: Generic version using CollectionsMarshal (since .NET 6)
public static Dictionary<T, int> CountFrequencies<T>(
    IEnumerable<T> items) where T : notnull
{
    var freq = new Dictionary<T, int>();
    foreach (T item in items)
    {
        // GetValueRefOrAddDefault returns a ref to the value slot
        ref int count = ref CollectionsMarshal
            .GetValueRefOrAddDefault(freq, item, out _);
        count++;
    }
    return freq;
}

// Usage:
int[] data = { 1, 3, 2, 3, 1, 1, 4, 3 };
var result = CountFrequencies(data);
// {1: 3, 3: 3, 2: 1, 4: 1}
```
Analysis: We iterate once through the array, performing one O(1) dictionary lookup per element. Total: O(n). Pre-allocating the dictionary capacity avoids internal resizing. The bonus solution uses CollectionsMarshal.GetValueRefOrAddDefault for a single-lookup increment — the fastest possible pattern in .NET 6+.