Lesson 2: Hash-Based & Sorted Data Structures
Master non-linear data structures, hashing fundamentals, and their Big O tradeoffs.
How Hashing Works in .NET
Before diving into Dictionary and HashSet, it's essential to understand the hashing mechanism that powers them. Every object in .NET has a GetHashCode() method (inherited from System.Object). When you store a key in a hash-based collection, the runtime calls GetHashCode() to compute a numeric hash, which is then mapped to a "bucket" — a slot in an internal array.
Retrieving an element means: compute the hash, find the bucket, then compare with Equals() to locate the exact item. Because this lookup doesn't depend on how many elements are in the collection (it goes directly to the right bucket), it runs in O(1) average time.
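The bucket mapping can be sketched in a few lines. This is an illustration of the idea only, not the actual `Dictionary<TKey, TValue>` implementation (which uses prime-sized bucket arrays, stored hash codes, and collision chains internally); `bucketCount` is a made-up value:

```csharp
using System;

string key = "Alice";
int bucketCount = 16; // hypothetical internal array size

// 1. Compute the hash code for the key.
int hash = key.GetHashCode();

// 2. Map the hash to a bucket index (clear the sign bit so the index is non-negative).
int bucket = (hash & 0x7FFFFFFF) % bucketCount;

// The index is always within the bucket array, no matter how many items are
// stored — which is why lookup cost does not grow with the element count.
Console.WriteLine(bucket >= 0 && bucket < bucketCount); // True
```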
The Contract: GetHashCode() and Equals()
If you use custom objects as dictionary keys or in hash sets, you must correctly implement both GetHashCode() and Equals(). The rules are:
- If `a.Equals(b)` is `true`, then `a.GetHashCode()` must equal `b.GetHashCode()`.
- The reverse is not required — two unequal objects may share the same hash code (collisions are allowed).
- `GetHashCode()` must be consistent: calling it multiple times on the same object should return the same value (as long as the object hasn't changed).
- Mutable objects used as keys are dangerous — if you change a field that affects the hash code after inserting the object, the dictionary can never find it again.
```csharp
public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        return obj is Employee other && Id == other.Id;
    }

    public override int GetHashCode()
    {
        return Id.GetHashCode(); // Consistent with Equals
    }
}
```
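The mutable-key danger mentioned above is easy to demonstrate. `MutableKey` below is a deliberately bad, hypothetical type whose hash code depends on a settable property:

```csharp
using System;
using System.Collections.Generic;

var key = new MutableKey { Id = 1 };
var map = new Dictionary<MutableKey, string> { [key] = "value" };

key.Id = 2; // hash code changes — the stored entry now sits in the "wrong" bucket

Console.WriteLine(map.ContainsKey(key));                       // False
Console.WriteLine(map.ContainsKey(new MutableKey { Id = 1 })); // False — entry is lost

// Hypothetical type: hash code depends on mutable state (don't do this)
public class MutableKey
{
    public int Id { get; set; }
    public override bool Equals(object obj) => obj is MutableKey o && Id == o.Id;
    public override int GetHashCode() => Id.GetHashCode();
}
```

The entry still occupies memory, but no key can ever reach it again: the mutated key hashes to a different bucket, and a fresh key with the old value fails the `Equals` check against the mutated stored key.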
`record` types automatically generate correct `Equals()` and `GetHashCode()` implementations based on all their properties. Using a `record` for dictionary keys is a clean, safe approach.
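A minimal sketch of a `record` used as a key (`Point` is an illustrative name, not a framework type):

```csharp
using System;
using System.Collections.Generic;

var visited = new Dictionary<Point, string>();
visited[new Point(1, 2)] = "start";

// A structurally equal record instance finds the same entry —
// no hand-written Equals/GetHashCode required.
Console.WriteLine(visited[new Point(1, 2)]); // start

// Positional record: value equality and hash code are compiler-generated
public record Point(int X, int Y);
```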
Dictionary<TKey, TValue>
The workhorse of key-value storage in C#. It maps a unique key to a value using hashing. It is unordered — you cannot rely on the order of enumeration.
Core Operations and Complexity
| Operation | Complexity | Notes |
|---|---|---|
| `Add(key, value)` | O(1) | Amortized; O(n) on resize |
| `[key]` (get/set) | O(1) | Get throws `KeyNotFoundException` on a missing key; set adds or overwrites |
| `ContainsKey(key)` | O(1) | Checks existence without retrieving the value |
| `ContainsValue(value)` | O(n) | Must scan all values (no hash on values) |
| `Remove(key)` | O(1) | Removes by key |
| `TryGetValue(key, out value)` | O(1) | Safe lookup — no exception on missing key |
Best Practices
```csharp
// ALWAYS prefer TryGetValue over ContainsKey + indexer

// BAD: Two hash lookups
if (userMap.ContainsKey(userId))
{
    var user = userMap[userId]; // Second lookup!
}

// GOOD: Single hash lookup
if (userMap.TryGetValue(userId, out var user))
{
    // Use 'user' directly
}

// Pre-define capacity when you know the size
var lookup = new Dictionary<string, int>(50000);
```
Why prefer `TryGetValue` instead of checking `ContainsKey` and then accessing the indexer? Because `ContainsKey` + `[key]` performs two hash lookups, while `TryGetValue` does it in one. In hot loops, this doubles your hash computation cost unnecessarily.
Initializer Syntax and GetValueOrDefault
```csharp
// Index initializer syntax (C# 6+)
var capitals = new Dictionary<string, string>
{
    ["France"] = "Paris",
    ["Japan"] = "Tokyo",
    ["Brazil"] = "Brasília"
};

// GetValueOrDefault (since .NET Core 2.0 / .NET Standard 2.1)
string capital = capitals.GetValueOrDefault("Italy", "Unknown");
```
HashSet<T>
HashSet<T> is a collection of unique elements. Think of it as a Dictionary where you only care about the keys. It uses the same hashing mechanism, giving it O(1) for Add, Contains, and Remove.
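A quick illustration of the uniqueness guarantee, using `Add`'s boolean return value:

```csharp
using System;
using System.Collections.Generic;

var seen = new HashSet<int>();

Console.WriteLine(seen.Add(42));      // True — newly added
Console.WriteLine(seen.Add(42));      // False — duplicate rejected, O(1)
Console.WriteLine(seen.Contains(42)); // True
Console.WriteLine(seen.Count);        // 1

// Common use: O(n) deduplication of a sequence
int[] data = { 1, 2, 2, 3, 3, 3 };
var unique = new HashSet<int>(data); // {1, 2, 3}
```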
Set Operations
What makes HashSet<T> truly powerful are its mathematical set operations:
| Method | Description | Math Equivalent |
|---|---|---|
| `UnionWith(other)` | Adds all elements from `other` | A ∪ B |
| `IntersectWith(other)` | Keeps only elements also in `other` | A ∩ B |
| `ExceptWith(other)` | Removes elements that are in `other` | A \ B |
| `SymmetricExceptWith(other)` | Keeps elements in one set but not both | A △ B |
| `IsSubsetOf(other)` | Checks if all elements are in `other` | A ⊆ B |
| `IsSupersetOf(other)` | Checks if all elements of `other` are here | A ⊇ B |
| `Overlaps(other)` | Checks if any element is shared | A ∩ B ≠ ∅ |
```csharp
HashSet<string> teamA = new HashSet<string> { "Alice", "Bob", "Charlie" };
HashSet<string> teamB = new HashSet<string> { "Bob", "Diana", "Eve" };

// Who is on BOTH teams?
var both = new HashSet<string>(teamA);
both.IntersectWith(teamB); // {"Bob"}

// Who is on team A but NOT team B?
var onlyA = new HashSet<string>(teamA);
onlyA.ExceptWith(teamB); // {"Alice", "Charlie"}

// Case-insensitive set
var uniqueNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
uniqueNames.Add("Alice");
uniqueNames.Add("ALICE"); // Returns false — already exists
```
Sorted Collections
When you need your data to stay in sorted order at all times, .NET provides several options. They give up the O(1) performance of hash-based structures, accepting O(log n) operations in exchange for ordering.
SortedList<TKey, TValue> vs. SortedDictionary<TKey, TValue>
Both keep keys sorted, but their internal implementations differ significantly:
| Feature | SortedList | SortedDictionary |
|---|---|---|
| Internal structure | Two parallel arrays (keys + values) | Red-black tree |
| Memory usage | Lower (contiguous arrays) | Higher (tree node overhead) |
| Insert / Remove | O(n) (array shifting) | O(log n) |
| Lookup by key | O(log n) (binary search) | O(log n) |
| Access by index | O(1) via .Keys[i] | Not supported |
| Best for | Data loaded once, queried often | Frequent insertions/removals |
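To contrast with the `SortedList` usage below, here is a short `SortedDictionary` sketch (the keys and values are illustrative):

```csharp
using System;
using System.Collections.Generic;

// SortedDictionary: O(log n) insert/remove, always enumerates in key order
var prices = new SortedDictionary<string, decimal>();
prices["banana"] = 0.5m;
prices["apple"] = 1.2m;
prices["cherry"] = 3.0m;

// Enumeration order: apple, banana, cherry — regardless of insertion order
foreach (var kvp in prices)
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");

// Unlike SortedList, there is no indexed access (no prices.Keys[0]),
// but each insert or removal stays O(log n) instead of O(n).
```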
```csharp
// SortedList — keeps entries sorted by key
var scores = new SortedList<string, int>();
scores.Add("Charlie", 92);
scores.Add("Alice", 88);
scores.Add("Bob", 95);

// Iterating produces: Alice=88, Bob=95, Charlie=92
foreach (var kvp in scores)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");
}

// Access by index (unique to SortedList)
string firstKey = scores.Keys[0]; // "Alice"
```
SortedSet<T>
The sorted counterpart of HashSet<T>. It keeps elements in sorted order using a red-black tree. It supports the same set operations (UnionWith, IntersectWith, etc.) and also offers GetViewBetween(lower, upper) to get a live view of a range of elements.
```csharp
var sortedNums = new SortedSet<int> { 50, 10, 30, 20, 40 };
// Iteration order: 10, 20, 30, 40, 50

// Get a live range view (20 to 40 inclusive)
var range = sortedNums.GetViewBetween(20, 40);
// range: {20, 30, 40}
// If you add 25 to sortedNums, range automatically includes it
```
Decision Guide: Which Structure to Use?
- Need key → value mapping? → Dictionary or SortedDictionary.
- Just unique values? → HashSet or SortedSet.
- Need ordering? → Use the "Sorted" variant.
- Need max speed? → Use the hash-based variant.
- Need indexed access to sorted data? → SortedList.
Bonus: ConcurrentDictionary<TKey, TValue>
In multi-threaded scenarios, a regular Dictionary is not thread-safe. ConcurrentDictionary (in System.Collections.Concurrent) uses fine-grained locking (striped locks across buckets) to allow safe concurrent reads and writes without locking the entire collection.
Key methods:
- `TryAdd(key, value)` — Adds if the key doesn't exist.
- `TryRemove(key, out value)` — Atomically removes and returns.
- `GetOrAdd(key, valueFactory)` — Returns the existing value or computes and adds a new one. The dictionary state is thread-safe, but the factory delegate may execute multiple times under contention (only one result is stored). If the factory has side effects (e.g., opening a connection), use `Lazy<T>` as the value type.
- `AddOrUpdate(key, addValue, updateFactory)` — Adds or updates depending on key existence.
```csharp
var cache = new ConcurrentDictionary<string, int>();

// Thread-safe "get or compute" pattern
int count = cache.GetOrAdd("pageViews", key => LoadFromDatabase(key));

// Atomic increment
cache.AddOrUpdate("pageViews", 1, (key, old) => old + 1);

// Safe pattern: Lazy<T> ensures the factory runs only once
var safeCache = new ConcurrentDictionary<string, Lazy<ExpensiveObject>>();
var value = safeCache.GetOrAdd(
    "key",
    k => new Lazy<ExpensiveObject>(() => new ExpensiveObject(k))
).Value;
```
GetOrAdd and AddOrUpdate factory delegates may execute more than once under thread contention. Only one result is stored, but if your factory allocates resources (database connections, file handles), you can leak the duplicates. The standard fix is to wrap the value in Lazy<T> so the expensive work only happens once, when .Value is accessed.
Coding Challenge
Given an array of integers with many duplicates, write a function that returns a Dictionary<int, int> where the key is the number and the value is its frequency. Do this in O(n) time.
Bonus: Make the function generic so it works with any IEnumerable<T>.
View Solution
```csharp
using System.Runtime.InteropServices; // CollectionsMarshal, for the bonus

// Basic solution — int-specific
public static Dictionary<int, int> CountFrequencies(int[] numbers)
{
    var freq = new Dictionary<int, int>(numbers.Length);
    foreach (int num in numbers)
    {
        // TryGetValue: single lookup instead of ContainsKey + indexer
        if (freq.TryGetValue(num, out int count))
        {
            freq[num] = count + 1;
        }
        else
        {
            freq[num] = 1;
        }
    }
    return freq;
}

// BONUS: Generic version using CollectionsMarshal (since .NET 6)
public static Dictionary<T, int> CountFrequencies<T>(
    IEnumerable<T> items) where T : notnull
{
    var freq = new Dictionary<T, int>();
    foreach (T item in items)
    {
        // GetValueRefOrAddDefault returns a ref to the value slot
        ref int count = ref CollectionsMarshal
            .GetValueRefOrAddDefault(freq, item, out _);
        count++;
    }
    return freq;
}

// Usage:
int[] data = { 1, 3, 2, 3, 1, 1, 4, 3 };
var result = CountFrequencies(data);
// {1: 3, 3: 3, 2: 1, 4: 1}
```
Analysis: We iterate once through the array, performing one O(1) dictionary lookup per element. Total: O(n). Pre-allocating the dictionary capacity avoids internal resizing. The bonus solution uses CollectionsMarshal.GetValueRefOrAddDefault for a single-lookup increment — the fastest possible pattern in .NET 6+.