Hash Tables - Emanuel's Notes

#datastructures #algorithms A hash table (sometimes called Hash map), is a data structure that allows us to store key-value pairs of data. These key-value pairs are stored in an array of buckets from which we can retrieve the desired data. The hash table uses a **hash function** to determine what index in this array of buckets to retrieve the data from. During this lookup process the hash function hashes the key, which in turns becomes an index into this array that allows us to return the data. A perfect hash function would be one that no matter the input the output will always be unique. Most hash table implementations however implement imperfect hashing functions, therefore additional measures must be taken to handle **collision**, that instances where to different keys produce the same hashed index. Python's dictionaries are built-in hash tables. ```python d = {"A": 1, "B": 2, "C": 3} print(d["A"]) ``` A well implemented hash table has `O(1)` runtime for insertion, deletion and lookup. Let's implement a Hash Table in Python. ## Implementation ### Base Data Structure Here is the basic implementation of simple hash table that maps string keys to its values. The most important part of this simple hash table is the hashing function. The hashing process takes the order of each character in the key and sums them up. We then take the mod of this to map it to an index on the underlying array. ```python class HashMap: def __init__(self, size): self.hashmap = [None for i in range(size)] def key_to_index(self, key): return sum(ord(i) for i in key) % len(self.hashmap) def __repr__(self): buckets = [] for v in self.hashmap: if v != None: buckets.append(v) return str(buckets) ``` ### Insert To insert a value all we need to do is hash the key to get an index, then use this index to store the tuple `(key, value)` at that position. ```python class HashMap: def __init__(self, size): self.hashmap = [None for i in range(size)] def insert(self, key, value): idx = self.key_to_index(key) self.hashmap[idx] = (key, value) def key_to_index(self, key): return sum(ord(i) for i in key) % len(self.hashmap) def __repr__(self): buckets = [] for v in self.hashmap: if v != None: buckets.append(v) return str(buckets) ``` ### Retrieving data To retrieve data we simply need to hash, then grab the index from the array. In this case when the key does not exist, we return a useful message to the user. ```python def get(self, key): idx = self.key_to_index(key) val = self.hashmap[idx] if val is None: raise Exception("sorry, key not found") else: return val[1] ``` ### Optimizing underlying array The underlying array for our hash table so far has been set at the initialization of the the data structure. We would like to instead allow the user to set an initial size but as needed have the underlying hash table array grow to accommodate more and more data. The rules are very simple, we first calculate the proportion of the array used, if this is more than 5% then we create a new underlying array 10 times the size of the current one, and copy over all the elements to this one. The implementation is shown below. ```python def insert(self, key, value): self.resize() # print(f"inserting the {key}->{value}") index = self.key_to_index(key) self.hashmap[index] = (key, value) # print(f"the hashmap after insert: {self.hashmap}") def resize(self): if len(self.hashmap) == 0: self.hashmap.append(None) load = self.current_load() if load >= .05: # print(f"the load is: {load} therefore a resize is required") new_size = 10 * len(self.hashmap) old_hashmap_kv = [i for i in self.hashmap if i is not None] self.hashmap = [None for i in range(new_size)] for i in old_hashmap_kv: self.insert(i[0], i[1]) def current_load(self): if len(self.hashmap) == 0: return 0 count = 0 for i in self.hashmap: if i is not None: count += 1 return (count/len(self.hashmap)) ```