Talk:Hash table
This is the talk page for discussing improvements to the Hash table article. This is not a forum for general discussion of the article's subject.
Unfinished question
Problem 1: The Hash Table will be used for storage of student records. You may assume the maximum number of items will not exceed 240. An item corresponds to a student record. A key is taken to be an ASCII character string of maximum length 32 containing the name of the student. The data is taken to be the 9-digit id and the name of the state of residence of the student in India. A list node contains an Item and a reference to the next node. You will need to implement the classes Item and ListNode with constructors and appropriate data access operations.

2. Write a main() program to test your class HashTable.

3. Input: The name of the input file, containing data items (student records) and data operations in the format given below, and the name of the output file, will be passed to your program on the command line or interactively.

Data Items: student-id, student-name, state-of-residence, one per line. Example: 200412018, Swati Mishra, Uttar Pradesh

Data Operations: <operation> <item>, one per line. The <operation> is s, i, or d for search, insert and delete, respectively. The item will have the fields student-name, student-id, and state-of-residence. The fields of an item will be separated by commas, and the operation will be separated from the item by a colon ":" and a space. Example: s: 200211001, Akash Gokhale, Maharashtra

The data items will be separated from the data operations by a blank line.

4. Output: Your program is to read the input file and populate the Hash Table with student records by repeated Insert() operations, and then print to the output file the size of the Hash Table and the size of each linked list in the Hash Table. Then it will continue to read each line and execute the appropriate data operations. Following each Insert()/Delete() operation, it will output the size of the Hash Table and the size of each linked list, and for each Search operation, it will output
-- 220.225.53.35 09:18, 11 October 2006
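For anyone who lands on this unfinished question: a minimal sketch of the structures it describes (separate chaining with Item and ListNode classes). The table size of 240 slots and the character-sum hash of the student name are assumptions for illustration, not requirements stated by the question.

```python
# Sketch of a chained hash table for the student-record exercise above.
# Assumed details: 240 buckets, student name as the key, character-sum hash.
class Item:
    def __init__(self, student_id, name, state):
        self.student_id = student_id
        self.name = name
        self.state = state

class ListNode:
    def __init__(self, item, next_node=None):
        self.item = item
        self.next = next_node

class HashTable:
    def __init__(self, size=240):
        self.size = size
        self.buckets = [None] * size

    def _index(self, key):
        # Sum of character codes modulo the table size; any string hash works.
        return sum(ord(c) for c in key) % self.size

    def insert(self, item):
        i = self._index(item.name)
        self.buckets[i] = ListNode(item, self.buckets[i])

    def search(self, key):
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.item.name == key:
                return node.item
            node = node.next
        return None

    def delete(self, key):
        i = self._index(key)
        node, prev = self.buckets[i], None
        while node is not None:
            if node.item.name == key:
                if prev is None:
                    self.buckets[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False

    def chain_lengths(self):
        # The lengths the exercise asks to print after each operation.
        lengths = []
        for head in self.buckets:
            n, node = 0, head
            while node is not None:
                n, node = n + 1, node.next
            lengths.append(n)
        return lengths
```

Parsing the input file and dispatching the s/i/d operations is left out; the sketch only covers the data structures themselves.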
Lead
The lead says: Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions... Such collisions must be accommodated in some way.
We don't live in an ideal world, so starting out with a statement about the ideal is supercilious. The implication is that if a hash function is 'imperfect', like an assembly line making imperfect parts, why don't we fix it so it's 'perfect'? An imperfect hash function might cause hash collisions, but the sense of this statement is vague, and begs the question: why don't we build one that 'might not' cause hash collisions? Easy enough to postulate, eh? We might lead a novice into actually attempting to do that. I wish him luck.
How about this in place of the whole paragraph: Hash functions are designed to be efficient to compute and minimize collisions, duplicate hash codes resulting from distinct keys. 19 words versus 48. And it's very informative. The lead is not the place to go into what happens in a collision: we may assume, and indeed we find, a whole section talking about just that. Hmmm... or we could add, Hash functions are accompanied by an adjunct collision-resolution method. Now I think we've got a substantial digest. Scholarly diction is an art. Sbalfour (talk) 21:47, 24 September 2019 (UTC)
- "Perfect hash function" is a term of art - https://en.wikipedia.org/wiki/Perfect_hash_function. So the characterisation of non-injective hash functions as "imperfect" here isn't meant as a slight against them or a value judgement on their quality or usefulness, it's just a jargon way of saying that they have collisions.
- I think I nonetheless agree with you that the passage feels a bit weird to me, for a few reasons:
- It doesn't link to Perfect hash function, and it's non-obvious what "imperfect" means here unless you either happen to already know the jargon or you get a spidey sense that it's a jargon term and google it
- It's kind of tautological. "most hash table designs employ an imperfect hash function, which might cause hash collisions" - yes, that's literally what being imperfect means.
- The "ideal" scenario is simply impossible in the extremely common scenario that there are infinitely many possible keys (e.g. keys are strings of arbitrary length). You cannot have a perfect hash function with an infinite domain.
- ExplodingCabbage (talk) 17:33, 10 July 2024 (UTC)
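The collision point above can be made concrete with the pigeonhole principle: whenever there are more possible keys than buckets, some two distinct keys must share a hash code. A toy illustration (the 8-slot table and character-sum hash are arbitrary choices, not anything from the article):

```python
# Toy hash: sum of character codes modulo a small table size.
def toy_hash(key, buckets=8):
    return sum(ord(c) for c in key) % buckets

# Distinct keys, same hash code: a collision the table must resolve
# (e.g. by chaining or open addressing).
print(toy_hash("ab"))  # ord('a') + ord('b') = 195, and 195 % 8 = 3
print(toy_hash("ba"))  # same character sum, so the same bucket: 3
```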
Let's try again. The lead says, In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, ... What reason? It's a dangling participle, with look-thru ambiguity; it's not attached to anything. If we turn the sentence on its head, we get: Hash tables are more efficient in many situations than other things because <reason>. We need to state the reason, then we can say, For that reason, ... How about Hash tables provide nearly constant access time and compact storage for large numbers of records that can't be efficiently stored or accessed by other methods. There are ordered and unordered lists, linked lists, double-linked lists, circularly-linked lists, tries, heaps, stacks, queues, etc., too numerous to mention in the lead, so don't try. If it matters - and it's worth a paragraph or maybe a sub-sub-section - that's where we distinguish the situations where we might use other methods. Sbalfour (talk) 22:19, 24 September 2019 (UTC)
- I don't understand this second point of yours. It's not a dangling participle, it's unambiguous that the reason being talked about is that hash tables are more efficient than other stuff, and I have no idea what "look-thru ambiguity" / "look-through ambiguity" is meant to mean because I find almost no Google hits for either term (indeed this Talk page is the ONLY hit for "look-thru ambiguity"). ExplodingCabbage (talk) 17:35, 10 July 2024 (UTC)
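On the underlying claim in dispute (nearly constant access time), a quick way to see it is to count key comparisons for one lookup in an unsorted list versus a chained hash table. The 1000 keys and 64 buckets here are arbitrary illustration choices:

```python
# Count key comparisons for one lookup: linear scan vs. one hash-table chain.
keys = [f"key{i}" for i in range(1000)]

def list_search(keys, target):
    comparisons = 0
    for k in keys:
        comparisons += 1
        if k == target:
            break
    return comparisons

def hash_search(keys, target, buckets=64):
    # Build a chained table, then search only the target's chain.
    table = [[] for _ in range(buckets)]
    for k in keys:
        table[hash(k) % buckets].append(k)
    comparisons = 0
    for k in table[hash(target) % buckets]:
        comparisons += 1
        if k == target:
            break
    return comparisons

print(list_search(keys, "key999"))  # 1000: the scan inspects every key
print(hash_search(keys, "key999"))  # only one chain, roughly 1000/64 keys
```

The linear scan inspects all 1000 keys in the worst case, while the hash lookup inspects only the handful of keys sharing one bucket, which is the "nearly constant" behavior the lead is gesturing at.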
Variable-sized data and chained hashing
Would this be a good citation for the better performance of variable-sized data in chaining hash tables? https://www.researchgate.net/profile/Haval_Ahmed2/post/Any_ideas_about_hash_table_implementation/attachment/59d6430bc49f478072eabc10/AS:273806436831232@1442291949107/download/In-memory+hash+tables+for+accumulating+text+vocabularies.pdf
First sentence of section "2.3. Hash tables" seems to be useful, but it is not backed up in the article itself.
Hash digest
The explanation in section Hashing by division, "Where ... is the hash digest ...", is unclear because the term hash digest is not explained and does not appear in the equation above it. (Also, it should probably not be capitalized because it is not a complete sentence.) In the following section, Hashing by multiplication, it is the other way around: the symbol appears in the equation but is not explained. Gnib Bnil (talk) 14:55, 5 June 2024 (UTC)
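For reference while discussing this, the two schemes are conventionally given as h(x) = x mod m for hashing by division and h(x) = ⌊m · ((x · A) mod 1)⌋ for hashing by multiplication, with a constant 0 < A < 1. A sketch of both (the choice A = (√5 − 1)/2 is Knuth's well-known suggestion; the sample inputs are arbitrary):

```python
import math

def hash_by_division(x, m):
    # h(x) = x mod m; m is typically chosen prime, away from powers of 2.
    return x % m

def hash_by_multiplication(x, m, A=(math.sqrt(5) - 1) / 2):
    # h(x) = floor(m * frac(x * A)), where frac takes the fractional part.
    frac = (x * A) % 1.0
    return math.floor(m * frac)

print(hash_by_division(200412018, 240))       # 200412018 mod 240 = 18
print(hash_by_multiplication(200412018, 240)) # some slot in [0, 240)
```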
Recent paper disproving Yao's conjecture; proves uniform probing is sub-optimal
I don't know enough to say how this would be added to this article, but it looks like a fundamental advance.
https://arxiv.org/pdf/2501.02305
Described here: https://www.quantamagazine.org/undergraduate-upends-a-40-year-old-data-science-conjecture-20250210/
> A young computer scientist and two colleagues show that searches within data structures called hash tables can be much faster than previously deemed possible.
> “It’s an important paper,” said Alex Conway of Cornell Tech in New York City. “Hash tables are among the oldest data structures we have. And they’re still one of the most efficient ways to store data.” Yet open questions remain about how they work, he said. “This paper answers a couple of them in surprising ways.”
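For context while deciding where this fits: the paper is about probe sequences in open-addressed tables. The baseline idea being improved upon can be shown with the simplest probe sequence, linear probing (this sketch is just that baseline, not the paper's construction; it omits deletion/tombstones):

```python
# Minimal open-addressing table with linear probing: on a collision,
# try the next slot, wrapping around. Smarter probe sequences (the
# subject of the paper above) change which slots get tried, and in
# what order.
class OpenAddressingTable:
    def __init__(self, size=8):
        self.slots = [None] * size

    def insert(self, key, value):
        size = len(self.slots)
        for step in range(size):
            i = (hash(key) + step) % size  # linear probe sequence
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def search(self, key):
        size = len(self.slots)
        for step in range(size):
            i = (hash(key) + step) % size
            if self.slots[i] is None:
                return None  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```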