-
Notifications
You must be signed in to change notification settings - Fork 2
markfuge/Streaming-Data-Algorithms
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
As part of an assignment for the Scalable Machine Learning course (CS 281B) at the University of California - Berkeley, I'm implementing and comparing three data streaming algorithms: 1) The Count-Min Sketch Algorithm - Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38. 2) An improved (i.e. tighter-bounded) Count-Min Sketch algorithm which only handles inserts (sacrificing removal capabilities). 3) The SpaceSaving sketch - Efficient Computation of Frequent and Top-k Elements in Data Streams by Ahmed Metwally, Divyakant Agrawal and Amr El Abbadi While I don't anticipate that this implementation will be better than reference implementations by the original authors, it might come in handy for comparison purposes. Input Data: This code was originally run on the english-language dump of Wikipedia, which was around 2GB compressed at the time. Now it should be significantly larger, and you can download it separately at: http://dumps.wikimedia.org/enwiki/ Mark Fuge markfuge@gmail.com
About
Implementations of the Count-Min (CM) Sketch, Insert-only CM sketch, and the SpaceSaving sketch as part of UC Berkeley's Scalable Machine Learning class
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published