Skip to content

Implementations of the Count-Min (CM) Sketch, Insert-only CM sketch, and the SpaceSaving sketch as part of UC Berkeley's Scalable Machine Learning class

Notifications You must be signed in to change notification settings

markfuge/Streaming-Data-Algorithms

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

As part of an assignment for the Scalable Machine Learning course (CS 281B) at the University of California - Berkeley, I'm implementing and comparing three data streaming algorithms:
1) The Count-Min Sketch Algorithm - Cormode, Graham; S. Muthukrishnan (2004). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
2) An improved (i.e. tighter-bounded) Count-Min Sketch algorithm which only handles inserts (sacrificing removal capabilities).
3) The SpaceSaving sketch - Efficient Computation of Frequent and Top-k Elements in Data Streams by Ahmed Metwally, Divyakant Agrawal and Amr El Abbadi

While I don't anticipate that this implementation will be better than reference implementations by the original authors, it might come in handy for comparison purposes.

Input Data: This code was originally run on the english-language dump of Wikipedia, which was around 2GB compressed at the time. Now it should be significantly larger, and you can download it separately at: http://dumps.wikimedia.org/enwiki/


Mark Fuge
markfuge@gmail.com

About

Implementations of the Count-Min (CM) Sketch, Insert-only CM sketch, and the SpaceSaving sketch as part of UC Berkeley's Scalable Machine Learning class

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages