Lossy Count Algorithm

The lossy count algorithm is an algorithm to identify elements in a data stream whose frequency count exceed a user-given threshold. The algorithm works by dividing the Data Stream into ‘Buckets’ as for frequent items, but fill as many buckets as possible in main memory one time. The frequency computed by this algorithm is not always accurate, but has an error threshold that can be specified by the user. The run time space required by the algorithm is inversely proportional to the specified error threshold, hence larger the error, the smaller the footprint.

It was created by eminent computer scientists Rajeev Motwani and Gurmeet Singh Manku. This algorithm finds huge application in computations where data takes the form of a continuous data stream instead of a finite data set, for e.g. network traffic measurements, web server logs, clickstreams.

Algorithm[]

The general algorithm followed is outlined as follows^[1]

Step 1: Divide the incoming data stream into buckets of width $w=1/\epsilon$ , where $\epsilon$ is mentioned by user as the error bound (along with minimum support threshold = $\sigma$ ).
Step 2: Increment the frequency count of each item according to the new bucket values. After each bucket, decrement all counters by 1.
Step 3: Repeat – Update counters and after each bucket, decrement all counters by 1.

References[]

^ Han, Jiawei. (2006). Data mining : concepts and techniques. Kamber, Micheline. (2nd ed.). Amsterdam: Elsevier. ISBN 978-0-08-047558-5. OCLC 143252170.

Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams". VLDB '02 Proceedings of the 28th International Conference on Very Large Data Bases: 346–357.

This computer science article is a stub. You can help Wikipedia by .

[1] Han, Jiawei. (2006). Data mining : concepts and techniques. Kamber, Micheline. (2nd ed.). Amsterdam: Elsevier. ISBN 978-0-08-047558-5. OCLC 143252170.

[1]