Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.^[1]

Contents

In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents,^[2] totaling 2,810,784 bytes, as listed below.

Size (bytes)	File name	Description
152,089	alice29.txt	English text
125,179	asyoulik.txt	Shakespeare
24,603	cp.html	HTML source
11,150	fields.c	C source
3,721	grammar.lsp	LISP source
1,029,744	kennedy.xls	Excel spreadsheet
426,754	lcet10.txt	Technical writing
481,861	plrabn12.txt	Poetry (Paradise Lost)
513,216	ptt5	CCITT test set
38,240	sum	SPARC executable
4,227	xargs.1	GNU manual page
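
As a rough illustration of how the table might be used, the following Python sketch checks the stated total against the listed sizes and measures a compression ratio with the standard-library zlib module. The names CORPUS_SIZES, total_size, compression_ratio, and benchmark are illustrative and not part of the corpus; the choice of zlib (DEFLATE) and the corpus_dir argument, a local directory assumed to hold the unpacked files, are assumptions made for the example.

```python
import zlib
from pathlib import Path

# File sizes in bytes, copied from the table above.
CORPUS_SIZES = {
    "alice29.txt": 152_089,
    "asyoulik.txt": 125_179,
    "cp.html": 24_603,
    "fields.c": 11_150,
    "grammar.lsp": 3_721,
    "kennedy.xls": 1_029_744,
    "lcet10.txt": 426_754,
    "plrabn12.txt": 481_861,
    "ptt5": 513_216,
    "sum": 38_240,
    "xargs.1": 4_227,
}


def total_size(sizes=CORPUS_SIZES):
    """Return the sum of the listed file sizes in bytes."""
    return sum(sizes.values())


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size, using zlib."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, level)) / len(data)


def benchmark(corpus_dir):
    """Map each listed file found in corpus_dir to its zlib compression ratio.

    corpus_dir is a hypothetical local directory containing the unpacked
    corpus; files that are missing are silently skipped.
    """
    results = {}
    for name in CORPUS_SIZES:
        path = Path(corpus_dir) / name
        if path.is_file():
            results[name] = compression_ratio(path.read_bytes())
    return results
```

Calling total_size() returns 2810784, which matches the total given above. A lower value from compression_ratio means better compression; any other lossless compressor could be substituted for zlib in the same way.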

References

[1] Witten, Ian H.; Moffat, Alistair; Bell, Timothy C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
[2] Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.

External links

The Canterbury Corpus



