## The Library of Congress and a Random String

From discussion on embryophysics

29.02.2012 22:34 Steve McGrew:

If the whole drive is filled with a maximally compressed portion of the Library of Congress, it would contain a collection of 0’s and 1’s with no detectable internal correlations. That is, it would be random according to all statistical measures. In order to describe the contents of the drive in that case, the state of each bit would need to be specified individually. It would require at least 8 trillion bits to describe it.

01.03.2012 20:46 Evgenii Rudnyi:

I am not sure if I understand why you say that the content of the Library of Congress is random. I would say not. Either the content is compressed or not, in my view, this does not change the fact that the information in the books is not random.

Even if the compressed form looks random, it is actually not, as one can decompress the archive and restore normal books.

02.03.2012 00:36 Steve McGrew:

A maximally compressed digital version of any information (i.e., the Library of Congress) is devoid of internal correlations — because those internal correlations are what is reduced by data compression. “Maximally compressed” would mean that there are no further internal correlations to exploit for compression.

What distinguishes a random sequence of bits from a nonrandom sequence is precisely the internal correlations. The “degree of randomness” of a sequence corresponds to the length of the shortest possible algorithm that can generate the same sequence. Any algorithm to generate the sequence efficiently will take advantage of internal correlations in the sequence. That is, it corresponds to the algorithmic complexity of the sequence.

A maximally compressed version of a sequence *IS* effectively the shortest algorithm that can generate the original sequence. In practice, no attempt is made to maximally compress data. Instead, a standard algorithm is used to detect and exploit a very limited subset of the possible correlations to produce a new sequence which, when run through an inverse of the algorithm, will regenerate the original sequence. Consequently, some correlations always remain after ordinary data compression.

When you say, “Even if the compressed form looks random, it is actually not, as one can decompress the archive and restore normal books”, I think you’re stepping into a very messy topic. It is not possible to decompress the archive without using the decompression algorithm, so the decompression algorithm effectively /contains/ part of the information that’s to be decompressed. A bigger compression/decompression algorithm can allow a higher degree of compression.

02.03.2012 20:39 Evgenii Rudnyi:

I understand that one needs a decompression algorithm. Yet, when we compress the Library of Congress, we know that such an algorithm exists. If we take a really random string, then presumably such a decompression algorithm just do not exist.