Look, a lot of people are confused about entropy, negentropy, and their relationship to information. For example, here’s a portion of a book on information: https://www.fil.ion.ucl.ac.uk/~wpenny/course/info.pdf
In section 4.3 it states: “The entropy is the average information content”. That seems to settle the matter in favor of your interpretation, doesn’t it?
But wait! Just a few lines further it declares that “Entropy measures uncertainty.”
My point here is that there’s a great deal of confusion in the terminology, which makes it especially easy to get muddled about what all these things mean. That’s why I prefer to ground my thinking about entropy in thermodynamics: it is based on physical phenomena, which makes it much easier to stay grounded.
Think in terms of the Second Law of Thermodynamics:
The entropy of an isolated system can never decrease.
And in fact we know that the entropy of isolated systems almost always increases, the only exceptions being absolutely static systems, such as those near absolute zero.
Now think about what this means about our information about an isolated system, such as a bacterium isolated in a container. We can know a great deal about that bacterium, because we know that in order to be alive, it has to be carrying out lots of biochemical reactions. We also know that all its molecules are confined within its cell membrane.
But because the bacterium is isolated, it has no food, no sustenance. It will die. Its cell membrane will break down and its volatile molecules will evaporate and spread through the volume of the container. Thus, with the passage of time, the entropy of the system increases, AND we know less about the system. Where previously we could identify the positions of many of the atoms in the cell, some of those atoms are now randomly scattered through the container. This is not a system with greater information content, as you believe; it is more random, it has higher entropy, and we know less about it.
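To put a rough number on that intuition, here is a back-of-the-envelope sketch (my own toy illustration, with made-up but plausible figures for the molecule count, the cell volume, and the container volume): for an ideal gas, letting N molecules spread from a small volume into a larger one raises the entropy by N·k_B·ln(V_big/V_small), and in information terms each molecule’s position then carries correspondingly more bits of uncertainty.

```python
import math

# Illustrative assumptions, not measurements:
k_B = 1.380649e-23     # Boltzmann constant, J/K
N = 1e10               # rough count of small volatile molecules in one bacterium
V_cell = 1e-18         # ~1 cubic micrometer, m^3 (typical bacterial volume)
V_container = 1e-6     # 1 cubic centimeter, m^3

# Entropy increase of an ideal gas expanding freely from V_cell to V_container:
delta_S = N * k_B * math.log(V_container / V_cell)
print(f"Delta S ~ {delta_S:.2e} J/K")

# The same change, expressed as extra positional uncertainty per molecule:
extra_bits = math.log2(V_container / V_cell)
print(f"~{extra_bits:.0f} extra bits of positional uncertainty per molecule")
```

The exact numbers don’t matter; the point is that the spreading itself is what both raises the entropy and erases what we knew about where the molecules are.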
Your belief that entropy equals information content, when combined with the Second Law, means that, with the passage of time, the information content of every system (including the universe) increases. At the Big Bang, according to your interpretation, the entire universe had zero information content, but that content has been steadily increasing as the universe has aged. On that view, all we have to do to gain more information about the universe is wait a while, and new information will simply appear out of nowhere.
That’s not how the universe works. With the passage of time, we know less and less, because information degrades with time. This should be obvious from the Uncertainty Principle.
A system with high entropy is one that we have little information about. A system with low entropy has greater information content.
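If you want to see the same correspondence in Shannon’s own terms, here is a minimal sketch (mine, not anything from the textbook linked above): a probability distribution that pins the system down to one state with near-certainty has low entropy, while a distribution spread evenly over many states, which tells us almost nothing about which state the system is in, has maximal entropy.

```python
import math

def shannon_entropy(probs):
    """H = -sum(p * log2(p)) in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Eight possible states of some toy system.
peaked = [0.93] + [0.01] * 7   # we are almost sure it's in state 0
uniform = [1 / 8] * 8          # every state equally likely; we know nothing

print(f"peaked  H = {shannon_entropy(peaked):.2f} bits")   # about 0.56 bits
print(f"uniform H = {shannon_entropy(uniform):.2f} bits")  # exactly 3 bits
```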
I suggest that the source of your confusion might come from mixing together two immiscible concepts: the information content of a system and the amount of data required to specify its state. To completely specify the state of a gas in a container, we must specify the position and momentum of every particle in the container. If the gas is at maximum entropy, then it is spread randomly through the container, and it will take a lot of RAM to store all the different positions and momenta. But one low-entropy state would have all the gas packed into one corner of the container, so that the positions of all the particles could be specified by a much smaller statement along the lines of (x < a) & (y < b) & (z < c). Indeed, in the theoretical minimum-entropy case, we can specify the entire system with statements such as x = a, y = b, z = c, px = 0, py = 0, and pz = 0.
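Here is a toy sketch of that bookkeeping (my own illustration, with arbitrary choices for the container size, the recording resolution, and the particle count): when the gas fills the container, every coordinate of every particle must be spelled out in full; once we know the gas is confined to a small corner, each coordinate costs far fewer bits; and in the pinned-down minimum-entropy case the per-particle data disappears entirely.

```python
import math

# Toy assumptions: cubic container of side L, coordinates recorded on a grid
# of resolution d, N identical particles.
L = 1.0          # container side (arbitrary units)
d = 1e-6         # recording resolution
N = 1_000_000    # number of particles

bits_full = math.log2(L / d)          # particle could be anywhere in the box
corner = 1e-3 * L                     # "all the gas packed into one corner"
bits_corner = math.log2(corner / d)   # position known to lie in the corner

print(f"max entropy : {N * 3 * bits_full / 8e6:.1f} MB of position data")
print(f"corner state: {N * 3 * bits_corner / 8e6:.1f} MB of position data")
# Theoretical minimum entropy (x = a, y = b, z = c, px = py = pz = 0):
# six numbers describe the whole system, no matter how large N is.
```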
Information theory has equations that parallel the equations of thermodynamics, but because information is a purely mathematical construct, there is no physical phenomenon to anchor your intuition to; you can think about those equations only in mathematical terms, and you can only prove them rather than feel your way to them.
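For what it’s worth, the parallel can be stated in one line: the Gibbs entropy S = -k_B Σ p ln p of statistical mechanics and Shannon’s H = -Σ p log2 p differ only by the constant factor k_B·ln 2, so the same probability distribution can be fed to either formula. A quick sketch of my own:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K

def shannon_H(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gibbs_S(probs):
    """Gibbs entropy in J/K: S = -k_B * sum(p * ln(p))."""
    return -k_B * sum(p * math.log(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(f"H = {shannon_H(p):.3f} bits")
print(f"S = {gibbs_S(p):.3e} J/K")
print(f"S / (k_B * ln 2) = {gibbs_S(p) / (k_B * math.log(2)):.3f}  # same number as H")
```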
This is why you misread Shannon’s paper. His formulation is highly abstract and easily misunderstood. You’d do better to read his later paper “Prediction and Entropy of Printed English”, in which he directly addresses the points we are considering here. Here’s a relevant quotation from that paper:
“From this analysis it appears that, in ordinary literary English, the long range statistical effects (up to 100 letters) reduce the entropy to something of the order of one bit per letter”
In other words, as you take into account constraints arising from semantic and syntactic factors extending over longer stretches of text, the entropy of each character is reduced; that is, the information content of the passage is increased by the additional structural constraints, even as the amount of information required to STORE the text is reduced. Remember, the process elements are just as important as the data elements. Shannon carried out a fascinating experiment with his wife in which she predicted successive letters of a text. Nowadays this kind of experiment can be executed on vastly larger scales with computers.
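A crude computerized version of that prediction experiment takes only a few lines (a toy sketch of my own; the filename sample_english.txt is just a placeholder for any long stretch of ordinary English): estimate the per-letter entropy with no context, then with a few letters of context, and watch the estimate fall as the model is allowed to exploit more of the text’s structure.

```python
import math
from collections import Counter, defaultdict

def unigram_entropy(text):
    """Per-letter entropy with no context: H = -sum p(c) * log2 p(c)."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conditional_entropy(text, order):
    """Per-letter entropy given the previous `order` letters."""
    contexts = defaultdict(Counter)
    for i in range(order, len(text)):
        contexts[text[i - order:i]][text[i]] += 1
    total = len(text) - order
    h = 0.0
    for counts in contexts.values():
        ctx_total = sum(counts.values())
        for n in counts.values():
            h -= (n / total) * math.log2(n / ctx_total)
    return h

# "sample_english.txt" is a placeholder: any long plain-text English file will do.
text = open("sample_english.txt").read().lower()
print(f"no context      : {unigram_entropy(text):.2f} bits/letter")   # roughly 4 for English
print(f"1 letter of ctx : {conditional_entropy(text, 1):.2f} bits/letter")
print(f"3 letters of ctx: {conditional_entropy(text, 3):.2f} bits/letter")
# Caveat: on short samples the higher-order estimates are biased low,
# because the model effectively memorizes the text.
```

Shannon’s figure of about one bit per letter required human predictors (or, today, very large statistical models) exploiting constraints over something like a hundred letters; simple letter counts like these only get partway there.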
There are quite a few academic papers discussing the information content of text in various languages. I have not followed this discipline, but the few papers I have read all make it clear that procedural constraints (semantic, syntactic, and thematic) decrease the number of bits required to store each letter of text – BUT only if you incorporate those constraints algorithmically. In other words, you substitute process for data. That doesn’t reduce information content; because process is universally applicable, it increases information content.