Computing Shannon Entropy for Information Density (ID)
Notes for Graphic Semiology Fundamentals from Session 1
- 1️⃣ Session 1: Graphic Semiology Fundamentals
- 1.3. An example from a research paper
- Research paper: "Different languages, similar encoding efficiency" by Coupé, Oh, Dediu, and Pellegrino; full citation below.
About this study
Coupé, C., Oh, Y. M., Dediu, D., & Pellegrino, F. (2019). Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9), eaaw2594.
Source: Science.org
In the study, Shannon entropy is used to estimate the Information Density ($ID$) per syllable, specifically as the second-order conditional entropy to account for syllable dependencies within words. Here's a simplified explanation of how it's computed, based on the methodology described:
Protocol for computing $ID$
1. Syllable Probabilities
- Collect a large written corpus for each language (e.g., texts, lexical databases).
- Transcribe the corpus phonetically and segment it into syllables (using rule-based programs or existing syllabification for some languages).
- For each language, calculate:
- Unigram probabilities: $p(x)$, the probability of each syllable $x$ occurring in the corpus.
- Bigram probabilities: $p(x, y)$, the probability of syllable $y$ following syllable $x$ within the same word (with a null marker standing in for the context of word-initial syllables); a counting sketch follows this step.
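A minimal counting sketch in Python (the study's own code is in R; names like `count_probabilities` and `BOUNDARY` are illustrative assumptions, not from the paper). It assumes the corpus has already been syllabified, with each word represented as a list of syllable strings:

```python
from collections import Counter

BOUNDARY = "#"  # hypothetical null marker for word-initial contexts

def count_probabilities(words):
    """Estimate unigram p(x) and within-word bigram p(x, y) by relative frequency.

    `words` is an iterable of syllable lists, e.g. [["ba", "na", "na"], ...].
    """
    unigrams, bigrams = Counter(), Counter()
    for syllables in words:
        unigrams.update(syllables)
        # Pair each syllable with its predecessor inside the word; the
        # word-initial syllable is paired with the null boundary marker.
        for prev, curr in zip([BOUNDARY] + syllables[:-1], syllables):
            bigrams[(prev, curr)] += 1
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    p_x = {x: n / total_uni for x, n in unigrams.items()}
    p_xy = {xy: n / total_bi for xy, n in bigrams.items()}
    return p_x, p_xy
```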
2. First-Order Entropy ($ShE$)
- Compute the standard Shannon entropy for syllables (unigram-based):
$$
ShE = -\sum_{x} p(x) \cdot \log_2(p(x))
$$
- $p(x)$: Probability of syllable $x$.
- This measures the average uncertainty (information content) per syllable, ignoring context; see the sketch after this step.
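A direct translation of the formula into Python, reusing the `p_x` dictionary from the sketch above (a sketch under the same assumptions, not the study's code):

```python
import math

def shannon_entropy(p_x):
    """First-order entropy in bits per syllable: ShE = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in p_x.values())
```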
3. Second-Order Entropy ($ID$)
- Compute the conditional entropy to account for syllable dependencies (bigrams within words):
$$
ID = -\sum_{x, y} p(x, y) \cdot \log_2\left(\frac{p(x, y)}{p(x)}\right)
$$
- $p(x, y)$: Joint probability of syllable $y$ following syllable $x$.
- $\dfrac{p(x, y)}{p(x)}$: Conditional probability $p(y|x)$, the likelihood of $y$ given $x$.
- This reflects the information content per syllable given the preceding syllable, making it a more accurate measure of linguistic information; a sketch follows this step.
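A matching sketch for the conditional entropy. One subtlety worth making explicit: for $\frac{p(x, y)}{p(x)}$ to be a proper conditional probability, $p(x)$ must here be the marginal of the bigram distribution over its first position (so it includes the null boundary marker as a context), which is how it is recovered below:

```python
import math
from collections import defaultdict

def conditional_entropy(p_xy):
    """Second-order entropy: ID = -sum_{x,y} p(x, y) * log2( p(x, y) / p(x) )."""
    # Recover p(x) as the marginal of p(x, y) over y, so that
    # p(x, y) / p(x) equals the conditional probability p(y | x).
    p_context = defaultdict(float)
    for (x, _), p in p_xy.items():
        p_context[x] += p
    return -sum(p * math.log2(p / p_context[x]) for (x, _), p in p_xy.items())
```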
4. Information Rate ($IR$)
- Multiply $ID$ (bits per syllable) by the speech rate $SR$ (syllables per second):
$$
IR = ID \cdot SR
$$
- $SR$ is the number of syllables ($NS$) divided by the speech duration in seconds, with pauses longer than $150$ ms excluded; see the sketch after this step.
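The last step is a one-liner; the figures in the usage example are made up for illustration:

```python
def information_rate(id_bits_per_syllable, n_syllables, duration_seconds):
    """IR = ID * SR, where SR = NS / duration (pauses > 150 ms already removed)."""
    sr = n_syllables / duration_seconds  # syllables per second
    return id_bits_per_syllable * sr     # bits per second

# Hypothetical example: 6.2 bits/syllable over 1250 syllables in 200 s of speech
# gives SR = 6.25 syll/s, hence IR = 38.75 bits/s.
ir = information_rate(6.2, 1250, 200.0)
```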
Key Notes
- Data sources: large written corpora provide the syllable probabilities, while spoken corpora ($170$ speakers, $17$ languages, $\sim 240,000$ syllables) provide $SR$.
- Why conditional entropy? It accounts for syllable predictability within words: because the preceding syllable partly predicts the next one, $ID$ comes out lower than the first-order entropy ($ShE$), which treats syllables as independent.
- Result: across the $17$ languages, $ID$ varies widely (e.g., $4.8$ bits/syllable for Basque vs. $8.0$ for Vietnamese), yet $IR$ converges around $\sim 39$ bits/s because of a trade-off between $ID$ and $SR$, as the illustration below shows.
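To see the trade-off numerically, take the two $ID$ values above with illustrative speech rates (chosen here so that $IR$ lands near $39$ bits/s; the measured rates are in the paper's supplementary data):
$$
\underbrace{4.8 \times 8.1}_{\text{low } ID,\ \text{high } SR} \approx 38.9 \text{ bits/s},
\qquad
\underbrace{8.0 \times 4.9}_{\text{high } ID,\ \text{low } SR} \approx 39.2 \text{ bits/s}.
$$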
For the full implementation, the study provides R code and data in the GitHub repository: https://github.com/keruiduo/SupplMatInfoRate.