Computing Shannon Entropy for Information Density (ID)

Notes for Graphic Semiology Fundamentals from Session 1

Notes

About this study

Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche by Christophe Coupé, Yoon Mi Oh, Dan Dediu, and François Pellegrino.
Source: Science Advances (science.org)

In the study, Shannon entropy is used to estimate the Information Density ($ID$) per syllable, specifically as the second-order conditional entropy to account for syllable dependencies within words. Here's a simplified explanation of how it's computed, based on the methodology described:

Protocol for computing $ID$

1. Syllable Probabilities

  • Collect a large written corpus for each language (e.g., texts, lexical databases).
  • Transcribe the corpus phonetically and segment it into syllables (using rule-based programs or existing syllabification for some languages).
  • For each language, calculate:
    • Unigram probabilities: $p(x)$, the probability of each syllable $x$ occurring in the corpus.
    • Bigram probabilities: $p(x, y)$, the probability of a syllable $y$ following a syllable $x$ within the same word (or a null marker for word-initial syllables); see the code sketch after this list.
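
A minimal sketch of this step in Python (the study itself provides R code; the toy words, syllable strings, and the `#` null marker below are illustrative assumptions, not data from the paper):

```python
from collections import Counter

# Toy corpus: each word is already phonetically transcribed and segmented
# into syllables, as step 1 assumes (via rule-based syllabification or an
# existing lexical database). The syllable strings are invented.
words = [
    ["ka", "to"],
    ["ka", "mi"],
    ["to", "ru"],
    ["ka", "to", "ru"],
]

NULL = "#"  # null marker standing in for the word-initial context

# Unigram counts: every syllable occurrence in the corpus.
unigram_counts = Counter(syll for word in words for syll in word)

# Bigram counts: (previous syllable, current syllable) pairs within a word,
# with the null marker as the context of each word-initial syllable.
bigram_counts = Counter()
for word in words:
    prev = NULL
    for syll in word:
        bigram_counts[(prev, syll)] += 1
        prev = syll

# Relative frequencies as probability estimates.
n_unigrams = sum(unigram_counts.values())
n_bigrams = sum(bigram_counts.values())
p_unigram = {x: c / n_unigrams for x, c in unigram_counts.items()}  # p(x)
p_bigram = {xy: c / n_bigrams for xy, c in bigram_counts.items()}   # p(x, y)

print(p_unigram)
print(p_bigram)
```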

2. First-Order Entropy ($ShE$)

  • Compute the standard Shannon entropy for syllables (unigram-based): $$ ShE = -\sum_{x} p(x) \cdot \log_2(p(x)) $$
    • $p(x)$: Probability of syllable $x$.
    • This measures the average uncertainty or information content per syllable, ignoring context; a short code sketch follows.
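
A minimal sketch of the $ShE$ computation, assuming the unigram probabilities from step 1 are available as a Python dictionary (the toy distribution below is invented):

```python
import math

def shannon_entropy(p_unigram):
    """First-order Shannon entropy in bits per syllable: ShE = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in p_unigram.values() if p > 0)

# Toy unigram distribution over four syllables (probabilities sum to 1).
p_unigram = {"ka": 0.5, "to": 0.25, "mi": 0.125, "ru": 0.125}
print(shannon_entropy(p_unigram))  # 1.75 bits per syllable
```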

3. Second-Order Entropy ($ID$)

  • Compute the conditional entropy to account for syllable dependencies (bigrams within words): $$ ID = -\sum_{x, y} p(x, y) \cdot \log_2\left(\frac{p(x, y)}{p(x)}\right) $$
    • $p(x, y)$: Joint probability of syllable $y$ following syllable $x$.
    • $\dfrac{p(x, y)}{p(x)}$: Conditional probability $p(y|x)$, the likelihood of $y$ given $x$.
    • This reflects the information content per syllable given the preceding syllable, which makes it a more accurate measure of linguistic information; see the sketch below.
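
A minimal sketch of the $ID$ computation from within-word bigram counts. Here $p(x)$ is estimated as the marginal frequency of $x$ as a context, so that $p(x, y)/p(x)$ is a proper conditional probability $p(y|x)$; the counts below are invented, and this is a sketch rather than the study's R implementation:

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(bigram_counts):
    """Second-order (conditional) entropy in bits per syllable:
    ID = -sum over (x, y) of p(x, y) * log2(p(y | x))."""
    total = sum(bigram_counts.values())
    # How often each context x precedes a syllable (marginal of the bigrams).
    context_counts = defaultdict(int)
    for (x, _y), c in bigram_counts.items():
        context_counts[x] += c
    h = 0.0
    for (x, y), c in bigram_counts.items():
        p_xy = c / total                     # joint probability p(x, y)
        p_y_given_x = c / context_counts[x]  # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h

# Toy within-word bigram counts; "#" is the word-initial null marker.
bigram_counts = Counter({
    ("#", "ka"): 4, ("ka", "to"): 3, ("ka", "mi"): 1,
    ("#", "to"): 2, ("to", "ru"): 2,
})
print(conditional_entropy(bigram_counts))  # ID estimate in bits per syllable
```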

4. Information Rate ($IR$)

  • Multiply the $ID$ (bits per syllable) by the speech rate $SR$ (syllables per second): $$ IR = ID \cdot SR $$
    • $SR$ is calculated as the number of syllables ($NS$) divided by the duration of speech in seconds, excluding pauses longer than $150$ ms; a numeric sketch follows.
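
A minimal numeric sketch of step 4; the $ID$ value and the syllable and duration figures below are hypothetical, chosen only to illustrate the arithmetic:

```python
def information_rate(id_bits_per_syllable, n_syllables, duration_seconds):
    """IR = ID * SR, where SR = NS / duration (pauses > 150 ms are assumed
    to have been excluded from the duration before calling this function)."""
    sr = n_syllables / duration_seconds  # syllables per second
    return id_bits_per_syllable * sr

# Hypothetical speaker: ID = 6.5 bits/syllable, 120 syllables in 20 s of speech.
print(information_rate(6.5, n_syllables=120, duration_seconds=20.0))  # 39.0 bits/s
```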

Key Notes

  • Data sources: Large written corpora provide the syllable frequencies, while spoken corpora ($170$ speakers, $17$ languages, $\sim 240,000$ syllables) provide $SR$.
  • Why conditional entropy? It accounts for syllable predictability within words, so predictable syllables contribute less information than under first-order entropy ($ShE$), which treats syllables as independent.
  • Result: Across $17$ languages, $ID$ varies (e.g., $4.8$ bits/syllable for Basque to $8.0$ for Vietnamese), but $IR$ converges to $\sim 39$ bits/s because of a trade-off between $ID$ and $SR$.

For detailed implementation, the study provides R code and data in the GitHub repository: https://github.com/keruiduo/SupplMatInfoRate.
