I previously wrote about how Scientific English is a specialized form of language used in formal presentations and publications. It is rich in ‘rare’, or extremely low-frequency, words and the collocations that define them (e.g. we ‘sequence a genome’ or refer to a ‘stretch of DNA’). Learning to comprehend such formal language requires considerable exposure, and writing it well truly exercises one’s knowledge of the ‘long tail’ of vocabulary.
By contrast, ‘the’ is the most common word in English, and we use ‘a’ and ‘or’, as in the examples above, all the time. High- and low-frequency words can easily be identified by getting computers to do the heavy lifting of counting, and frequencies vary considerably. In the 560-million-word Corpus of Contemporary American English (COCA) (the list of the first 5,000 words is free) one can look up the rank of any word. In this corpus, for example, ‘attic’ is 7309, ‘unsparing’ is 47309, and ‘embryogenesis’ is 57309.
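To illustrate the counting the computers do, here is a minimal Python sketch of my own (not COCA’s actual pipeline) that ranks the words of any text by frequency:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

sample = "the cat sat on the mat and the dog saw the cat"
print(word_frequencies(sample)[0])  # 'the' tops even a tiny sample: ('the', 4)
```

Run over a 560-million-word corpus instead of a toy sentence, the same few lines yield rankings like COCA’s.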
This considerable variation has many ramifications for language acquisition and use. A developmental biologist might require the word ‘embryogenesis’, but no one would think of it as the first word to learn in another language (that word is actually ‘thank you’). Of course, context defines utility, but in general frequency is a hard and fast rule: one will encounter ‘cat’ long before ‘catnip’, ‘dog’ before ‘dogcatcher’, and both repeatedly before ‘armadillo’, ‘aardvark’, or ‘baleen whale’.
The nature of the word frequency curve offers good and bad news to language learners. On the positive side, analysis of the distribution of frequencies yields a surprising statistic: the 100 most common words in English make up fully 50% of spoken and written language. One can get a long way by learning the front of the curve.
The flip side is that the frequency curve drops off precipitously. The ‘lexical’ words of a language (e.g. the nouns and verbs that carry information) are numerous, and their frequency of use falls away fast. The ‘diminishing returns’ statistic states that with the most common 1,000 words you will be able to read about 75% of most texts; with 2,000 you will be at 85%; but another 1,000 adds less than 2% more.
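The diminishing returns fall out of the shape of the curve itself. Assuming an idealized Zipf distribution (a word’s frequency proportional to 1/rank, a textbook approximation rather than the COCA data), a few lines of Python show coverage growing ever more slowly as the vocabulary expands:

```python
def zipf_coverage(n_known, vocab_size=50_000):
    """Fraction of running text covered by the n_known most common words,
    under the idealized assumption that frequency is proportional to 1/rank."""
    total = sum(1 / r for r in range(1, vocab_size + 1))
    known = sum(1 / r for r in range(1, n_known + 1))
    return known / total

for n in (100, 1000, 2000, 3000):
    print(f"{n:>5} words -> {100 * zipf_coverage(n):.1f}% coverage")
```

Each extra thousand words buys a smaller slice of text than the last: the ‘diminishing returns’ statistic in miniature.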
Getting to native level means learning thousands of words that are hardly used. Authors of literature take full advantage of this fact to find exactly the right, colourful words, thus flavouring their creative works. This steep learning curve is what makes the mastery of scientific language by non-native speakers such an achievement.
Similar patterns of low- and high-frequency ‘words’ are found in DNA, as the rule of ‘a few common and many rare’ is a fundamental part of how nature works. An integral part of the study of DNA is the use of computers to mine such patterns. My field of research, bioinformatics, works at the intersection of computing, biology, mathematics, statistics, and data management. This domain exploded into existence with the advent of genomics and, given the wealth of data and speed of progress, adopted an ‘open source’ ethos: the collective view that freely available, shared data and software accelerate discovery.
With such thoughts, I recently tackled Swedish while living in Sweden as a guest professor. I’d dabbled in other languages, including living in Japan for a year teaching English, so I was well versed in failing to make languages stick. Just before moving, I spent a year self-studying Hindi. I was intrigued because it is one of the world’s ‘big five’ languages, a lingua franca in India, and derives from the ancient language of Sanskrit, a well-spring of wisdom as old as Greek and Latin.
Attempting Hindi ‘for fun’ opened my eyes to how the Internet has revolutionized access to native-speaking teachers all over the world and the wealth of online materials they produce. So, I proceeded with my life-long language ‘experiment’, aiming to learn to read basic Swedish as quickly as possible using free, online materials. I collected the Swedish words I saw most frequently around me in a ‘word log’ and trawled YouTube for productive listening videos to gain an overview of Swedish grammar. Within two months I could read Easy Swedish News at 90–100% comprehension, about a 1,500-word recognition vocabulary.
Mostly, it offered a chance to think more deeply about how we learn language best. Working in a frequency-based way allowed a ‘memorization-free’ philosophy. I bulked up quickly by curating vocabulary lists, but I also consumed ‘real world’ materials (from ABBA songs in Swedish to Facebook ads). Picking my resources carefully meant I saw the same words over and over. I didn’t worry about not knowing a word until I had interacted with it ten times, what I think of as the ‘10x’ rule. I used the same method for Spanish, a language with a cognate pool closer to English and even more online resources, in only two weeks. I can’t speak a word of these languages, but, satisfyingly, I feel I got a ‘flavour’ of each. I also feel just that bit more the ‘global citizen’.
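For the curious, the word-log-plus-‘10x’ bookkeeping is trivial to automate. This little class is a hypothetical sketch of the idea, not a published tool:

```python
from collections import Counter

class WordLog:
    """Count encounters with words; a word counts as 'learned'
    once it has been seen 10 times (the informal '10x' rule)."""

    THRESHOLD = 10

    def __init__(self):
        self.seen = Counter()

    def encounter(self, *words):
        for w in words:
            self.seen[w.lower()] += 1

    def learned(self):
        return {w for w, n in self.seen.items() if n >= self.THRESHOLD}
```

Feed it every sign, song lyric, and advert you read, and the words worth knowing surface on their own.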
My ‘linguistic tourist’ experience was marred by the fact that so many teaching resources break the ‘frequency’ rule and few exploit it. I kept thinking how much faster I could learn with graded, interlinked resources. Ruminating on this, I most recently forayed into Estonian, a language more foreign (and therefore more interesting) to English speakers than Hindi, because it sits outside the Indo-European language family. The experience drove home a blindingly obvious fact that we overlook at our peril: how much time is spent learning the basics, which are always different, and knowing them is fundamental to learning the new language. This goes deeper than wishing for ‘uniform’ materials: there is a ‘true core’ to language, and we know scientifically that it is frequency based. Plus, polyglots swear the trick to language is caring about the words you are learning in the first place.
What if the language-learning community agreed upon a shared ‘core’ hosted in the public domain and built resources around it? Could such an experiment one day support a crazy ambition to learn the ‘flavour’ of ten languages in a year? Let’s say, the first 1,000 words for reading?
A free and open “First Words” list could be built on in infinite ways, making it easy to learn memorization-free by the “10x” rule. My first wish would be a choice selection from the list, ‘100 words for speaking’, engineered to cover sentence construction and grammar with a view to getting one speaking. Prioritizing the ‘5Ws and H’, the focus would be on beginner statements (“My name is…”), forming questions to support dialogue, greetings, a few power nouns and verbs such as ‘to be’ and ‘to have’, and key glue words (‘the’, ‘of’, ‘and’, ‘or’, ‘but’, etc.). Even 20 words are sufficient to cover pronunciation, the basics of sentence formation, and first grammar rules, and to support simple dialogue, as an over-simplified illustration of how a language works.
While dreaming, why stop at 1,000 words? With 2,000 words one is pretty much ‘fluent’ in daily conversation, and with 5,000 one can make good sense of a newspaper. There is no reason, in theory, the list could not grow to include the whole language (dictionaries), right up to the complexities of Scientific English. If you would like to collaborate, please contact me at unityinwriting(at)gmail(dot)com.
Featured image credit: Quote by Maialisa. CC0 public domain via Pixabay.