I previously wrote about how Scientific English is a specialized form of language used in formal presentations and publications. It is rich in ‘rare’, or extremely low frequency, words and the collocations that define them (e.g. we ‘sequence a genome’ or a ‘stretch of DNA’). Learning to comprehend such formal language requires considerable exposure, and writing it well truly exercises one’s knowledge of the ‘long tail’ of vocabulary.
By contrast, ‘the’ is the most common word in English, and we use ‘a’ and ‘or’, as in the above examples, all the time. High and low frequency words can easily be identified by getting computers to do the heavy lifting of counting, and frequencies vary considerably. In the 560-million-word Corpus of Contemporary American English (COCA), one can look up the rank of any word (the list of the first 5,000 words is free). In this corpus, for example, ‘attic’ is ranked 7309, ‘unsparing’ 47309, and ‘embryogenesis’ 57309.
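As a minimal sketch of how a computer can do this counting, the Python snippet below tallies the words in a short sample text and ranks them by frequency. The sample sentence, the simple tokenizer, and the function name `rank_words` are assumptions made for illustration. A real corpus such as COCA is built with far more careful processing.

```python
import re
from collections import Counter


def rank_words(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lowercase the text and keep alphabetic runs (with internal
    # apostrophes) as words; this is a deliberately simple tokenizer.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


# Hypothetical sample text, not taken from any corpus.
sample = "The cat saw the dog. The dog saw a cat, and the cat ran."
ranking = rank_words(sample)
for rank, word, count in ranking[:3]:
    print(rank, word, count)
```

Even in this tiny sample, ‘the’ comes out on top, while several words appear only once.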
This considerable variation has many ramifications for language acquisition and use. A developmental biologist might require the word ‘embryogenesis’, but no one would think of it as the first word to learn in another language (that word is actually ‘thank you’). Of course, context defines utility, but in general ‘frequency’ is a hard and fast rule: one will encounter ‘cat’ long before ‘catnip’, ‘dog’ before ‘dogcatcher’, and ‘cat’ and ‘dog’ repeatedly before ‘armadillo’, ‘aardvark’, or ‘baleen whale’.
The nature of the word frequency curve offers good and bad news to language learners. On the positive side, analysis of the distribution of frequencies yields a surprising statistic: the 100 most common words in English make up about 50% of spoken and written language. One can get a long way by learning the front of the curve.
The flip side is that the frequency curve gets harder precipitously. The ‘lexical’ words of language (e.g. nouns and verbs that carry information) are numerous and drop off fast in use. The ‘diminishing returns’ statistic states that with the most common 1,000 words you’ll be able to read 75% of most texts, with 2,000 you’ll be at 85%, but another 1,000 adds less than 2% more.
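The ‘diminishing returns’ idea can be illustrated with a short Python sketch that computes how much of a text is covered by its most frequent words. The function name `coverage` and the toy frequency table are assumptions for illustration only. The counts are made up and are not real corpus data.

```python
from collections import Counter


def coverage(counts, top_n):
    """Fraction of all word tokens accounted for by the top_n most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Toy frequency table (made-up numbers chosen to sum to 100 tokens).
toy_counts = Counter({"the": 50, "of": 20, "and": 15, "cat": 8,
                      "dog": 4, "attic": 2, "armadillo": 1})

# Each extra batch of words adds less coverage than the one before.
for n in (1, 2, 4, 7):
    print(n, round(coverage(toy_counts, n), 2))
```

In this toy table the single most common word covers half of all tokens, while the last three words together add only a few percent, the same shape as the real curve described above.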
Getting to native level means learning thousands of words that are hardly used. Authors of literature take full advantage of this fact to find exactly the right, colourful words, thus flavouring their creative works. This steep learning curve is what makes the mastery of scientific language by non-native speakers such an achievement.
Similar patterns of low and high frequency ‘words’ are found in DNA, as the rule of ‘a few common and many rare’ is a fundamental part of how nature works. An integral part of the study of DNA is the use of computers to mine such patterns. My field of research, bioinformatics, works at the intersections of computing, biology, math, statistics, and data management. This domain exploded into existence because of the advent of genomics and, given the wealth of data and speed of progress, adopted an ‘open source’ ethos: the collective view that the benefit of freely available and shared data and software is accelerated discovery.
With such thoughts, I recently tackled Swedish while living in Sweden as a guest professor. I’d dabbled in other languages, including living in Japan for a year teaching English, so I was well-versed in failing to get language to stick. Just before moving, I spent a year self-studying Hindi. I was intrigued because it is one of the world’s ‘big five’ languages, a lingua franca in India, derives from the ancient language of Sanskrit, and is a well-spring of wisdom for both Greek and Latin.
Attempting Hindi ‘for fun’ opened my eyes to how the Internet has revolutionized access to native speaking teachers all over the world and the wealth of online materials they produce. So, I proceeded with my life-long language ‘experiment’ aiming to learn how to read basic Swedish as quickly as possible using free, online materials. I collected up Swedish words I saw most frequently around me in a ‘word log’ and trawled YouTube for productive listening videos to gain an overview of Swedish grammar. I could read Easy Swedish News at 90%-100% comprehension within two months, about a 1,500-word recognition vocabulary.
Mostly, it offered a chance to think more deeply about how we learn language best. Working in a frequency-based way allowed a ‘memorization-free’ philosophy. I bulked up quickly by curating vocabulary lists, but I also consumed ‘real world’ materials (from ABBA songs in Swedish to Facebook ads). Picking my resources carefully meant I saw the same words over and over. I didn’t worry that I didn’t know a word until I’d interacted with it ten times, what I think of as the “10x” rule. I used the same method for Spanish, a cognate pool more similar to English, and with even more online resources, in only two weeks. I can’t speak a word of these languages, but satisfyingly, I feel I got a ‘flavour’ of each. I also feel just that bit more the ‘global citizen’.
My ‘linguistic tourist’ experience was overshadowed by the fact that so many teaching sources break the ‘frequency’ rule and few use it. I kept thinking how much faster I could learn with graded, interlinked resources. Ruminating on this, I most recently forayed into Estonian, a language more foreign (and therefore interesting) to English speakers than Hindi because it is outside the Indo-European language family. The experience drove home a blindingly obvious fact that we overlook at our peril: how much time is spent learning the English in each set of materials. It is always different, and knowing it is fundamental to learning the new language. This goes deeper than wishing for ‘uniform’ materials: there is a ‘true core’ to language, and we know scientifically that it is frequency based. Plus, polyglots swear the trick to language is caring about the words you are learning in the first place.
What if the languages community agreed upon a shared ‘core’ hosted in the public domain and built resources around it? Could such an experiment one day support a crazy ambition to learn the ‘flavour’ of ten languages in a year? Let’s just say the first 1,000 words for reading?
A free and open “First Words” list could be built on in infinite ways, thus making it easy to learn memorization-free by the “10x” rule. My first wish would be a choice selection from the list of ‘100 words for speaking’, engineered to cover sentence construction and grammar with a view to getting one speaking. Prioritizing the ‘5Ws and H’, the focus would be on beginner statements (“My name is…”), forming questions to support dialogues, the greetings, a few power nouns and verbs, such as ‘to be’ and ‘have’, and key glue words (the, of, and, or, but, etc.). Even 20 are sufficient to cover pronunciation, the basics of sentence formation, first grammar rules, and support simple dialogue, as an over-simplified illustration of how a ‘language works’.
While dreaming, why stop at 1,000 words? At 2,000 words one is pretty much ‘fluent’ in daily conversation and at 5,000 can make good sense of a newspaper. There is no reason, in theory, it could not include the whole language (dictionaries) right up to the complexities of Scientific English. If you would like to collaborate, please contact me at unityinwriting(at)gmail(dot)com.
Featured image credit: Quote by Maialisa. CC0 public domain via Pixabay.
Interesting subject! I have been studying languages since an early age. So far, I have studied ten languages. I have been quick to learn and still am, but there is one reality. If I am not using the languages I have learned, they go back to their storage/hiding boxes in the language centre of my brain.
Anyway, it is quite easy to re-learn them! It is also helpful to use the languages you have learned in your daily life, just for fun. That is another way to keep them awake.
Have a good use of the words.
The proposal sounds beguilingly simple, but I wonder how far the author got in reading Hindi or Japanese newspapers. Pronunciation may sound simple, but Westerners are not used to hearing and using differences in tone, which are crucial to Chinese and other southeast Asian languages. Coming from a scientist, the idea seemed at first to point to the international commonality of scientific and institutional vocabulary (even some formal Russian is transparent for this reason), but everyday vocabulary is apt to be maximally different. Spending time learning Chinese characters is probably more practically beneficial for the future. 1,000 characters can go a long way.
Wikipedia has frequency lists for some languages, e.g. English and German. Some good search terms for scientific articles are lexicostatistics, linguostatistics, quantitative linguistics, corpus linguistics, Basic English, frequency dictionary. If you can read German, works (published) by Reinhard Köhler and Gabriel Altmann are a good source for quantitative linguistics, e.g. the series Glottometrics.
Always inspired by you, Dawn. I was waiting for this article, and here it is. While developing my Hindi course, I have tried to keep your guidelines in mind, and it will become more optimized with time. A lot of the content was pre-made, so I made a lot of changes to my course recently, in January, to cover more high-frequency words.
It’s sad to see so many language learning programs/apps/websites with so many resources, yet very few or none of them are frequency based. Even I was working more on how to deliver lessons or information interactively, but until you came along, I was not aware of what to deliver interactively, so I felt it would be insensitive as a language content producer. I am totally open to changing and developing Hindi learning content in this way. I am so keen to work with you on the 10X Languages in a year project.
Meanwhile, I am wondering how we can get or create a Hindi word frequency list. I couldn’t get my hands on one yet.
Some people have a wonderful ease picking up languages! They can learn language almost any way it is served up. 1000 words is a taster: the point is to be able to move quickly through a selection of languages and to build a ‘core’, just a starting point. Reading newspapers takes around 5000 words and you’ll still be looking up words. But if deep learning is your goal, you could still benefit from tackling the front of the word frequency curve and moving through it in order, especially if lists were freely available and matched to lots of interesting content.
Anil – always ready to work with you. I have since used this method to get reading Portuguese in a week (with help) and have made my first Hindi spreadsheet! Since I learned Hindi from Bollywood I have a vocabulary of low frequency words and so can’t say anything. But I know some simple things like ‘my name is’ and ‘what is this’ so now I need to fill in the gaps! Working on more articles.
As for frequency lists, I hope to write more on that in the future. I have started with COCA after looking around, but more on that in future articles – and why frequency is context-dependent!