Languages of science

Scientific languages are vehicular languages used by one or several scientific communities for international communication. According to science historian Michael Gordin, they are "either specific forms of a given language that are used in conducting science, or they are the set of distinct languages in which science is done."

Until the 19th century, classical languages such as Latin, Classical Arabic, Sanskrit, and Classical Chinese were commonly used across Afro-Eurasia for the purpose of international scientific communication. A combination of structural factors, including the emergence of nation-states in Europe, the Industrial Revolution and the expansion of colonization, led to the global use of three European national languages: French, German and English. Yet new languages of science such as Russian or Italian had started to emerge by the end of the 19th century, to the point that international scientific organizations started to promote the use of constructed languages like Esperanto as a non-national global standard.

After the First World War, English gradually outpaced French and German and became the leading language of science, but not the only international standard. Research in the Soviet Union rapidly expanded in the years following the Second World War, and access to Russian journals became a major policy issue in the United States, prompting the early development of machine translation. In the last decades of the 20th century, an increasing number of scientific publications used primarily English, in part due to the preeminence of English-speaking scientific infrastructures, indexes and metrics like the Science Citation Index. Local languages nevertheless remain scientifically relevant in major countries and world regions such as China, Latin America and Indonesia. Disciplines and fields of study with a significant degree of public engagement, such as the social sciences, environmental studies and medicine, have also maintained the relevance of local languages.

The development of open science has revived the debate over linguistic diversity in science, as social and local impact has become an important objective of open science infrastructures and platforms. In 2019, 120 international research organizations co-signed the Helsinki Initiative on Multilingualism in Scholarly Communication and called for supporting multilingualism and the development of "infrastructure of scholarly communication in national languages".[1] The 2021 UNESCO Recommendation on Open Science includes "linguistic diversity" as one of the core features of open science, as it aims to "make multilingual scientific knowledge openly available, accessible and reusable for everyone." In 2022, the Council of the European Union officially supported "initiatives to promote multilingualism" in science, such as the Helsinki Initiative.

History

From classical languages to vernaculars

Until the 19th century, classical languages played an instrumental role in the diffusion of scientific knowledge in Europe, Asia and North Africa.

In Europe, starting in the 12th century, Latin was the primary language of religion, law and administration until the Early Modern period. It became a language of science "through its encounter with Arabic"; during the Renaissance of the 12th century, a large corpus of Arabic scholarly texts was translated into Latin, in order to make it available to the emerging network of European universities and centers of knowledge. In this process, the Latin language changed and acquired the specific features of scholastic Latin through numerous lexical and even syntactic borrowings from Greek and Arabic. The use of scientific Latin persisted long after the replacement of Latin by vernacular languages in most European administrations: "Latin's status as a language of science rested on the contrast it made with the use of the vernacular in other contexts" and created "a European community of learning" entirely distinct from the local communities where the scholars lived. Latin was never the sole language of science and education. Beyond local publications, vernaculars attained the status of international scientific languages early on, and could be expected to be understood and translated across Europe. In the mid-16th century, a significant amount of printed output in France was in Italian.

In South Asia, Sanskrit was a leading vehicular language for science. Sanskrit was remodeled even more radically than Latin for the purpose of scientific communication, as it shifted "toward ever more complex noun forms to encompass the kinds of abstractions demanded by scientific and mathematical thinking." Classical Chinese held a similarly prestigious position in East Asia, being largely adopted by scientific and Buddhist communities beyond the Chinese Empire, notably in Japan and Korea.

Classical languages declined throughout Eurasia during the 2nd millennium. Sanskrit was increasingly marginalized after the 13th century. Until the end of the 17th century, there was no clear trend of displacement of Latin in Europe by vernacular languages: in the 16th century, medical books in France started to use French as well, but this trend was reversed after 1597, and most medical literature in France remained accessible only in Latin until the 1680s. In 1670, as many books were printed in Latin as in German in the German states; by 1787, Latin books accounted for no more than 10%. At this point, the decline became irreversible: since fewer and fewer European scholars were conversant with Latin, publications dwindled and there was less incentive to maintain linguistic training in Latin.

The emergence of scientific journals was both a symptom and a cause of the declining use of classical languages. The first two modern scientific journals were both launched in 1665: the Journal des Sçavans in France and the Philosophical Transactions of the Royal Society in England. They both used the local vernacular, which "made perfect historical sense", as both the Kingdom of France and the Kingdom of England were engaged in an active policy of promoting a standard national language.

French, English, German and the quest for an auxiliary language (1800–1920)

The gradual disuse of Latin opened an uneasy transition period as more and more works were only accessible in local languages. Many national European languages held the potential to become a language of science within a specific research field: some scholars "took measures to learn Swedish so they could follow the work of [the Swedish chemist] Bergman and his compatriots."

Language preferences and use across scientific communities gradually consolidated into a triumvirate, or triad, of dominant languages of science: French, English and German. While each language would be expected to be understood for the purpose of international scientific communication, they also followed "different functional distributions evident in various scientific fields". French had almost been acknowledged as the international standard of European science in the late 18th century, and remained "essential" throughout the 19th century. German became a major scientific language during the 19th century, as it "covered portions of the physical sciences, particularly physics and chemistry, plus mathematics and medicine." English was largely used by researchers and engineers, due to the seminal contribution of English technology to the Industrial Revolution.

In the years preceding the First World War, the linguistic diversity of scientific publications increased significantly. The emergence of modern nationalities and early decolonization movements created new incentives to publish scientific knowledge in one's national language. Russian was one of the most successful of the new languages of science. In the 1860s and 1870s, Russian researchers in chemistry and other physical sciences ceased to publish in German in favor of local periodicals, following a major effort of adapting and creating names for scientific concepts and objects (such as chemical compounds). A controversy over the meaning of the periodic table of Dmitri Mendeleev contributed to the acknowledgement of original publications in Russian in the global scientific debate: the original version was deemed more authoritative than its first "imperfect" translation into German.

Linguistic diversity became framed as a structural problem that ultimately limited the spread of scientific knowledge. In 1924, the linguist Roland Grubb Kent underlined that scientific communication could be significantly disrupted in the near future by the use of as many as "twenty" languages of science.

The definition of an auxiliary language for science became a major issue discussed in the emerging international scientific institutions. On January 17, 1901, the newly established International Association of Academies created a Delegation for the Adoption of an International Auxiliary Language "with support from 310 member organizations". The Delegation was tasked with finding an auxiliary language that could be used for "scientific and philosophical exchanges" and could not be any "national language": in the context of increased nationalistic tensions, any of the dominant languages of science would have appeared as a non-neutral choice. The Delegation consequently had a limited set of options, which included the unlikely revival of a classical language like Latin or a newly constructed language such as Volapük, Idiom Neutral or Esperanto.

Throughout the first part of the 20th century, Esperanto was seriously considered as a potential international language of science. As late as 1954, UNESCO passed a recommendation to promote the use of Esperanto for scientific communication. In contrast with Idiom Neutral or Interlingua, a simplified version of Latin, Esperanto was not primarily conceived as a scientific language. Yet, by the early 1900s, it was by far the most successful constructed language, with a large international community as well as numerous dedicated publications. Starting in 1904, the Internacia Scienca Revuo aimed to adapt Esperanto to the specific needs of scientific communication. The development of a specialized technical vocabulary was a challenging task, as the extensive derivation system of Esperanto made it complicated to directly import words commonly used in German, French or English scientific publications. In 1907, the Delegation for the Adoption of an International Auxiliary Language seemed close to retaining Esperanto as its preferred language. Significant criticism was nevertheless still directed at a few remaining complexities of the language, as well as at its lack of scientific purpose and technical vocabulary. Unexpectedly, the Delegation supported a new variant of Esperanto, Ido, which was submitted very late in the process by an unknown contributor. While it was framed as a compromise between the Esperantist and anti-Esperantist factions, this decision ultimately disappointed all the proponents of an international medium for scientific communication and durably harmed the adoption of constructed languages in academic circles.

A transition period: English, new competitors and machine translation (1920–1965)

The two world wars had a lasting impact on scientific languages. A combination of political, economic and social factors durably weakened the triumvirate of the three main languages of science of the 19th century and paved the way for the domination of English in the latter part of the 20th century. There is still ongoing debate as to whether the world wars accelerated a structural tendency toward English predominance or merely created the conditions for it. For Ulrich Ammon, "even without the World Wars the English language community would have gained economic and, consequently, scientific superiority and, thus, preference of its language for international scientific communication." In contrast, Michael Gordin underlines that until the 1960s the privileged status of English was far from settled.

The First World War had an immediate impact on the global use of German in academic settings. For nearly a decade after the First World War, German researchers were boycotted by international scientific events. The German scientific communities had been compromised by nationalistic propaganda in favor of German science during the war, as well as by the exploitation of scientific research for war crimes. German was no longer acknowledged as a global scientific language. While the boycott did not last, its effects were long-lasting. In 1919, the International Research Council was created to replace the International Association of Academies, using only French and English as working languages. In 1932, almost all (98.5%) international scientific conferences admitted contributions in French, 83.5% in English and only 60% in German. In parallel, the focus of German periodicals and conferences had become increasingly local, and they less and less frequently included research from non-Germanic countries. German never recovered its privileged status as a leading language of science in the United States, and due to the lack of alternatives beyond French, American education became "increasingly monoglot" and isolationist. Although not affected by the international boycott, the use of French reached "a plateau between the 1920s and 1940s": while it did not decline, neither did it profit from the marginalization of German, and it instead decreased relative to the expansion of English.

The rise of totalitarianism in the 1930s reinforced the status of English as the leading scientific language. In absolute terms, German publications retained some relevance, but German scientific research was structurally weakened by anti-Semitic and political purges, the rejection of international collaborations, and emigration. The German language was not boycotted again in international scientific conferences after the Second World War, as its use had quickly become marginal, even in Germany itself: even after the end of the occupation, English in the West and Russian in the East became major vehicular languages for higher education.

In the two decades following the Second World War, English became the leading language of science. However, a large share of global research continued to be published in other languages, and language diversity even seemed to increase until the 1960s. Russian publications in numerous fields, especially chemistry and astronomy, grew rapidly after the war: "in 1948, more than 33% of all technical data published in a foreign language now appeared in Russian." In 1962, Christopher Wharton Hanson still raised doubts about the future of English as the leading language in science, with Russian and Japanese rising as major languages of science and the newly decolonized states seemingly poised to favor local languages.

The expansion of Russian scientific publication became a source of recurring tensions in the United States during the first decades of the Cold War. Very few American researchers were able to read Russian, which contrasted with a still widespread familiarity with the two older languages of science, French and German: "In a 1958 survey, 49% of American scientific and technical personnel claimed they could read at least one foreign language, yet only 1.2% could handle Russian." Science administrators and funders had recurring fears that they were unable to efficiently track the progress of academic research in the USSR. This ongoing anxiety became an overt crisis after the successful launch of Sputnik in 1957, as the decentralized American research system seemed for a time outpaced by the efficiency of Soviet planning.

Although the Sputnik crisis did not last long, it had far-reaching consequences for linguistic practices in science, in particular the development of machine translation. Research in this area emerged very early: automated translation appeared as a natural extension of code-breaking, the initial purpose of the first computers. Despite the initial reluctance of leading figures in computing like Norbert Wiener, several well-connected science administrators in the US, like Warren Weaver and Léon Dostert, set up a series of major conferences and experiments in the nascent field, out of a concern that "translation was vital to national security". On January 7, 1954, Dostert coordinated the Georgetown–IBM experiment, which aimed to demonstrate that the technique was sufficiently mature despite the significant shortcomings of the computing infrastructure of the time: some sentences from Russian scientific articles were automatically translated using a dictionary of 250 words and six basic syntax rules. It was not made clear at the time that the sentences had been purposely selected for their fitness for automated translation. At most, Dostert argued that "scientific Russian" was easier to translate since it was more formulaic and less grammatically diverse than day-to-day Russian.
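
The Georgetown–IBM system itself is not reproduced here, but a minimal, purely illustrative sketch of the dictionary-plus-rules approach it relied on could look as follows; the transliterated vocabulary, the case tags and the two rules are invented for illustration and do not reproduce the original 250-word dictionary or its six rules.

    # Toy illustration of 1950s-style rule-based translation: a word-for-word
    # dictionary plus a couple of hand-written rules. Vocabulary and rules are
    # invented for illustration only.
    DICTIONARY = {
        "kachestvo": ("quality", None),
        "uglia": ("coal", "genitive"),                          # genitive ending
        "opredeliaetsia": ("is determined", None),
        "kaloriinostiu": ("calorific value", "instrumental"),   # instrumental ending
    }

    def translate(sentence: str) -> str:
        glosses = []
        for word in sentence.lower().split():
            gloss, case = DICTIONARY.get(word, (word, None))
            # Rule 1: render the genitive case with English "of".
            if case == "genitive":
                gloss = "of " + gloss
            # Rule 2: render the instrumental case with English "by".
            elif case == "instrumental":
                gloss = "by " + gloss
            glosses.append(gloss)
        return " ".join(glosses)

    print(translate("Kachestvo uglia opredeliaetsia kaloriinostiu"))
    # -> quality of coal is determined by calorific value

Scaling this approach up is precisely what proved difficult: every new construction requires another hand-written rule, and every new language pair another dictionary.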

Machine translation became a major priority in federal research funding in 1956 due to an emerging arms race with Soviet researchers. While the Georgetown–IBM experiment did not have a large impact at first in the United States, it was immediately noticed in the USSR. The first Soviet articles in the field appeared in 1955, and only one year later a major conference was held, attracting 340 representatives. In 1956, Léon Dostert secured large funding with the support of the CIA and had enough resources to overcome the technical limitations of the existing computing infrastructure: in 1957, automated translation from Russian to English could run on a vastly expanded dictionary of 24,000 words and rely on hundreds of predefined syntax rules. At this scale, automated translation remained costly, as it relied on numerous computer operators using thousands of punch cards. Yet the quality of the output did not progress significantly: in 1964, the automated translation of the few sentences submitted during the Georgetown–IBM experiment yielded a much less readable output, as it was no longer possible to tweak the rules on a predefined corpus.

English as a global standard (1965 onwards)

During the 1960s and the 1970s, English became not merely a majority language of science but a scientific lingua franca. The transformation had more wide-ranging consequences than the substitution of two or three main languages of science by a single one: it marked "the transition from a triumvirate that valued, at least in a limited way, the expression of identity within science, to an overwhelming emphasis on communication and thus a single vehicular language." Ulrich Ammon characterizes English as an "asymmetrical lingua franca", as it is "the native tongue and the national language of the most influential segment of the global scientific community, but a foreign language for the rest of the world." This paradigm is usually connected with the globalization of American and English-speaking culture in the later part of the 20th century.

No specific event accounts for the entire shift, although numerous transformations highlight an accelerated conversion to English-language science in the later part of the 1960s. On June 11, 1965, President Lyndon B. Johnson stated that the English language had become a lingua franca that opened "doors to scientific and technical knowledge" and whose promotion should be a "major policy" of the United States. In 1969, the most prestigious abstract collection in chemistry of the early 20th century, the German Chemisches Zentralblatt, disappeared: this polyglot compilation covering 36 languages could no longer compete with the English-focused Chemical Abstracts, as more than 65% of publications in the field were in English. By 1982, the Comptes rendus of the Académie des Sciences admitted that "English is by now the international standard language of science and it could very nearly become its unique language" and that it was already the main "mean of communication" in European countries with a long-standing tradition of publication in the local language, like Germany and Italy. In the European Union, the Bologna Declaration of 1999 "obliged universities throughout Europe and beyond to align their systems with that of the United Kingdom" and created strong incentives to publish academic results in English. From 1999 to 2014, the number of English-taught courses in European universities increased ten-fold.

Machine translation, which had been booming since 1954 thanks to Soviet–American competition, was immediately affected by the new paradigm. In 1964, the National Science Foundation underlined that "there is no emergency in the field of translation" and that translators were easily up to the task of making foreign research accessible. Funding stopped simultaneously in the United States and the Soviet Union, and machine translation did not recover from this research "winter" until the 1980s; by then, the translation of scientific publications was no longer the main incentive. Research in this area was still pursued in a few countries where bilingualism was an important political and cultural issue: in Canada, the METEO system was successfully set up to "translate weather forecasts from English into French".

English content gradually became prevalent in originally non-English journals, first as an additional language and then as the default language. In 1998, seven leading European physics journals published in local languages (including Acta Physica Hungarica, Anales de Física, Il Nuovo Cimento, Journal de Physique, Portugaliae Physica and Zeitschrift für Physik) merged to become the European Physical Journal, an international journal accepting only English submissions. The same process occurred repeatedly in less prestigious publications.

Early scientific infrastructures have been a leading factor in the conversion to a single vehicular language. Critical developments in applied scientific computing and information retrieval systems occurred in the United States from the 1960s onwards. The Sputnik crisis was the main incentive, as it "turned the librarians' problem of bibliographic control into a national information crisis" and favored ambitious research plans like SCITEL (an ultimately failed proposal to create a centrally planned system of electronic publication in the early 1960s), MEDLINE (for medical journals) or NASA/RECON (for aerospace research and engineering). In contrast with the decline of machine translation, scientific infrastructures and databases became a profitable business in the 1970s. Even before the emergence of global networks like the World Wide Web, "it was estimated in 1986 that fully 85% of the information available in worldwide networks was already in English."

The predominant use of English was not limited to the architecture of networks and infrastructures but affected the content as well. The Science Citation Index, created by Eugene Garfield on the ruins of SCITEL, had a massive and lasting influence on the structure of global scientific publication in the last decades of the 20th century, as its most important metric, the Journal Impact Factor, "ultimately came to provide the metric tool needed to structure a competitive market among journals." The Science Citation Index had better coverage of English-language journals, which yielded them stronger Journal Impact Factors and created incentives to publish in English: "Publishing in English placed the lowest barriers toward making one's work "detectable" to researchers." Due to the convenience of dealing with a monolingual corpus, Eugene Garfield called for acknowledging English as the only international language for science.
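
For reference, the Journal Impact Factor mentioned above is conventionally computed as a two-year citation ratio; this standard definition is given here for clarity and is not drawn from the text itself:

    \[
    \mathrm{JIF}_{Y} \;=\;
    \frac{\text{citations received in year } Y \text{ to items published in } Y-1 \text{ and } Y-2}
         {\text{number of citable items published in } Y-1 \text{ and } Y-2}
    \]

Since the citation counts in the numerator are drawn from the journals covered by the index, journals outside the (mostly English-language) corpus mechanically receive lower scores.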

Current trends

English standardization

Nearly all the scientific publications indexed by the leading commercial academic search engines are in English. In 2022, this concerned 95.86% of the 28,142,849 references indexed on the Web of Science and 84.35% of the 20,600,733 references indexed on Scopus.

The lack of coverage of non-English languages creates a feedback loop, as non-English publications may be held less valuable since they are not indexed in international rankings and fare poorly in evaluation metrics. As many as 75,000 articles, book titles and book reviews from Germany were excluded from Biological Abstracts from 1970 to 1996. In 2009, at least 6,555 journals were published in Spanish and Portuguese on a global scale, and "only a small fraction are included in the Scopus and Web of Science indices."

Criteria for inclusion in commercial databases not only favor English journals but incentivize non-English journals to give up on their local language: they "demand that articles be in English, have abstracts in English, or at least have their references in English". In 2012, the Web of Science was explicitly committed to the anglicization (and romanization) of published knowledge.

This commitment toward English science has a significant performative effect. The influence commercial databases "now wield on the international stage is considerable and works very much in favor of English", as they provide a wide range of indicators of research quality. They have contributed to "large-scale inequality, notably between Northern and Southern countries". While leading scientific publishers had initially "failed to grasp the significance of electronic publishing," they had successfully pivoted to a "data analytics business" by the 2010s. Actors like Elsevier or Springer are increasingly able to control "all aspects of the research lifecycle, from submission to publication and beyond". Due to this vertical integration, commercial metrics are no longer restricted to journal article metadata but can include a wide range of individual and social data extracted from scientific communities.

National databases of scientific publications show that the use of English continued to expand in the 2000s and the 2010s at the expense of local languages. A comparison of seven national databases in Europe from 2011 to 2014 shows that in "all countries, there was a growth in the proportion of English publications". In France, data from the Open Science Barometer show that the share of publications in French shrank from 23% in 2013 to 12–16% by 2019–2020.[2]

For Ulrich Ammon, the predominance of English has created a hierarchy and a "central-peripheral dimension" within the global scientific publication landscape that negatively affects the reception of research published in languages other than English. The exclusive use of English has discriminatory effects on scholars who are not sufficiently conversant in the language: in a survey organized in Germany in 1991, 30% of researchers across all disciplines gave up on publication whenever English was the only option. In this context, the emergence of new scientific powers is no longer linked with the appearance of a new language of science, as was the case until the 1960s. China has quickly become a major player in international research, ranking second behind the United States in numerous rankings and disciplines. Yet most of this research is published in English and abides by the linguistic norms set by commercial indexes.

The dominant position of English has also been strengthened by the "lexical deficit" accumulated over the past decades by alternative languages of science: after the 1960s, "new terms were being coined in English at a much faster rate than they were being created in French."

Persistence of linguistic diversity

Several languages have retained a secondary status as international languages of science, either due to the extent of the local scientific production or to their continued use as vehicular languages in specific contexts. This generally includes "Chinese, French, German, Italian, Japanese, Russian, and Spanish." Local languages have remained prevalent in major scientific countries: "most scientific publications are still published in Chinese in China".

Empirical studies of the use of languages in scientific publications have long been constrained by structural bias in the most readily accessible sources: commercial databases like the Web of Science. Unprecedented access to larger corpora not covered by global indexes has shown that multilingualism remains non-negligible, although it is little studied: as of 2022 there were "few examples of analyses at scale" of multilingualism in science. In seven European countries where the local language has a limited international reach, one third of researchers in the social sciences and the humanities publish in two or more languages: "research is international, but multilingual publishing keeps locally relevant research alive with the added potential for creating impact." Due to the discrepancy between actual practices and their visibility, multilingualism has been described as a "hidden norm of academic publication".

Overall, the social sciences and the humanities have preserved more diverse linguistic practices: "while natural scientists of any linguistic background have largely shifted to English as their language of publication, social scientists and scholars of the humanities have not done so to the same extent." In these disciplines, the need for global communication is balanced by an involvement in local culture: "the SSH are typically collaborating with, influencing and improving culture and society. To achieve this, their scholarly publishing is partly in the native languages." Yet the specificity of the social sciences and the humanities has been increasingly reduced since 2000: by the 2010s, a large proportion of German and French articles in the arts and humanities indexed in the Web of Science were in English. While German has been outpaced by English even in German-speaking countries since the Second World War, it has also continued to be used marginally as a vehicular scientific language in specific disciplines or research fields (the Nischenfächer or "niche disciplines"). Linguistic diversity is not specific to the social sciences, but its persistence may be obscured by the high prestige attached to international commercial databases: in the Earth sciences, "the proportion of English-language documents in the regional or national databases (KCI, RSCI, SciELO) was approximately 26%, whereas virtually all the documents (approximately 98%) in Scopus and WoS were in English."

Beyond the generic distinction between the social and the natural sciences, there are finer-grained distributions of language practices. In 2018, a bibliometric analysis of the publications of eight European countries in the social sciences and humanities (SSH) highlighted that "patterns in the language and type of SSH publications are related not only to the norms, culture, and expectations of each SSH discipline but also to each country's specific cultural and historic heritage." Use of English was more prevalent in Northern Europe than in Eastern Europe, and publication in local languages remains especially significant in Poland due to a large "'local' market of academic output". Local research policies may have a significant impact, as a preference for international commercial databases like Scopus or the Web of Science may account for a steeper decline of publications in the local language in the Czech Republic compared with Poland. Additional factors include the distribution of economic models among journals: non-commercial publications have a much stronger "language diversity" than commercial publications.

Since the 2000s, the expansion of digital collections has contributed to a relative increase in linguistic diversity in academic indexes and search engines. The Web of Science enhanced its regional coverage during the 2005–2010 period, whose effect was to "increase the number of non-English papers such as Spanish papers". In the Portuguese research communities, there was a steep rise in Portuguese-language papers in commercial indexes during the 2007–2018 period, which is indicative both of remaining "spaces of resilience and contestation of some hegemonic practices" and of a potential new paradigm of scientific publishing "steered towards plurilingual diversity". Multilingualism as a practice and a competency has also increased: in 2022, 65% of early career researchers in Poland had published in two or more languages, whereas only 54% of the older generations had done so.

In 2022, Bianca Kramer and Cameron Neylon led a large-scale analysis of the metadata available for 122 million Crossref objects indexed with a DOI. Overall, non-English publications make up "less than 20%", although they may be underestimated due to a lower adoption rate of DOIs or the use of local identifier systems (like the Chinese National Knowledge Infrastructure). Yet multilingualism seems to have increased over the past 20 years, with significant growth of publications in Portuguese, Spanish and Indonesian.
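
A much smaller-scale language audit of this kind can be sketched against the public Crossref REST API, which supports random sampling of work records and exposes an optional "language" metadata field; the snippet below is an illustrative approximation, not a reproduction of Kramer and Neylon's method, and the field is missing for many records.

    # Rough sketch: tally the optional "language" field over random samples of
    # Crossref work records. Illustrative only; field coverage is partial.
    from collections import Counter
    import requests

    def sample_languages(batches: int = 5, rows: int = 100) -> Counter:
        counts = Counter()
        for _ in range(batches):
            resp = requests.get(
                "https://api.crossref.org/works",
                params={"sample": rows},  # random sample of work records
                timeout=30,
            )
            resp.raise_for_status()
            for item in resp.json()["message"]["items"]:
                counts[item.get("language", "unknown")] += 1
        return counts

    for lang, n in sample_languages().most_common():
        print(lang, n)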

Machine translation

Scientific publication was the first major use case of machine translation, with early experiments going back to 1954. Developments in this area slowed after 1965, due to the increasing domination of English, the limitations of the computing infrastructure, and the shortcomings of the leading approach, rule-based machine translation. Rule-based methods favored, by design, translations between a few major languages (English, Russian, French, German...), as a "transfer module" had to be developed for "each pair of languages", which quickly led to a combinatorial explosion whenever more languages were contemplated. After the 1980s, the field of machine translation was revived as it underwent a "full-scale paradigm shift": explicit rules were replaced by statistical and machine learning methods applied to large aligned corpora. By then, most of the demand no longer stemmed from scientific publication but from commercial translation, such as technical and engineering manuals. A second paradigm shift occurred in the 2010s, with the development of deep learning methods that can be partially trained on non-aligned corpora ("zero-shot translation"). Requiring little supervision, deep learning models make it possible to incorporate a wider diversity of languages, but also a wider diversity of linguistic contexts within one language. The results are significantly more accurate: after 2018, the automated translation of PubMed abstracts was deemed better than human translation for a few language pairs (like English to Portuguese). Scientific publications are a rather fitting use case for neural-network translation models, since they work best "in restricted fields for which it has a lot of training data."
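
The "combinatorial explosion" mentioned above can be made concrete with a simple count (an illustrative calculation, not taken from the source): a rule-based system needs one transfer module per ordered pair of languages, whereas routing every translation through a single pivot language only needs one module to and one from the pivot for each other language.

    \[
    \underbrace{n(n-1)}_{\text{direct pairwise transfer}}
    \quad\text{vs.}\quad
    \underbrace{2(n-1)}_{\text{via a single pivot}}
    \qquad\text{e.g. } n = 20:\ 380 \text{ modules vs. } 38.
    \]

This economy is also one reason why English so often ended up as the hidden pivot language discussed below.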

As of 2021, there were "few in-depth studies on the efficiency of Machine Translation in social science and the humanities", as "most research in translation studies are focused on technical, commercial or law texts". Uses of machine translation are especially difficult to estimate and ascertain, as freely accessible tools like Google Translate have become ubiquitous: "There is an emerging yet rapidly increasing need for machine translation literacy among members of the scientific research and scholarly communication communities. Yet in spite of this, there are very few resources to help these community members acquire and teach this type of literacy."

In an academic setting, machine translation covers a variety of uses. The production of written translations remains constrained by a lack of accuracy and, consequently, of efficiency, as post-editing an imperfect translation is only worthwhile if it takes less time than human translation. Automated translation of foreign-language texts in the context of literature surveys or "information assimilation" is more widespread, as the quality requirements are generally lower and a global understanding of a text is sufficient. The impact of machine translation on linguistic diversity in science depends on these uses.

Increased use of machine translation has created concerns about "uniform multilingualism". Research in the field has largely been focused on English and a few major European languages: "While we live in a multilingual world, this is paradoxically not taken into account by machine translation". English has frequently been used as a "pivotal" language, serving as a hidden intermediary step for translation between two non-English languages. Probabilistic methods tend to favor the most expected translation given the training corpus and to rule out more unusual alternatives: "A common argument against the statistical methods in translation is that when the algorithm suggests the most probable translation, it eliminates alternative options and makes the language of the text so produced conform to well-documented modes of expression." While deep learning models are able to deal with a wider diversity of language constructs, they can still be limited by collection biases in the original corpus: "the translation of a word can be affected by the prevailing theories or paradigms in the corpus harvested to train the AI".

In its 2022 conclusions on research assessment and open science, the Council of the European Union welcomed the "promising developments that have recently emerged in the area of automatic translation" and supported a more widespread use of "semi-automatic translation of scholarly publications within Europe" due to its "major potential in terms of market creation".

Open science and multilingualism

Open science infrastructures

The development of open science infrastructures, or "community-controlled infrastructure", has become a major policy issue of the open science movement. In the 2010s, the expansion of commercial scientific infrastructures created a broad acknowledgment of the fragility of open scholarly publishing and open archives. The concept of open science infrastructure emerged in 2015 with the publication of the Principles for Open Scholarly Infrastructures. In November 2021, the UNESCO Recommendation acknowledged open science infrastructures as one of the four pillars of open science, along with open scientific knowledge, open engagement of societal actors and open dialogue with other knowledge systems, and called for sustained investment and funding: "open science infrastructures are often the result of community-building efforts, which are crucial for their long-term sustainability and therefore should be not-for-profit and guarantee permanent and unrestricted access to all public to the largest extent possible." Examples of open science infrastructures include indexes, publishing platforms, shared databases and computing grids.

Open infrastructures have supported linguistic diversity in science. The leading free software for scientific publishing, Open Journal Systems, is available in 50 languages[3] and is widespread among non-commercial open access journals. A 2021 landscape study by SPARC shows that European open science infrastructures "provide access to a range of language content of local and international significance." In 2019, leading open science infrastructures endorsed the Helsinki Initiative on Multilingualism in Scholarly Communication and thus committed to "protect national infrastructures for publishing locally relevant research."[1] Signatories include the DOAJ, DARIAH, Latindex, OpenEdition, OPERAS and SPARC Europe.[4]

In contrast with commercial indexes, the Directory of Open Access Journals does not prescribe the use of English. Consequently, only half of the journals it indexes are primarily published in English, which stands in stark contrast with the large prevalence of English in commercial indexes like the Web of Science (over 95%). Six languages are represented by more than 500 journals: Spanish (2776 journals, or 19.3%), Portuguese (1917 journals), Indonesian (1329 journals), French (993 journals), Russian (733 journals) and Italian (529 journals). Most of the language diversity is due to non-commercial (or diamond open access) journals: 25.7% of these publications accept contributions in Spanish, against only 2.4% of APC-based journals. Over the 2020–2022 period, "for English articles in DOAJ journals, 21% are in non-APC journals, but for articles in languages other than English, this percentage is a massive 86%."

Non-English open infrastructures have experienced significant growth: in 2022, "national repositories and databases are growing everywhere (see the databases such as Latindex in Latin America, or the new repositories in Asia, China, Russia, India)". This development opens up new research opportunities for the study of multilingualism in a scientific context: it will become increasingly feasible to study "differences between locally published research in non-English speaking contexts and English-speaking international authors".

Multilingualism and social impact

Publication on open access platforms has created new incentives for publishing in a local language. In commercial indexes, non-English publications were penalized by the lack of international reception and had significantly lower impact factors. Without a paywall, local-language publications can find their own specific audience among a large non-academic public that may be less competent in English.

In the 2010s, quantitative studies started to highlight the positive impact of local languages on the reuse of open access resources in varied national contexts such as Finland, Québec, Croatia and Mexico. A study of the Finnish platform Journal.fi shows that the audience of Finnish-language articles is significantly more diverse: "in case of the national language publications students (42%) are clearly the largest group, and besides researchers (25%), also private citizens (12%) and other experts (11%)". Comparatively, English-language publications attract mostly professional researchers. Due to their ease of access, open science platforms in a local language can also attain a more global reach: the French-Canadian journal consortium Érudit has a mostly international audience, with less than one third of its readers coming from Canada.

The development of a strong network of open science infrastructures in South America (such as SciELO or Redalyc) and the Iberian region has contributed to the resurgence of Spanish and Portuguese in international scientific communication: regional growth "may also be associated with the boom in open access publishing. Both Portuguese and Spanish (as well as Brazil and Spain) play important roles in open access publishing."

While multilingualism has been either neglected or even discriminated against in commercial databases, it has been valued as a significant component of the social impact of open science platforms and infrastructures. In 2015, Juan Pablo Alperin introduced a systematic measure of social impact that highlighted the relevance of scientific content for local communities: "By looking at a broad range of indicators of impact and reach, far beyond the typical measures of one article citing another, I argue, it is possible to gain a sense of the people that are using Latin American research, thereby opening the door for others to see the ways in which it has touched those individuals and communities." In this context, new indicators of linguistic diversity have been proposed, including the PLOTE-index and the Linguistic Diversity Index. Yet, as of 2022, they have had "limited traction in the scholarly anglophone literature". Comprehensive indicators for the local impact of research remain largely non-existent: "many aspects of research cannot be measured quantitatively, especially its socio-cultural impact."
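
As an illustration of what such an indicator can measure, a generic diversity index over the language shares of a publication portfolio can be computed as below; this is a standard Gini–Simpson-style measure given for orientation only, and not necessarily how the PLOTE-index or the Linguistic Diversity Index mentioned above are defined.

    # Generic illustration of a linguistic diversity measure: the probability
    # that two randomly chosen publications are written in different languages
    # (a Gini-Simpson-style index). 0.0 means fully monolingual output.
    from collections import Counter

    def language_diversity(languages: list[str]) -> float:
        counts = Counter(languages)
        total = sum(counts.values())
        return 1 - sum((n / total) ** 2 for n in counts.values())

    # Hypothetical portfolio: mostly English with some local-language output.
    portfolio = ["en"] * 70 + ["pt"] * 20 + ["es"] * 10
    print(round(language_diversity(portfolio), 3))  # -> 0.46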

Policies in favor of multilingualism

A new scientific and policy debate over linguistic diversity emerged after 2015: "in recent years, policies for Responsible Research and Innovation (RRI) and Open Science call for increasing access to research, interaction between science and society and public understanding of science". It initially stemmed from a wider discussion over the evaluation of open science and the limitations of commercial metrics: in 2015, the Leiden Manifesto issued ten principles to "guide research evaluation" that included a call to "protect excellence in locally relevant research". Building on empirical data showing the persistence of non-English research communities in Europe, Gunnar Sivertsen theorized in 2018 the need for a balanced multilingualism: "to consider all the communication purposes in all different areas of research, and all the languages needed to fulfil these purposes, in a holistic manner without exclusions or priorities." In 2016, Sivertsen had already contributed to the "Norwegian model" of scientific evaluation by proposing a flat hierarchy between a few large international journals and a wide selection of journals that would not discriminate against local publications, and by encouraging journals in the social sciences and the humanities to favor Norwegian publications.

These local initiatives developed into a new international movement in favor of multilingualism. In 2019, 120 research organizations and several hundred individual researchers co-signed the Helsinki Initiative on Multilingualism in Scholarly Communication. The declaration includes three principles:

  1. "Support dissemination of research results for the full benefit of the society", which implies that they should be available "in a variety of languages".
  2. "Protect national infrastructures for publishing locally relevant research" through a specific support of the non-commercial/diamond model "make sure not for-profit journals and book publishers have both sufficient resources". Non-commercial journals are more likely to be published in a local language.
  3. "Promote language diversity in research assessment, evaluation, and funding systems", in line with third recommendation of the Leiden Manifesto.

In the wake of the Helsinki Initiative, multilingualism has been increasingly associated with open science. This trend was accelerated in the context of the COVID-19 pandemic, which "saw a widespread need for multilingual scholarly communication, not only between researchers, but to enable research to reach decision-makers, professionals and citizens". Multilingualism has also re-emerged as a topic of debate beyond the social sciences: in 2022, the Journal of Science Policy and Governance published a "Call to Diversify the Lingua Franca of Academic STEM Communities", which stressed that "cross-cultural solutions are necessary to prevent critical information from being missed by English-speaking researchers."

In November 2021, the UNESCO Recommendation on Open Science placed multilingualism at the core of its definition of open science: "For the purpose of this Recommendation, open science is defined as an inclusive construct that combines various movements and practices aiming to make multilingual scientific knowledge openly available, accessible and reusable for everyone".

In the early 2020s, the European Union started to officially support language diversity in science, as a continuation of its general policies in favor of multilingualism. In December 2021, an important report of the European Commission on the future of scientific assessment in European countries still overlooked the issue of linguistic diversity: "Multilingualism is the most notable omission". In June 2022, the Council of the European Union included a detailed recommendation on the "Development of multilingualism for European scholarly publications" in its conclusions on research assessment and open science. The declaration acknowledges the "important role of multilingualism in the context of science communication with society" and welcomes "initiatives to promote multilingualism, such as the Helsinki initiative on multilingualism in scholarly communication." While the declaration is not binding, it invites experimentation with multilingualism "on a voluntary basis" and an assessment of the need for further action by the end of 2023.

Notes and References

  1. "Helsinki Initiative on Multilingualism in Scholarly Communication", Helsinki Initiative on Multilingualism.
  2. "Baromètre de la Science ouverte (général)" [Open Science Barometer (general)], Ministry of Higher Education, Research and Innovation (in French).
  3. "Language Dashboard", Open Journal Systems.
  4. "Signatories", Helsinki Initiative on Multilingualism.