Once upon a time in 2001, there was a small internet entity named Wikipedia, which aspired to be a free online encyclopedia that everyone could contribute to. Initially, Wikipedia was in English, but not long after, new Wikipedias emerged in other languages such as German, Catalan, Japanese, Hebrew and French. Wikipedia was not a small internet site anymore.
Like with humans, part of growing up is confronting challenges that you never knew existed. I will focus here on a rather uncomplicated challenge Wikipedia encountered: how do you link the same topic in different Wikipedias (that is, language editions)? The straightforward solution is fairly simple: if you have two articles about “banana”, one in English and the second in Hebrew, then connect them with some code in one article that points to the article in the other language. This is exactly what Wikipedia did until 2013 (don’t take my word on the exact date). They created a piece of markup called an interlanguage link that is part of the page’s code and looks like this: [[en:banana]]. Here “en” points to the Wikipedia edition (enwiki) and “banana” is the title of the connected page.
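To make the format concrete, here is a minimal sketch of how such links could be pulled out of a page's code. The function name, the sample text and the regular expression are my own assumptions for illustration (in particular, the language-code pattern is a simplification and not the rule the real wiki parser uses).

```python
import re

# Matches links of the form [[lang:Title]].
# Assumption: a language code is 2-3 lowercase letters, optionally followed
# by hyphenated parts. The real parser has its own list of valid codes.
INTERLANG_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_language_links(wikitext):
    """Return (language, title) pairs found in a page's wikitext."""
    return [(lang, title.strip()) for lang, title in INTERLANG_RE.findall(wikitext)]


# Made-up page code, only for illustration.
sample = "Some text about bananas.\n[[en:banana]]\n[[de:Banane]]"
print(extract_language_links(sample))
```

Each pair says which Wikipedia to go to and which page title to open there, which is all the old system needed.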
Actually, the example I used here can’t be found on Wikipedia. The Hebrew page for banana is connected to the English page “Musa (genus)” and not to the page “Banana”. The page “Banana” doesn’t have a link to Hebrew. This is an example of one of the wonders of Wikipedia: the Hebrew editors decided that the page would talk about the botanical genus and that the fruit doesn’t need a page of its own. The information about the fruit is not missing, it’s just one of the sections in the genus page. I think this is a great example of the differences between the languages (Elephant is another one), because we can learn a lot about the cultures behind these differences.
But I digress. Going back to the structure of the language links, it seems as if a link should be undirected, that is, if there is a link from he:Banana to en:Musa then there should also be a link from en:Musa to he:Banana (in mathematics, we call this property symmetry). If we take this one step further and take several languages into account, then the language links should also be transitive: if A links to B and B links to C, then A should link to C. Do the language links always fulfill these requirements? Usually they do… but sometimes, like in the diagram, links are more complicated than that (the diagram was taken from the article “Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia”; this should have been in a footnote but I don’t have a plugin for that yet). Notice that if we assume these two properties, “en=Ball” would have two links to the French Wikipedia!
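The two properties can be checked mechanically. The sketch below is only an illustration under my own assumptions: the link data is made up (it is not copied from the diagram), and the function names are hypothetical. It reports links that have no reverse link, then treats all links as undirected and transitive to see which pages end up in the same group and whether a language shows up twice there.

```python
from collections import defaultdict

# Made-up directed links: (language, title) -> set of (language, title).
links = {
    ("en", "Ball"): {("fr", "Balle")},
    ("fr", "Balle"): {("en", "Ball")},
    ("fr", "Ballon"): {("en", "Ball")},
}


def missing_reverse_links(links):
    """Return (source, target) pairs whose reverse link is absent (symmetry check)."""
    return sorted(
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src not in links.get(dst, set())
    )


def clusters(links):
    """Group pages as if every link were undirected and transitive."""
    neighbours = defaultdict(set)
    for src, targets in links.items():
        neighbours[src].update(targets)
        for dst in targets:
            neighbours[dst].add(src)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:  # depth-first walk over one connected group
            page = stack.pop()
            if page not in cluster:
                cluster.add(page)
                stack.extend(neighbours[page] - cluster)
        seen |= cluster
        result.append(cluster)
    return result


def conflicts(cluster):
    """Return the languages that appear more than once in a cluster."""
    by_lang = defaultdict(list)
    for lang, title in cluster:
        by_lang[lang].append(title)
    return {lang: sorted(titles) for lang, titles in by_lang.items() if len(titles) > 1}


print(missing_reverse_links(links))
for cluster in clusters(links):
    print(conflicts(cluster))
```

With this toy data the English page ends up grouped with two French pages, which is the kind of conflict described above.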
Another unexpected case is when a language link points to a section within an article and not to the article itself. I don’t have an example of that, but these things have been known to happen.
While taking these extreme scenarios into account, there are bots (that is, automated scripts) that are in charge of adding links between the languages. For example, if I just added a new article about banana in Hebrew and connected it to the English article, then I don’t need to go to all the other articles connected to it (in other languages); the bots will do it automatically. Therefore, an underlying assumption is that links are undirected. The downside of this process is that every article in Wikipedia has in its code the connections to all the other articles on the same topic. So if, for example, we have 5 articles that are connected to each other, then we have 20 (n*(n-1) ~ n^2) lines of code telling us about them. Sounds a bit redundant, especially when n is bigger. This is where Wikidata comes into play (Wikidata has so much more to offer, so if you don’t know the project, check it out). Since 2013 (again, I’m not sure when it started) Wikidata has had a page for every article in Wikipedia that assembles all of the structured data of the article. Part of the Wikidata page’s metadata is a list of the Wikipedias in which the article exists (and its title in each of these wikis). Therefore, instead of storing roughly n^2 links, only n entries are saved, one for each Wikipedia the article appears in. That is a big difference!
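The arithmetic is easy to play with. In this small sketch the function names are mine and the central record is just a plain dictionary with made-up titles, not Wikidata's actual schema; it only compares how many entries each approach has to store for n connected articles.

```python
def links_stored_in_pages(n):
    """Old approach: each of the n articles lists the other n - 1."""
    return n * (n - 1)


def links_stored_centrally(n):
    """Central approach: one shared record with one entry per Wikipedia."""
    return n


# A toy stand-in for the shared record: wiki name -> article title.
# Titles are illustrative only.
central_record = {"enwiki": "Banana", "dewiki": "Banane", "frwiki": "Banane"}

for n in (5, 50, 300):
    print(n, links_stored_in_pages(n), links_stored_centrally(n))
```

For n = 5 this gives the 20 links mentioned above against 5 entries, and the gap grows quickly as more languages join.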
Ok, now you know everything there is to know about the purpose of language links in Wikipedia! In the next post I will talk about how language links are represented in the database and, most importantly, how all of this is related to my project.