All about Language Links in Wikipedia – Part 3

In this third and last part of the series about language links in Wikipedia, I am going to describe my first real internship task!

By Alfred T. Palmer [Public domain], via Wikimedia Commons

We (Amir, my mentor, and I) wanted to calculate the following: for each pair of languages, how many of the articles that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN). We focused on articles in the main namespace (that is, “regular articles” about a topic, not talk pages, user pages, categories and such). I calculated this as the percentage of articles that exist in both languages out of the total number of articles in each of the languages, e.g., Intersection(EN, FR)/Count(FR); the total number of articles was taken from here. We did this for all pairs of languages (287^2). This calculation is interesting because it gives us some sense of the connections between the wikis. It was also very valuable for Amir to have data to base his intuition on regarding which pairs of languages to target with the Content Translation product.
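
To make the normalization concrete, here is a toy example in Python (the numbers are invented for illustration, not real counts):

# Toy example of the normalization described above; the counts are made up.
intersection_en_fr = 1_200_000  # articles that exist in both EN and FR
count_fr = 1_600_000            # total number of FR main-namespace articles

share = intersection_en_fr / count_fr
print(f"{share:.0%} of FR articles also exist in EN")  # prints: 75% of FR articles also exist in EN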

As I mentioned in the previous post, although my first thought was to use the table that stores the language links data from Wikidata, it turned out to be a problematic idea. The aforementioned Wikidata table has 42,628,276 rows! 41,588,970 of them contain Wikipedia language link information. Because I wanted to calculate the co-occurrence of every two languages, it meant joining this table with itself, which would have resulted in roughly 10^15/2 rows. So while the query that returns this result is the cleanest, the data structure of the table is not built for these kinds of calculations. Therefore, I queried 287 tables (actually two tables, langlinks joined with page), one for each Wikipedia.

The basic code for each table is simple (here it is for English Wikipedia):

-- For the current wiki (here English), count how many of its
-- main-namespace articles link to each other language.
SELECT 'enwiki' AS source, ll_lang AS target, COUNT(*) AS count
FROM enwiki_p.langlinks
JOIN enwiki_p.page
  ON page_id = ll_from
WHERE page_namespace = 0  -- main namespace only ("regular" articles)
GROUP BY ll_lang;
The code counts the number of links the current language wiki has to all other languages.
A few things to notice:
  • The links are not symmetrical: there is, on average, less than a one percent difference between the links from language A to B (outlinks from A, taken from that language's langlinks table) and the links from B to A (inlinks to A, i.e., other wikis pointing at that language).
  • This method also let us check whether there are irregularities in the data (like mistakes, or a disproportionate difference between inlinks and outlinks); as noted, the difference was only around 1% on average.

For the actual run, I wrote a Python script that automatically generated the simple query 287 times, one for each language, and then united the results. The combined script returns three columns: source language, target language, and the count of links that go from the source to the target. After getting that list, I used a pivot table in Excel to transform it into a matrix (an easy Python/R script can return the same result; see the sketch below). Let's talk about the data (as represented by the matrix).
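
A minimal sketch of what that generate-and-pivot flow could look like, with pandas standing in for the Excel step (the file names and the shortened wiki list are placeholders, not my actual script):

# Generate the per-wiki query 287 times; "wikis" would list every
# database prefix, and the file names here are placeholders.
QUERY_TEMPLATE = """
SELECT '{wiki}' AS source, ll_lang AS target, COUNT(*) AS count
FROM {wiki}_p.langlinks
JOIN {wiki}_p.page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang;
"""

wikis = ["enwiki", "frwiki", "dewiki"]  # ...and so on, 287 in total

with open("all_langlinks.sql", "w") as sql_file:
    for wiki in wikis:
        sql_file.write(QUERY_TEMPLATE.format(wiki=wiki))

# Pivot the united (source, target, count) rows into a matrix.
import pandas as pd

results = pd.read_csv("langlink_counts.csv")  # columns: source, target, count
matrix = results.pivot(index="source", columns="target", values="count").fillna(0)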

Results:

  1. The co-occurrence matrix of counts can be found in a Google Spreadsheet.
  2. The links are not symmetrical: there is, on average, less than a one percent difference between the links from language A to B compared to language B to A.
  3. Wikis with fewer than 3,500 links (which means they have fewer than 100 articles) have, on average, more than 20% more outlinks than inlinks.
  4. As the number of langlinks gets bigger (and, in most cases, so does the size of the wiki), the difference and the variance between inlinks and outlinks get smaller.
  5. I did find that some outlinks pointed to mistakes (zh-cn, zh-tw, nn) and fixed that.
  6. There are small languages with a disproportionate number of linked pages compared to their number of articles. This is due (I think) to bot-generated articles that don't have internal links in the text, so they are not counted as articles.
  7. Arabic has 8% more inlinks than outlinks. There isn't one wiki that causes this difference, so it's not a bot.
  8. Let's dig deeper: a heat map of the number of links between wikis with more than 50,000 articles, ordered by size (sorry about the quality, it is the best I could do). The diagonal is zero, since it would count the language links from a language to itself.

As I mentioned, the triangles are not symmetrical because the counts (which are themselves not equal, but close enough) are divided by the number of articles in each wiki. The heat map goes from red (high level of congruence) to yellow (low level). The interesting data is found in the lower triangle, which shows the language links from small to large Wikipedias.
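
For completeness, here is a rough sketch of how such a normalized heat map could be produced from the count matrix above ("matrix" is the pivoted table from the earlier sketch, and "sizes" is a hypothetical pandas Series holding the article count of each wiki):

import matplotlib.pyplot as plt

# Keep only wikis with more than 50,000 articles, then divide each row
# by the source wiki's article count to get the congruence levels.
large = sizes[sizes > 50_000].index
congruence = matrix.loc[large, large].div(sizes[large], axis=0)

plt.imshow(congruence, cmap="YlOrRd")  # yellow = low congruence, red = high
plt.colorbar(label="share of source articles that exist in the target wiki")
plt.show()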

LangLink correlations
Insights from the heat map:
  1. Most languages have strong connections to English.
  2. There is a group of interconnected wikis that are based on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau). The Wikipedia community suggested that these languages may have many articles about species that were created by a bot (like LsjBot).
  3. Piedmontese (spoken by over 1 million people in Piedmont, northwest Italy) is highly interconnected with Wikipedias from the family of Romance languages, as is the Latin Wikipedia.
  4. Not surprisingly, Chechen is mostly connected to Russian.
  5. Telugu (a language spoken in India) doesn't have many interlinks, not even to English, Hindi, or Bengali.
  6. There are other visible strong connections (such as Serbian and Serbo-Croatian), but they are not surprising, as they are practically the same language.

I think the results are interesting; what do you think?

All about Language Links in Wikipedia – Part 2

In part 1 I talked about how language links are organized in Wikipedia (as part of the article's code) and in Wikidata. Here I'm going to talk about how the data is stored, and get some frustration out of my system. In the next post, I will talk about my first task and its results.

Imagine that we want to use the data about the connections between the pages, or any other data in Wikipedia for that matter. It would be a real pain to check the script of every article to get that information, and anyway, these scripts have to be saved and backed up somehow (and somewhere). To make the information easier to use and store, Wikipedia (and Wikidata) use structured databases. There are tables for page information, revisions, users, language links, text, and other functions (here is the full list of tables). However, there isn't one database that combines the information from the different Wikipedias; each wiki has its own database. There are three ways to reach this information: downloading the dumps (XML or SQL); working with the database via ssh (access to Wikimedia Tool Labs is required) or by querying it directly with the Quarry web tool (which only requires a Wikipedia user account); or using the API, either directly or with a designated package like pywikibot. Choosing the right tool depends on the specific question and the amount of data that is needed to answer it.
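
As a taste of the API route, here is a small sketch that asks the English Wikipedia API for the language links of a single page (the page title is only an example):

import requests

# Ask the MediaWiki API for all language links of one page.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Banana",
        "prop": "langlinks",
        "lllimit": "max",
        "format": "json",
    },
)
for page in resp.json()["query"]["pages"].values():
    for link in page.get("langlinks", []):
        print(link["lang"], link["*"])  # language code and title in that language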

Although this is not the main focus of this post, I was really frustrated to learn that there are only two, far from perfect, ways to extract text from pages, so I want to lay out the options. If you need a relatively low number of different texts, then the best way is to call the relevant wiki's API and request the text. If, on the other hand, you need many more samples (or, taken to the extreme, the text of every Wikipedia page), then your only option is to use the XML dumps. I think the biggest difference between these methods is that the API is permitted to get into the database, take what is needed and then leave, while the dumps work in an all-or-nothing way. To find one page, you need to go over all the pages in the dump, which usually means all the pages in Wikipedia, or at minimum all of the new pages from that day or month. Either way, it's a lot. Fortunately, this limitation is only true for text.
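
For the first option, a minimal sketch of requesting one page's wikitext through the API could look like this (again, the title is just an example):

import requests

# Fetch the raw wikitext of a single page via the parse module.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "parse",
        "page": "Banana",
        "prop": "wikitext",
        "format": "json",
    },
)
wikitext = resp.json()["parse"]["wikitext"]["*"]
print(wikitext[:200])  # first 200 characters of the page source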

Back to language links. Luckily, there is a table named langlinks that stores the links that typically appear in the left margin. The table is quite simple, having three columns: page id (ll_from), language of the link (ll_lang), and title of the corresponding page in the other language (ll_title). Therefore, it is perfect for querying. The only downside, which I have already mentioned, is that every wiki has its own langlinks table. This data structure is not too bad, as we can call different databases in the same query, yet it's still annoying. If you remember, I talked about Wikidata, which is supposed to have a table that unites all of the language links of the connected pages. I thought it would be Wikidata's langlinks table. I was wrong. It turns out Wikidata has its own special tables, in addition to the regular wiki tables, that store its data. The language links are stored in the table wb_items_per_site, which is also a simple table. However, there are several problems with it (here comes another frustration). First, and that's not the table's fault, it was impossible to find: the Wikidata documentation doesn't mention these special tables because, as it happens, they are part of Wikibase, the software behind Wikidata (in hindsight, I guess that makes sense). Second, the documentation of the tables is just terrible; I managed to understand it only after looking at the table's content. Last, it isn't organized correctly for the problem I wanted to solve.
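
To make the langlinks table concrete, here is a hedged sketch of looking up the language links of a single article on the database replicas (the host name and credentials file follow the old Tool Labs conventions and may well differ in your setup):

import pymysql

# Connect to the enwiki replica; host and credentials are placeholders.
conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
                       read_default_file="~/.my.cnf")
with conn.cursor() as cur:
    # Join langlinks with page to go from a title to its outgoing links.
    cur.execute("""
        SELECT ll_lang, ll_title
        FROM langlinks
        JOIN page ON page_id = ll_from
        WHERE page_title = %s AND page_namespace = 0
    """, ("Banana",))
    for lang, title in cur.fetchall():
        print(lang, title)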

Ok, finally I think I have nothing more to add about the structure, so I can move on to talking about why language links are so important to my task (and why I have already spent endless words on the subject).

All about Language Links in Wikipedia – Part 1

Once upon a time in 2001, there was a small internet entity named Wikipedia, which aspired to be a free online encyclopedia that everyone can contribute to. Initially, Wikipedia was in English, but not long after, new Wikipedias emerged in other languages like German, Catalan, Japanese, Hebrew, French, etc. Wikipedia was not a small internet site anymore.

Like with humans, part of growing up is confronting challenges you never knew existed. I will focus here on a rather uncomplicated challenge Wikipedia encountered: how do you link the same topic across different Wikipedias (that is, language editions)? The straightforward solution is fairly simple: if you have two articles about “banana”, one in English and the second in Hebrew, then connect them with some code in one article that points to the article in the other language. This is exactly what Wikipedia did until 2013 (don't take my word on the exact date). They created interlanguage links that are part of the page's code and look like this: [[en:banana]]. You can see the “en” is pointing to the Wikipedia version (enwiki) and the word “banana” is the title of the connected page.

Actually, the example I used here can't be found on Wikipedia. The Hebrew page for banana is connected to the English page “Musa (genus)” and not to the page “Banana”. The page “Banana” doesn't have a link to Hebrew. This is an example of one of the wonders of Wikipedia – the Hebrew editors decided that the page would talk about the botanical genus and that the fruit doesn't need a page of its own. The information about the fruit is not missing; it's just one of the sections in the genus page. I think this is a great example of the differences between the languages (Elephant is another one), because we can learn a lot about the cultures that underlie these differences.

But I digressed. Going back to the structure of the language links, it seems as if a link should be undirected; that is, if there is a link from he:Banana to en:Musa then there should also be a link from en:Musa to he:Banana (in mathematics, we call such a relation symmetric). If we take this property one step farther and take into account having several languages, then the language links should also form a transitive relation. Do the language links always fulfill these requirements? Usually they do… but sometimes, like in the diagram, links are more complicated than that (the diagram was taken from the article “Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia” – this should have been in a footnote, but I don't have a plugin for that yet). Notice that if we assume these two properties, “en:Ball” would have two links to the French Wikipedia!

Another unpredicted case is when a language link points to a subsection of an article and not to the article itself. I don't have an example of that, but these things have been known to happen.

While taking into account these extreme scenarios, there are bots (that is, automated scripts) that are in charge of adding links between the languages. For example, if I just added a new article about banana in Hebrew and connected it to the English article, then I don't need to go to all the other articles connected to it (in other languages); the bots will do it automatically. Therefore, an underlying assumption is that links are undirected. The downside of this process is that every article in Wikipedia has in its code the connections to all the other articles. So if, for example, we have 5 articles that are connected to each other, then we have 20 (n*(n-1) ~ n^2) lines of code telling us about them. That sounds a bit redundant, especially when n gets bigger. This is where Wikidata comes into play (Wikidata has so much more to offer, so if you don't know the project, check it out). Since 2013 (again, I'm not sure exactly when it started), Wikidata has a page about every article in Wikipedia that assembles all of the article's structured data. One piece of the Wikidata page's metadata is the list of Wikipedias in which the article exists (and its titles in those wikis). Therefore, instead of representing the links n^2 times, it only saves them n times, once for each Wikipedia the article appears in. That is a big difference!

Ok, now you know everything there is to know about the purpose of language links in Wikipedia! In the following post I will talk about how language links are represented in the database and, most importantly, how all of this is related to my project.