In this third and last part of the series about language links in Wikipedia, I am going to describe my first real internship task!
We (my mentor Amir and I) wanted to calculate the following: for each pair of languages, how many of the articles that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN). We focused on articles in the main namespace (that is, “regular articles” about a topic, not talk pages, user pages, categories and such). I calculated this as the number of articles that exist in both languages, as a percentage of the total number of articles in each of the languages (e.g., Intersection(EN, FR)/Count(FR)). The article counts were taken from here. We did this for all pairs of languages (287^2). This calculation is interesting because it gives us some sense of the connections between the Wikis. It was also very valuable for Amir to have data on which to base his intuition about which pairs of languages to target with an extension of their content translation product.
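The Intersection(EN, FR)/Count(FR) calculation can be sketched in a few lines of Python. This is only an illustration: the function name `overlap_share`, the dictionary layout, and all of the numbers are made up for the example and are not taken from the real data.

```python
def overlap_share(shared_counts, article_counts):
    """Return {(source, target): fraction of source's articles that also exist in target}.

    shared_counts maps (source, target) -> number of articles that exist in both.
    article_counts maps a language code -> total number of articles in that Wikipedia.
    """
    shares = {}
    for (source, target), shared in shared_counts.items():
        total = article_counts[source]
        # Guard against an empty Wikipedia to avoid dividing by zero.
        shares[(source, target)] = shared / total if total else 0.0
    return shares


# Made-up toy numbers, purely for illustration.
article_counts = {"en": 1000, "fr": 400}
shared_counts = {("en", "fr"): 300, ("fr", "en"): 300}

shares = overlap_share(shared_counts, article_counts)
print(shares[("fr", "en")])  # Intersection(EN, FR) / Count(FR)
print(shares[("en", "fr")])  # Intersection(EN, FR) / Count(EN)
```

Even with the same intersection in both directions, the two shares differ because each one is divided by a different article count.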
As I mentioned in the previous post, although at first I wanted to use the table that stores the language data from Wikidata, it turned out to be a problematic idea. The aforementioned Wikidata table has 42,628,276 rows! 41,588,970 of them included Wikipedia language link information. Because I wanted to calculate the co-occurrence of every two languages, I would have had to join this table with itself, which would have resulted in 10^15/2 rows. So while the query that returns this result is the cleanest, the data structure of the table is not built for these kinds of calculations. Therefore, I queried 287 tables (actually two tables per Wikipedia, langlinks joined with page), one for each Wikipedia.
The basic code for each table is simple (here it is for English Wikipedia):
    SELECT 'enwiki' AS source, ll_lang AS target, COUNT(*) AS count
    FROM enwiki_p.langlinks
    LEFT JOIN enwiki_p.page ON page_id = ll_from
    WHERE page_namespace = 0
    GROUP BY ll_lang
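Since only the database name changes from one Wikipedia to the next, the query above is easy to template. The following is a minimal sketch, not the script I actually used: the names `QUERY_TEMPLATE`, `build_union_query` and `pivot` are hypothetical, and it assumes that every database follows the same naming pattern as `enwiki_p`. It only builds query strings and reshapes rows; it does not connect to a database, and whether a single `UNION ALL` can run across all the databases depends on how they are hosted (running the queries one by one and concatenating the rows works as well).

```python
QUERY_TEMPLATE = (
    "SELECT '{db}' AS source, ll_lang AS target, COUNT(*) AS count "
    "FROM {db}_p.langlinks "
    "LEFT JOIN {db}_p.page ON page_id = ll_from "
    "WHERE page_namespace = 0 "
    "GROUP BY ll_lang"
)


def build_union_query(databases):
    """Build one query per database name and glue them together with UNION ALL."""
    return "\nUNION ALL\n".join(QUERY_TEMPLATE.format(db=db) for db in databases)


def pivot(rows):
    """Turn (source, target, count) rows into a nested dict: matrix[source][target]."""
    matrix = {}
    for source, target, count in rows:
        matrix.setdefault(source, {})[target] = count
    return matrix


# Hypothetical database names and made-up result rows, for illustration only.
query = build_union_query(["enwiki", "frwiki"])
print(query)

rows = [("enwiki", "fr", 300), ("enwiki", "de", 250), ("frwiki", "en", 298)]
matrix = pivot(rows)
print(matrix["enwiki"]["fr"])
```

The nested dictionary plays the role of the source-by-target matrix; pairs that never appear in the rows are simply missing rather than stored as zeros.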
- The links are not symmetrical. On average, there is less than one percent difference between the links from language A to B (outlinks from A, taken from that language’s langlinks table) and the links from language B to A (inlinks to A, from other Wikis pointing at that language).
- Using this method also enabled us to check whether there are irregularities in the data (like mistakes, or a disproportionate difference between inlinks and outlinks). We found that, on average, there was around a 1% difference between the number of links.
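The inlink/outlink comparison described in the two points above can be sketched as follows. This is an illustration under assumptions: the function name `link_asymmetry` and the input layout are hypothetical, the numbers are made up, and the relative difference is defined here as (outlinks - inlinks) / inlinks, which is only one reasonable choice.

```python
def link_asymmetry(counts):
    """Compare total outlinks and inlinks per wiki.

    counts maps (source, target) -> number of langlinks from source to target.
    Returns {wiki: (outlinks, inlinks, relative_difference)}, where the relative
    difference is (outlinks - inlinks) / inlinks, or None when there are no inlinks.
    """
    outlinks, inlinks = {}, {}
    for (source, target), n in counts.items():
        outlinks[source] = outlinks.get(source, 0) + n
        inlinks[target] = inlinks.get(target, 0) + n

    result = {}
    for wiki in sorted(set(outlinks) | set(inlinks)):
        out_n = outlinks.get(wiki, 0)
        in_n = inlinks.get(wiki, 0)
        rel = (out_n - in_n) / in_n if in_n else None
        result[wiki] = (out_n, in_n, rel)
    return result


# Made-up toy counts, purely for illustration.
counts = {
    ("en", "fr"): 100, ("fr", "en"): 98,
    ("en", "de"): 50, ("de", "en"): 52,
    ("fr", "de"): 40, ("de", "fr"): 40,
}

for wiki, (out_n, in_n, rel) in link_asymmetry(counts).items():
    print(wiki, out_n, in_n, rel)
```

A wiki whose relative difference is far from zero is a candidate for a closer look, for example for mistaken language codes.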
For the actual code, I wrote a Python script that automatically generates the simple query 287 times, once for each language, and then unites the results. The “complex” script returns three columns: source language, target language, and the number of links that go from the source to the target. After getting that list, I used a pivot table in Excel to transform it into a matrix (a simple Python/R script can produce the same result). Let’s talk about the data (as represented by the matrix).
- The co-existence matrix of counts can be found in a Google Spreadsheet.
- As noted above, the links are not symmetrical: on average, there is less than one percent difference between the links from language A to B and the links from language B to A.
- Wikis with fewer than 3,500 links (that is, Wikis that have fewer than 100 articles) have on average over 20% more outlinks than inlinks.
- As the number of langlinks grows (and, in most cases, the size of the Wiki with it), the difference and the variance between the inlinks and outlinks get smaller.
- I did find that some outlinks pointed to erroneous targets (zh-cn, zh-tw, nn), and I fixed that.
- There are small languages with a disproportionate number of linked pages compared to their number of articles. This is due (I think) to bot-generated articles that don’t have interlinks in the text, so they are not counted as articles.
- Arabic has 8% more inlinks than outlinks. There isn’t one Wiki that caused this difference, so it’s not a bot.
- Let’s dig deeper: a heat map of the number of links between Wikis with more than 50,000 articles, ordered by size (sorry about the quality, it is the best I could do). The diagonal is zero, because there are no language links from a language to itself.
As I mentioned, the triangles are not symmetrical because the counts (which are themselves not equal, but close enough) are divided by the number of articles in each Wiki. The heat map ranges from red (high level of congruence) to yellow (low level). The interesting data is found in the lower triangle, which shows the language links from small to large Wikipedias.
- Most languages have strong connections to English.
- There is a group of interconnected Wikis that are based on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau). The Wikipedia community suggested that it is possible these languages have articles about a lot of species that were created by a bot (like Lsjbot).
- Piedmontese (spoken by over 1 million people in Piedmont, northwest Italy) is highly interconnected with Wikipedias from the family of Romance languages, as is the Latin Wikipedia.
- Not surprisingly, Chechen is mostly connected to Russian.
- Telugu (a language spoken in India) doesn’t have many interlinks, not even to English, Hindi or Bengali.
- There are other visible strong connections (such as between Serbian and Serbo-Croatian), but they are not surprising, as these are practically the same language.
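Observations like the ones in this list can also be surfaced programmatically instead of by reading the heat map. Here is a minimal sketch, assuming the normalized shares are stored as a dictionary keyed by (source, target); the function name `strongest_connection`, the placeholder language codes and the numbers are all made up and are not results from the real matrix.

```python
def strongest_connection(shares):
    """For each source wiki, return the (target, share) pair with the highest share.

    shares maps (source, target) -> fraction of source's articles that also
    exist in target. Self-links are skipped, and ties keep the first target seen.
    """
    best = {}
    for (source, target), share in shares.items():
        if source == target:
            continue
        if source not in best or share > best[source][1]:
            best[source] = (target, share)
    return best


# Placeholder codes and made-up shares, purely for illustration.
shares = {
    ("aa", "bb"): 0.6, ("aa", "cc"): 0.4,
    ("bb", "aa"): 0.2, ("bb", "cc"): 0.7,
    ("cc", "cc"): 1.0, ("cc", "aa"): 0.1,
}

for source, (target, share) in sorted(strongest_connection(shares).items()):
    print(source, "->", target, share)
```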
I think the results are interesting. What do you think?