All about Language Links in Wikipedia – Part 3

In this third and last part of the series about language links in Wikipedia, I am going to describe my first real internship task!

By Alfred T. Palmer [Public domain], via Wikimedia Commons


We (Amir, my mentor, and I) wanted to calculate the following: for each pair of languages, how many of the articles that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN). We focused on articles in the main namespace (that is, “regular” articles about a topic, not talk pages, user pages, categories and such). I calculated this as the percentage of articles that exist in both languages out of the total number of articles in each language (e.g., Intersection(EN, FR)/Count(FR)); the total article counts were taken from here. We did this for all pairs of languages (287^2). This calculation is interesting because it gives us some sense of the connections between the wikis. It was also very valuable for Amir to have data to base his intuition on regarding which pairs of languages to target when extending their content translation product.
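
To make the normalization concrete, here is a minimal sketch of the calculation described above; the function name and the example numbers are purely illustrative placeholders, not real Wikipedia counts.

def overlap_share(intersection_count, target_article_count):
    """Share of the target wiki's articles that also exist in the source wiki,
    i.e. Intersection(source, target) / Count(target)."""
    return intersection_count / target_article_count

# Illustrative only: dividing the intersection by Count(FR) gives the share of
# frwiki covered by enwiki, while dividing the same intersection by Count(EN)
# gives a much smaller number, which is why the matrix is not symmetrical.
print(overlap_share(1_200_000, 1_600_000))  # 0.75 (placeholder values)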

As I mentioned in the previous post, although my first thought was to use the table that stores the language-link data from Wikidata, it turned out to be a problematic idea. The aforementioned Wikidata table had 42,628,276 rows! 41,588,970 of them contained Wikipedia language-link information. Because I wanted to calculate the co-occurrence of every two languages, it meant joining this table with itself, which would have resulted in roughly 10^15/2 rows. So while the query that returns this result is the cleanest, the data structure of the table is not built for these kinds of calculations. Therefore, I queried 287 tables (actually 2 tables, langlinks joined with page), one for each Wikipedia.

The basic code for each table is simple (here it is for English Wikipedia):

SELECT 'enwiki' as source, ll_lang as target, COUNT(*) as count FROM enwiki_p.langlinks
LEFT JOIN enwiki_p.page
ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang
The code counts the number of links the current language wiki has to all other languages.
A few things to notice:
  • The links are not symmetrical: there is, on average, less than one percent difference between the links from language A to B (outlinks from A, taken from that language's langlinks table) and the links from language B to A (inlinks to A, i.e., other wikis pointing at that language).
  • Using this method also enabled us to check whether there are irregularities in the data (mistakes, a disproportionate difference between inlinks and outlinks). We found that on average there was around a 1% difference between the counts; a minimal sketch of such a check appears right after this list.
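
To make the irregularity check concrete, here is a minimal sketch of how such a comparison could be done with pandas. It assumes the combined per-wiki results sit in a table with source, target and count columns; the file name and the 10% threshold are my own placeholders, not part of the original analysis.

import pandas as pd

# Placeholder file: the combined output of all 287 per-wiki queries,
# with columns source, target, count.
links = pd.read_csv("langlink_counts.csv")

# Pair each direction (outlinks A -> B) with the opposite one (B -> A).
merged = links.merge(
    links,
    left_on=["source", "target"],
    right_on=["target", "source"],
    suffixes=("_out", "_in"),
)
merged["relative_diff"] = (
    (merged["count_out"] - merged["count_in"]).abs() / merged["count_out"]
)

# Pairs whose in/out counts differ by more than, say, 10% deserve a closer look.
suspicious = merged[merged["relative_diff"] > 0.10]
print(suspicious[["source_out", "target_out", "count_out", "count_in"]])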

For the actual code I wrote a Python script that automatically generated the simple query 287 times, one for each language, and then united the results. The “complex” script returns three columns: source language, target language, and the count of the number of links that go from the source to the target. After getting that list, I used a pivot table in Excel to transform it into a matrix (an easy Python/R script can return the same result). Let's talk about the data (as represented by the matrix).
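
The script itself is not reproduced here, but a minimal sketch of the idea could look like this. The query template mirrors the enwiki query above; the list of database names and the pivot step are simplified placeholders (in practice the queries run against the Tool Labs replicas).

import pandas as pd

# Template for the per-wiki query; {db} is replaced by each database name.
QUERY_TEMPLATE = """
SELECT '{db}' AS source, ll_lang AS target, COUNT(*) AS count
FROM {db}_p.langlinks
LEFT JOIN {db}_p.page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang
"""

dbnames = ["enwiki", "frwiki", "hewiki"]  # ...and the rest of the 287 wikis
queries = [QUERY_TEMPLATE.format(db=db) for db in dbnames]

# After running each query and concatenating the results into one DataFrame:
# results = pd.concat(per_wiki_frames)
# matrix = results.pivot(index="source", columns="target", values="count").fillna(0)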

Results:

  1. The co-occurrence matrix of counts can be found in a Google Spreadsheet.
  2. The links are not symmetrical: there is, on average, less than one percent difference between the links from language A to B and the links from language B to A.
  3. Wikis with fewer than 3,500 links (meaning they have fewer than 100 articles) have, on average, more than 20% more outlinks than inlinks.
  4. As the number of langlinks gets bigger (and, in most cases, the size of the wiki), the difference and the variance between inlinks and outlinks get smaller.
  5. I did find that some outlinks pointed to mistakes (zh-cn, zh-tw, nn) and fixed that.
  6. There are small languages with a disproportionate number of linked pages compared to their number of articles. I think this is due to bot-generated articles that don't have internal links in the text, so they are not counted as articles.
  7. Arabic has 8% more inlinks than outlinks. There isn't one wiki that caused this difference, so it's not a bot.
  8. Let's dig deeper: a heat map of the number of links between wikis with more than 50,000 articles, ordered by size (sorry about the quality, it is the best I could do). The diagonal is zero, since there are no language links from a language to itself.

As I mentioned, the triangles are not symmetrical because the counts (which are themselves not equal, but close enough) are divided by the number of articles in each wiki. The heat map ranges from red (high level of congruence) to yellow (low level). The interesting data is found in the lower triangle, which shows the language links from small to large Wikipedias.

LangLink correlations
Insights from the heat map:
  1. Most languages have strong connections to English.
  2. There is a group of interconnected wikis centered on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau). The Wikipedia community suggested that these languages may have a lot of articles about species that were created by a bot (like LsjBot).
  3. Piedmontese (spoken by over 1 million people in Piedmont, northwest Italy) is highly interconnected with Wikipedias from the family of Romance languages, as is the Latin Wikipedia.
  4. Not surprisingly, Chechen is mostly connected to Russian.
  5. Telugu (a language spoken in India) doesn't have many interlinks, not even to English, Hindi, or Bengali.
  6. There are other visible strong connections (such as Serbian and Serbo-Croatian), but they are not surprising, since these are practically the same language.

I think the results are interesting, what do you think?

All about Language Links in Wikipedia – Part 1

Once upon a time in 2001, there was a small internet entity named Wikipedia which aspired to be a free online encyclopedia that everyone can contribute to. Initially, Wikipedia was in English, but not long after, new Wikipedias emerged in other languages like German, Catalan, Japanese, Hebrew, French, etc. Wikipedia was not a small internet site anymore.

Like with humans, part of growing up is confronting challenges that you never knew existed. I will focus here on a rather uncomplicated challenge Wikipedia encountered: how do you link the same topic in different Wikipedias (that is, language editions)? The straightforward solution is fairly simple: if you have two articles about “banana”, one in English and the second in Hebrew, then connect them with some code in one article that points to the article in the other language. This is exactly what Wikipedia did until 2013 (don't take my word on the exact date). They created interlanguage links that are part of the page's code and look like this: [[en:banana]]. You can see the “en” pointing to the Wikipedia edition (enwiki), and the word “banana” is the title of the connected page.

Actually, the example I used here can't be found on Wikipedia. The Hebrew page for banana is connected to the English page “Musa (genus)” and not to the page “Banana”. The page “Banana” doesn't have a link to Hebrew. This is an example of one of the wonders of Wikipedia: the Hebrew editors decided that the page will discuss the botanical genus and that the fruit doesn't need a page of its own. The information about the fruit is not missing, it's just one of the sections in the genus page. I think this is a great example of the differences between the languages (Elephant is another one), because we can learn a lot about the cultures that are the basis of these differences.

But I digressed. Going back to the structure of the language links, it seems as if a link should be undirected; that is, if there is a link from he:Banana to en:Musa then there should also be a link from en:Musa to he:Banana (in mathematics, we call this property symmetry). If we take this one step further and also take into account having several languages, then the language links should also be transitive. Do the language links always fulfill these requirements? Usually they do… but sometimes, like in the diagram, links are more complicated than that (the diagram was taken from the article “Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia”; this should have been in a footnote, but I don't have a plugin for that yet). Notice that if we assume these two properties, “en:Ball” would have two links to the French Wikipedia!

Another unpredicted case is when a language link points to a subsection of an article and not to the article itself. I don't have an example for that, but these things have been known to happen.

While taking into account these extreme scenarios, there are bots (that is, automated scripts) that are in charge of adding links between the languages. For example, if I just added a new article about banana in Hebrew and connected it to the English article, then I don't need to go to all the other articles connected to it (in other languages); the bots will do it automatically. Therefore, an underlying assumption is that links are undirected. The downside of this process is that every article in Wikipedia has in its code the connections to all the other articles. So if, for example, we have 5 articles that are connected to each other, then we have 20 (n*(n-1) ~ n^2) lines of code telling us about them. Sounds a bit redundant, especially as n gets bigger. This is where Wikidata comes into play (Wikidata has so much more to offer, so if you don't know the project, check it out). Since 2013 (again, I'm not sure when it started) Wikidata has a page about every article in Wikipedia that assembles all of the structured data of the article. One of the Wikidata page's metadata fields is a list of the Wikipedias in which the article exists (and its title in those wikis). Therefore, instead of representing the links n^2 times, it only saves them n times, once for each Wikipedia the article appears in. That is a big difference!
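
To make the bookkeeping difference concrete, here is a toy sketch of the two representations; the titles are hypothetical placeholders, and this is not how Wikidata is actually accessed.

# Toy illustration of the n^2 vs. n bookkeeping described above.
languages = ["en", "he", "fr", "de", "es"]  # 5 editions with the same article

# Old model: every article stores a link to every other edition.
pairwise_links = [(a, b) for a in languages for b in languages if a != b]
print(len(pairwise_links))  # 20 == n * (n - 1)

# Wikidata model: one central item stores a single sitelink per edition.
sitelinks = {lang: f"title of the article on {lang}wiki" for lang in languages}
print(len(sitelinks))  # 5 == n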

Ok, now you know everything there is to know about the purpose of language links in Wikipedia! In the following post I will talk about language links representation in the database and most importantly, how all of this is related to my project.

My Open Program for Women Internship – Finally Writing

You know how you have your hypothetical amazing version of yourself? For me, in that version I always have time for everything; classes, school chores, doing research, friends, partner in life, family and well, life. I counted, I need to be studying/working at least 11 hours a day to make ends meet, not counting all the other activities I want to do. So this is my excuse for not writing how my internship is going – I just didn’t get around to it.

So I’m Neta and I’m never on time. Next.


I’m doing a really cool project with the international team of the Wikimedia Foundation. We want to try and estimate the fraction of translated articles in a given Wikipedia versus originally written ones. It turns out that this number is unknown, because when someone starts a new article, they usually don’t indicate whether it is a new original work or was translated from another wiki. Moreover, even if they do, there isn’t a field in the database that saves it.

Why is it interesting? First and foremost, it gives information about the editors’ behavior and how they create new pages. For example, finding that 25% of the Hebrew Wikipedia is translated from English means a totally different thing than finding that only 5% is translated, because it would highlight how dependent Hebrew is upon English. Moreover, it could show the Wikimedia Foundation that they should put more emphasis on the content translation tool from English to Hebrew in order to increase the number of articles in Hebrew (try it here, it’s new and shiny!).

Another reason why it is interesting lies in the fact that editors on Wikipedia represent, though not fully, their culture. Whether an article is translated, and from which language, tells us something about the dependencies and relationships between these cultures. Furthermore, we can try to divide our results into categories and see if different classes of articles are drawn from different cultures. In fact, it may well be the case that different cultures “specialize” in different content.

Unlike most projects in the program, my project is a research-based one, where the task itself, the method I should use to solve it, and the end point are unknown. I don’t even know if it is feasible or whether we will get meaningful results… but that’s research. Exciting!


As more than half of the internship has passed, I think that it is a good time to go back to the start. The beginning of my quest was not easy and involved a lot of searching and reading and not a lot of code. My first goal was to learn the wiki world from different angles while emphasizing content related to translations:

  • Communication: Learning to communicate with other editors on the village pump and on my talk page (it’s a weird concept, having my own billboard where people can write me messages and everyone can read it, though only hypothetically, because my talk page is not that interesting).
  • Structure (data): Understanding the schema and tables of the Wikipedia data.
  • Content: Getting lost, in a good way, in Wikipedia in order to stumble upon relevant and interesting things related to translation. For example, did you know that the Hebrew Wikipedia has a page specifically for translation dos and don’ts?
  • Accessing the data: Getting to know the mysterious ways of the dumps, the Wikipedia API, SSH, and the query tool that is part of Wikimedia Tool Labs (a minimal sketch of calling the API appears right after this list). Here is where I got drawn into trying to understand Wikidata, which is the central storage of the structured data of all the wiki projects. I failed at that… (Amir, my mentor, asked me to blog about my frustrations, so I will expand on this in a different post).
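
Since the Wikipedia API comes up again in the context of language links, here is a minimal sketch of what such a call could look like; the article title is just an example, and the script only prints the language links recorded on a single page.

import requests

# List the interlanguage links recorded on an English Wikipedia article.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "langlinks",
        "titles": "Banana",
        "lllimit": "500",
        "format": "json",
        "formatversion": "2",
    },
)
page = resp.json()["query"]["pages"][0]
for link in page.get("langlinks", []):
    print(link["lang"], link["title"])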

All in all, I think that the decision to learn before starting to code was a good one. I actually feel that I now know what I’m talking about and what I need to do. That’s not to say that I don’t feel lost sometimes or don’t still have things to learn (SSH and dumps, I’m looking at you), but it does mean that when I wake up in the morning with a list of tasks for the day, I actually know how to start working on them, at least most of the time.

Next time I will tell you all about working with language links, which became my first programming assignment.