Internship Summary, How Time Flies

It turns out that the official time of the internship has passed! Actually, the internship ended a month ago, but I kept working on my project, something I plan to continue doing for the next couple of months.

So far, I have managed to build a prototype for my project. That is, I designed a few functions that read a specific revision’s text, discussion page, or edit summaries, and search for characteristics (using mostly regular expressions) that indicate that the page was translated from another language. At the moment, I’m still working on detecting Hebrew pages that were translated from English. I’m planning to build a few more functions for better detection; the most important one will compare the interlinks within the Hebrew revision to those of its English counterpart. The interlinks themselves are connected through the langlinks table, so I can use this information to “understand” the text without translating the words.
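To give a feel for what these detection functions look like, here is a minimal sketch of a regex-based detector; the marker phrases and the function name are simplified, illustrative examples rather than my actual patterns:

import re

# Hypothetical marker phrases, just to illustrate the approach; a real
# detector would use a larger, hand-tuned set of patterns.
TRANSLATION_MARKERS = [
    r"translated from the English Wikipedia",
    r"\btranslation of\b",
    r"תורגם מוויקיפדיה האנגלית",  # "translated from the English Wikipedia", in Hebrew
]
MARKER_RE = re.compile("|".join(TRANSLATION_MARKERS), re.IGNORECASE)

def looks_translated(revision_text, edit_summary=""):
    """Return True if the revision text or the edit summary contains a translation marker."""
    return bool(MARKER_RE.search(revision_text) or MARKER_RE.search(edit_summary))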

Before focusing on these functions, I built a few tools that filter the pages that are potentially translations. By querying the database, I extracted pages (by their title and id) that were created after their English counterpart. Then, using another query, I also found the last English version that could have been used as the basis for the translation (i.e., the last English version before the Hebrew page was created). I’m using two different methods to find the Hebrew version. The “dirty” method, also based on a query, returns the last version created by the user who created the page, that is, up until another editor (ignoring bots) edited the page. I consider this dirty because sometimes the query returns a version dated a few months after the creation of the page, and sometimes it returns a version that is too immature, where the author is still working on the page to bring it to a first steady version. The “clean” and more complicated method will use the data from the dumps (that is, all of it). We wish to “compress” the data and treat a series of edits by the same editor as one large edit, saving the starting time, the ending time, all of the edit ids that underlie the merged edit, and the number of bytes at its start and end. The dumps are organized serially, so this is straightforward to do. That way, it will be easy to filter out bots and small, meaningless edits, and also to detect the right version for the translation comparison. The merged edit version is not perfect yet, as we still need to add a time threshold to make sure that edits that are far apart do not end up in the same merged edit. As you can understand, this is not an easy task, at least not for me. A small sketch of the compression idea appears below.
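To make the “compression” idea concrete, here is a minimal sketch that assumes each revision is a dict with user, timestamp, id and size fields (the field names and the one-day gap threshold are my own illustrative choices):

from datetime import timedelta

MAX_GAP = timedelta(days=1)  # illustrative threshold between edits in one "session"

def compress_revisions(revisions):
    """Merge consecutive revisions by the same editor into one large edit.

    `revisions` is assumed to be a chronologically ordered list of dicts with
    'user', 'timestamp' (datetime), 'id', and 'size' (bytes) keys.
    """
    merged = []
    for rev in revisions:
        last = merged[-1] if merged else None
        if (last is not None
                and rev["user"] == last["user"]
                and rev["timestamp"] - last["end"] <= MAX_GAP):
            # Same editor, close in time: extend the current merged edit.
            last["end"] = rev["timestamp"]
            last["rev_ids"].append(rev["id"])
            last["end_size"] = rev["size"]
        else:
            # A new editor (or a long pause): start a new merged edit.
            merged.append({
                "user": rev["user"],
                "start": rev["timestamp"],
                "end": rev["timestamp"],
                "rev_ids": [rev["id"]],
                "start_size": rev["size"],
                "end_size": rev["size"],
            })
    return merged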

My biggest bottleneck, which I have been working on for a while now, is building the infrastructure that can support the data structure and the Wikipedia text: specifically, a script that reads all of the information from the dumps and saves only the relevant revisions and discussion versions in an easy-to-use way. I hope to have something ready in the next week (no promises), and then I will be able to start working with the whole dataset! I’m working on fully understanding the best way to build this script, as it should also run on the English Wikipedia, and that is not a negligible amount of data (unlike the Hebrew one). So this is where I’m at.
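For the curious, the general shape of such a script is a single streaming pass over the XML dump that keeps only what is needed; this is only a sketch of the approach, and the namespace URI and the filtering predicate are placeholders to adapt:

import xml.etree.ElementTree as ET

# The export namespace version varies between dumps; adjust it to match your file.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def relevant_revisions(dump_path, keep_page):
    """Stream a MediaWiki XML dump and yield (title, revision element) pairs
    for pages that pass the `keep_page` predicate, freeing memory as we go."""
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if keep_page(title):
                for rev in elem.findall(NS + "revision"):
                    yield title, rev
            elem.clear()  # drop the processed page so memory stays bounded

# Example usage: keep only pages whose titles are in a precomputed candidate set.
# candidates = {"Some candidate title"}
# for title, rev in relevant_revisions("hewiki-pages-meta-history.xml", candidates.__contains__):
#     print(title, rev.findtext(NS + "id"))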

I hope I will continue writing and posting my progress here as well as on GitHub.

This summary post is also a great opportunity to say a few words about the internship. I had a great experience overall, and I think being part of the GNOME program is an amazing way to get to know the open source world. Specifically, having a mentor that I could regularly nudge and talk to, who also directed me and taught me how to navigate this rich and complex world, was the most important part of the experience. It was also helpful to know that the learning process I was going through, and the frustrated feelings that sometimes arose, were not exclusively mine. Nevertheless, in hindsight, I think that I could have used the supportive network the platform offers (emails, the IRC channel, the Phabricator page and other mediums) more than I did. Of course, this is just another thing that I still need to learn. I also feel really lucky to have been given the opportunity to work on a project that allowed me to do things that were, at the beginning, beyond my expertise, in order to learn and improve my skills.

As I wrote previously, this is just the middle of the project, and I hope that in a month I will have a new post updating you about our new and exciting developments.

More information and updates can be found here:

All about Language Links in Wikipedia – Part 3

In this third and last part of the series about language links in Wikipedia, I am going to describe my first real internship task!

By Alfred T. Palmer [Public domain], via Wikimedia Commons

We (Amir, my mentor, and myself) wanted to calculate the following: for each language, how many of the articles that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN). We focused on articles in the main namespace (that is, “regular” articles about a topic, not talk pages, user pages, categories and such). I calculated this as the percentage of articles that exist in both languages out of the total number of articles in each of the languages (e.g., Intersection(EN, FR)/Count(FR)); the total article counts were taken from here. We did this for all pairs of languages (287^2). This calculation is interesting because it gives us some sense of the connections between the Wikis. It was also very valuable for Amir to have data to base his intuition on regarding which pairs of languages to target when extending their content translation product.
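In code, the measure for an ordered pair of languages is just one division; a trivial sketch (the numbers in the comment are made up, purely for illustration):

def coverage(intersection_count, target_article_count):
    """Share of the target Wiki's articles that also exist in the source Wiki."""
    return intersection_count / target_article_count

# Illustrative numbers only, not real statistics:
# coverage(intersection_count=150_000, target_article_count=180_000)  # -> ~0.83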

As I mentioned in the previous post, although at first I wanted to use the table that stores the language link data from Wikidata, it turned out to be a problematic idea. In the aforementioned Wikidata table there were 42,628,276 rows, 41,588,970 of which contained Wikipedia language link information. Because I wanted to calculate the co-occurrence of every two languages, it meant joining this table with itself, which would have resulted in about 10^15/2 rows. So while the query that returns this result is the cleanest, the data structure of the table is not built for these kinds of calculations. Therefore, I queried 287 tables (actually two tables, langlinks joined with page), one for each Wikipedia.

The basic code for each table is simple (here it is for English Wikipedia):

-- For the current Wiki (here: English), count how many of its main-namespace
-- pages link to each other language.
SELECT 'enwiki' AS source, ll_lang AS target, COUNT(*) AS count
FROM enwiki_p.langlinks
LEFT JOIN enwiki_p.page
  ON page_id = ll_from
WHERE page_namespace = 0   -- main namespace only ("regular" articles)
GROUP BY ll_lang;
The code counts the number of links the current language wiki has to all other languages.
A few things to notice:
  • The links are not symmetrical: there is, on average, less than a one percent difference between the links from language A to B (outlinks from A, taken from that language’s langlinks table) and the links from B to A (inlinks to A, i.e., other Wikis pointing at that language).
  • Using this method also enabled us to check whether there are irregularities in the data (mistakes, or a disproportionate difference between inlinks and outlinks). We found that, on average, the difference between the two counts was around 1%.

For the actual code, I wrote a Python script that automatically generates the simple query 287 times, one for each language, and then unites the results. The “complex” script returns three columns: source language, target language, and the count of the number of links that go from the source to the target. After getting that list, I used a pivot table in Excel to transform it into a matrix (an easy Python/R script can return the same result). Let’s talk about the data (as represented by the matrix).
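Roughly, the generation step is just string templating over the wiki codes; this is only a sketch of the idea, and build_queries and execute_query are placeholder names rather than my actual script:

QUERY_TEMPLATE = """
SELECT '{wiki}' AS source, ll_lang AS target, COUNT(*) AS count
FROM {wiki}_p.langlinks
LEFT JOIN {wiki}_p.page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang;
"""

def build_queries(wiki_codes):
    """Return one counting query per wiki (e.g., 'enwiki', 'hewiki', ...)."""
    return {wiki: QUERY_TEMPLATE.format(wiki=wiki) for wiki in wiki_codes}

# Uniting the results is then just concatenating the per-wiki rows:
# rows = []
# for wiki, sql in build_queries(["enwiki", "hewiki", "frwiki"]).items():
#     rows.extend(execute_query(sql))  # execute_query is a placeholder for your DB access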

Results:

  1. The co-occurrence matrix of counts can be found in a Google Spreadsheet.
  2. The links are not symmetrical: there is, on average, less than a one percent difference between the links from language A to B and the links from language B to A.
  3. Wikis with fewer than 3,500 links (which means they have fewer than 100 articles) have, on average, more than 20% more outlinks than inlinks.
  4. As the number of langlinks gets bigger (and, in most cases, so does the size of the Wiki), the difference and the variance between the inlinks and outlinks get smaller.
  5. I did find that some outlinks pointed to mistakes (zh-cn, zh-tw, nn) and fixed that.
  6. There are small languages with a disproportionate number of linked pages compared to their number of articles. This is due to (I think) bot-generated articles that don’t have internal links in the text, so they are not counted as articles.
  7. Arabic has 8% more inlinks than outlinks. There isn’t one Wiki that caused this difference, so it’s not a bot.
  8. Let’s dig deeper: a heat map of the number of links between Wikis with more than 50,000 articles, ordered by size (sorry about the quality, it is the best I could do). The diagonal is zero, since there are no language links from a language to itself.

As I mentioned, the two triangles are not symmetrical, because the counts (which are themselves not equal, but close enough) are divided by the number of articles in each Wiki. The heat map ranges from red (high level of congruence) to yellow (low level). The interesting data is found in the lower triangle, which shows the language links between small and large Wikipedias.
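For anyone who prefers to skip Excel, the pivot and the normalization can be done in a few lines of Python; a rough sketch, assuming the query results live in a links.csv with source, target and count columns, and that article_counts holds the per-wiki totals (both are placeholders with made-up numbers):

import pandas as pd
import matplotlib.pyplot as plt

links = pd.read_csv("links.csv")  # assumed columns: source, target, count
counts = links.pivot(index="source", columns="target", values="count").fillna(0)

# Illustrative article totals, keyed by the target language codes:
article_counts = pd.Series({"en": 5_000_000, "he": 180_000, "fr": 1_700_000})
coverage = counts.divide(article_counts, axis="columns")  # Intersection(A, B) / Count(B)

plt.imshow(coverage.values, cmap="YlOrRd")  # yellow = low congruence, red = high
plt.colorbar()
plt.show()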

LangLink correlations
Insights from the heat map:
  1. Most languages have strong connections to English.
  2. There is a group of interconnected Wikis centered around Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau). The Wikipedia community suggested that these languages may have a lot of articles about species that were created by a bot (like LsjBot).
  3. Piedmontese (spoken by over 1 million people in Piedmont, northwest Italy) is highly interconnected with Wikipedias from the Romance language family, as is the Latin Wikipedia.
  4. Not surprisingly, Chechen is mostly connected to Russian.
  5. Telugu (a language spoken in India) doesn’t have many interlinks, not even to English, Hindi or Bengali.
  6. There are other visible strong connections (such as Serbian and Serbo-Croatian), but they are not surprising, as they are practically the same language.

I think the results are interesting, what do you think?

All about Language Links in Wikipedia – Part 2

In Part 1, I talked about how language links are organized in Wikipedia (as part of the article’s code) and in Wikidata. Here I’m going to talk about how the data is stored, and get some frustration out of my system. In the next post, I will talk about my first task and its results.

Imagine that we want to use the data about the connections between the pages, or any other data in Wikipedia for that matter. It would be a real pain to check the source of every article to get that information. And anyway, these sources need to be saved and backed up somehow (and somewhere). To make the information easier to use and store, Wikipedia (and Wikidata) use structured databases. There are tables for page information, revisions, users, language links, text, and other functions (here is the full list of tables). However, there isn’t one database that combines all the information from the different Wikipedias; each Wiki has its own database. There are three ways to reach this information: downloading the dumps (XML or SQL); working with the database over SSH (access to Wikimedia Tool Labs is required) or by querying it directly using the Quarry web tool (which only needs a Wikipedia user account); or using the API, either directly or with a designated package like pywikibot. Choosing the right tool depends on the specific question and the amount of data needed to answer it.

Although this is not the main focus of this post, I was really frustrated to learn that there are only two, far from perfect, ways to extract text from pages, so I want to lay out the options. If you need a relatively small number of different texts, then the best way is to call that language’s API and request the text. If, on the other hand, you need many more samples (or, taken to the extreme, the text of every Wikipedia page), then your only option is to use the XML dumps. I think the biggest difference between these methods is that the API is allowed to go into the database, take what is needed and leave, while the dumps work in an all-or-nothing way. To find one page, you need to go over all the pages in that dump, which is usually every page in Wikipedia or, at minimum, all of the new pages from that day or month. Either way, it’s a lot. Fortunately, this limitation is only true for text.
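As an example of the API route, here is a minimal sketch that requests the current wikitext of a single page through the MediaWiki API (the page title is, of course, just an example):

import requests

def fetch_wikitext(title, lang="en"):
    """Fetch the current wikitext of a page through the MediaWiki API."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        },
    )
    pages = resp.json()["query"]["pages"]
    page = next(iter(pages.values()))  # we only asked for one page
    return page["revisions"][0]["*"]   # wikitext of the latest revision

# print(fetch_wikitext("Banana")[:200])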

Back to language links. Luckily, there is a table named langlinks that stores the links that typically appear in the left margin. The table is quite simple, having three columns: page id (ll_from), language of the link (ll_lang), and title of the corresponding page in the other language (ll_title). Therefore, it is perfect for querying. The only downside, which I have already mentioned, is that every Wiki has its own langlinks table. This data structure is not too bad, as we can call different databases in the same query, yet it’s still annoying. If you remember, I talked about Wikidata, which is supposed to have a table that unites all of the language links for the connected pages. I thought it would be Wikidata’s langlinks table. I was wrong. It turns out Wikidata has its own special tables, in addition to the regular Wiki tables, that store its data. The language links are stored in the table “wb_items_per_site”, which is also a simple table. However, there are several problems with it (here comes another frustration). First, and that’s not the table’s fault, it was impossible to find: the Wikidata documentation doesn’t talk about these special tables because, as it happens, they are part of Wikibase, which is the software behind Wikidata (in hindsight, I guess it makes sense). Second, the documentation of the tables is just terrible; I managed to understand it only after looking at the table’s content. Last, it wasn’t organized correctly for the problem I wanted to solve.
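To show how friendly the langlinks table is to query, here is a sketch of looking up a page’s language links on a database replica; the connection details are placeholders (on Tool Labs they come from the replica credentials file rather than being hard-coded), and the SELECT itself can be pasted into Quarry as-is:

import pymysql

# Placeholder connection details, not real credentials.
conn = pymysql.connect(host="REPLICA_HOST", user="USER", password="PASSWORD",
                       database="enwiki_p", charset="utf8")

SQL = """
SELECT ll_lang, ll_title
FROM langlinks
WHERE ll_from = %s;
"""

def language_links(page_id):
    """Return [(lang, title), ...] for all language links of the given page id."""
    with conn.cursor() as cur:
        cur.execute(SQL, (page_id,))
        return cur.fetchall()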

OK, I think I finally have nothing more to add about the structure, so I can move on to talking about why language links are so important to my task (and why, therefore, I have already spent endless words on the subject).

All about Language Links in Wikipedia – Part 1

Once upon a time, in 2001, there was a small internet entity named Wikipedia, which aspired to be a free online encyclopedia that everyone could contribute to. Initially, Wikipedia was in English, but not long after, new Wikipedias emerged in other languages like German, Catalan, Japanese, Hebrew, French, and so on. Wikipedia was not a small internet site anymore.

Like with humans, part of growing up is confronting challenges that you never knew existed. I will focus here on a rather uncomplicated challenge Wikipedia encountered: how do you link the same topic across different Wikipedias (that is, language editions)? The straightforward solution is fairly simple: if you have two articles about “banana”, one in English and the second in Hebrew, then connect them with some code in one article that points to the article in the other language. This is exactly what Wikipedia did until 2013 (don’t take my word on the exact date). They created interlanguage links that are part of the page’s code and look like this: [[en:banana]]. The “en” points to the Wikipedia version (enwiki), and the word “banana” is the title of the connected page.

Actually, the example I used here can’t be found on Wikipedia. The Hebrew page for banana is connected to the English page “Musa (genus)” and not to the page “Banana”. The page “Banana” doesn’t have a link to Hebrew. This is an example of one of the wonders of Wikipedia – the Hebrew editors decided that the page will talk about the botanical genus and that the fruit doesn’t need a page of its own. The information about the fruit is not missing, it’s just one of the sections in the genus page. I think this is a great example of the differences between the languages (Elephant is another one), because we can learn a lot about the cultures that are the basis of these differences.

But I digressed. Going back to the structure of the language links, it seems as if a link should be undirected; that is, if there is a link from he:Banana to en:Musa, then there should also be a link from en:Musa to he:Banana (in mathematics, we would call this relation symmetric). If we take this property one step further and also take several languages into account, then the language links should also be transitive. Do the language links always fulfill these requirements? Usually they do… but sometimes, like in the diagram, links are more complicated than that (the diagram was taken from the article “Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia”; this should have been in a footnote, but I don’t have a plugin for that yet). Notice that if we assume these two properties, en:Ball would end up with two links to the French Wikipedia!

Another unpredicted case is when a language link points to a subsection of an article and not to the article itself. I don’t have an example for that, but these things have been known to happen.

While taking into account these extreme scenarios, there are bots (that is, automated scripts) that are in charge of adding links between the languages. For example, if I just added a new article about banana in Hebrew and connected it to the English article, then I don’t need to go to all the other articles connected to it (in other languages); the bots will do it automatically. The underlying assumption, therefore, is that links are undirected. The downside of this process is that every article in Wikipedia carries in its code the connections to all the other articles. So if, for example, we have 5 articles that are connected to each other, then we have 20 (n*(n-1) ~ n^2) lines of code telling us about them. Sounds a bit redundant, especially as n gets bigger. This is where Wikidata comes into play (Wikidata has so much more to offer, so if you don’t know the project, check it out). Since 2013 (again, I’m not sure exactly when it started), Wikidata has a page for every article in Wikipedia that assembles all of the article’s structured data. One piece of the Wikidata page’s metadata is the list of Wikipedias in which the article exists (and its title in those Wikis). Therefore, instead of representing the links roughly n^2 times, they are saved only n times, once for each Wikipedia the article appears in. That is a big difference!

OK, now you know everything there is to know about the purpose of language links in Wikipedia! In the following post, I will talk about how language links are represented in the database and, most importantly, how all of this is related to my project.

My Outreach Program for Women Internship – Finally Writing

You know how you have your hypothetical, amazing version of yourself? For me, in that version I always have time for everything: classes, school chores, doing research, friends, partner in life, family and, well, life. I counted: I would need to be studying/working at least 11 hours a day just to keep up, not counting all the other activities I want to do. So this is my excuse for not writing about how my internship is going – I just didn’t get around to it.

So I’m Neta and I’m never on time. Next.

I’m doing a really cool project with the international team of the Wikimedia Foundation. We want to try to estimate the fraction of translated articles in a given Wikipedia versus originally written ones. It turns out that this number is unknown, because when someone starts a new article, they usually don’t indicate whether it is new original work or whether it was translated from another wiki. Moreover, even if they do, there isn’t a field in the database that saves this information.

Why is it interesting? First and foremost, it gives information about the editors’ behavior and how they create new pages. For example, finding that 25% of the Hebrew Wikipedia is translated from English means a totally different thing than finding that only 5% is translated, because it would highlight how dependent Hebrew is on English. Moreover, it could show the Wikimedia Foundation that they should put more emphasis on the content translation tool from English to Hebrew in order to increase the number of articles in Hebrew (try it here – it’s new and shiny!).

Another reason why it is interesting lies in the fact that editors on Wikipedia represent, though not fully, their culture. Whether an article is translated, and from which language, tells us something about the dependencies and relationships between these cultures. Furthermore, we can try to divide our results into categories and see whether different classes of articles are drawn from different cultures. In fact, it may well be the case that different cultures “specialize” in different content.

Unlike most projects in the program, my project is a research-based one, where the task itself, the method I should use to solve it, and the end point are all unknown. I don’t even know if it is feasible or whether we will get meaningful results… but that’s research. Exciting!

As more than half of the internship has passed, I think it is a good time to go back to the start. The beginning of my quest was not easy; it involved a lot of searching and reading and not a lot of code. My first goal was to learn the wiki world from different angles, while emphasizing content related to translations:

  • Communication: Learning to communicate with other editors on the village pump and on my talk page (it’s a weird concept, having my own billboard where people can write me messages, and only hypothetically mine, because my talk page is not that interesting and everyone can read it).
  • Structure (data): Understanding the schema and tables of the Wikipedia data.
  • Content: Getting lost, in a good way, in Wikipedia in order to stumble upon relevant and interesting things related to translation. For example, did you know that the Hebrew Wikipedia has a page specifically for translation dos and don’ts?
  • Accessing the data: Getting to know the mysterious ways of the dumps, the Wikipedia API, SSH, and the query tool that are part of Wikimedia Tool Labs. This is where I got drawn into trying to understand Wikidata, which is the central storage for the structured data of all the Wiki projects. I failed at that… (Amir, my mentor, asked me to blog about my frustrations, so I will expand on this in a different post).

All in all, I think that the decision to learn before starting to code was a good one. I actually feel that I now know what I’m talking about and what I need to do. That’s not to say that I don’t feel lost sometimes, or that I don’t still have things to learn (SSH and dumps – I’m looking at you), but it does mean that when I wake up in the morning with a list of tasks for the day, I actually know how to start working on them, at least most of the time.

Next time I will tell you all about working with language links, which became my first programming assignment.