Internship Summary, How Time Flies

It turns out that the official time of the internship has passed! Actually, the internship ended a month ago, but I have kept working on my project, and I am planning to continue doing so for the next couple of months.

So far, I have managed to build a prototype for my project. That is, I designed a few functions that read a specific revision’s text, discussion page, or edit summaries and search (mostly with regular expressions) for characteristics indicating that the page was translated from another language. At the moment, I’m still working on detecting Hebrew pages that were translated from English. I’m planning to build a few more functions for better detection; the most important one will compare the internal links in the Hebrew revision with those in its English counterpart. The linked pages are themselves connected through the langlinks table, so I can use this information to “understand” the text without translating the words.
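To give a sense of what I mean, here is a rough sketch (not my actual code) of such a comparison function. It assumes a Hebrew-to-English title mapping that was already built from the langlinks table; the function names and the mapping are only illustrative.

```python
import re

# Matches the target of an internal wiki link: [[Target]] or [[Target|label]]
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")

def internal_links(wikitext):
    """Return the set of page titles linked from a revision's wikitext."""
    return {match.strip() for match in WIKILINK_RE.findall(wikitext)}

def link_overlap(he_wikitext, en_wikitext, he_to_en):
    """Fraction of the Hebrew revision's links whose English counterparts
    (taken from a langlinks-derived mapping) also appear in the English page.
    he_to_en is assumed to be a dict of {hebrew_title: english_title}."""
    mapped = {he_to_en[t] for t in internal_links(he_wikitext) if t in he_to_en}
    if not mapped:
        return 0.0
    return len(mapped & internal_links(en_wikitext)) / len(mapped)
```

A high overlap would be one more signal, alongside the regular-expression features, that the Hebrew page was indeed translated from the English one.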

Before focusing on these functions, I built a few tools that filter the pages that are potentially translations. By querying the database, I extracted pages (by their title and id) that were created after their English counterpart. Then, using another query, I also found the last English version that could have served as the basis for the translation (i.e., the last English version created before the Hebrew page). I’m using two different methods to find the Hebrew version. The “dirty” method, also based on querying, returns the last version created by the user who created the page, that is, up until another editor (ignoring bots) edited it. I consider this dirty because sometimes the query returns a version dated a few months after the creation of the page, and sometimes it returns a version that is too immature, with the author still working to bring the page to a first steady version. The “clean” and more complicated method will use the data from the dumps (that is, all of it). We wish to “compress” the data and treat a series of edits by the same editor as one large edit, saving the starting time, the ending time, all of the revision ids that underlie the new edit, and the number of bytes at the start and at the end of the edit. The dumps are organized serially, so this is straightforward to do. That way, it will be easy to filter out bots and meaningless small edits, and also to pick the right revision for the translation comparison. The new edit unit is not yet perfect, as we will still need a time threshold to make sure that edits far apart in time are not merged into the same edit. As you can understand, this is not an easy task, at least not for me.
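To illustrate the “clean” method, here is a rough sketch of how a series of edits by the same editor could be merged into one large edit. The revision fields and the one-day threshold are assumptions I made up for the example, not final values.

```python
from datetime import timedelta

MAX_GAP = timedelta(days=1)  # assumed threshold so distant edits stay separate

def compress_revisions(revisions):
    """Merge consecutive revisions by the same (non-bot) editor into sessions.

    `revisions` is assumed to be a chronologically ordered list of dicts with
    the keys: rev_id, user, is_bot, timestamp (a datetime) and size (bytes).
    Each session records its start and end time, the underlying revision ids,
    and the page size after its first and last revision.
    """
    sessions = []
    current = None
    for rev in revisions:
        if rev["is_bot"]:
            continue  # ignore bots, as in the "dirty" query
        if (current is not None
                and rev["user"] == current["user"]
                and rev["timestamp"] - current["end_time"] <= MAX_GAP):
            current["end_time"] = rev["timestamp"]
            current["rev_ids"].append(rev["rev_id"])
            current["end_size"] = rev["size"]
        else:
            current = {
                "user": rev["user"],
                "start_time": rev["timestamp"],
                "end_time": rev["timestamp"],
                "rev_ids": [rev["rev_id"]],
                "start_size": rev["size"],
                "end_size": rev["size"],
            }
            sessions.append(current)
    return sessions
```

The first session by the page’s original author would then be the natural candidate to compare against the last English version created before it.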

My biggest bottleneck, which I have been working on for a while now, is building the infrastructure that can support the data structure and the Wikipedia text. Specifically, a script that reads all of the information from the dumps and saves only the relevant revisions and discussion versions in an easy-to-use way. I hope that I will have something ready in the next week (no promises), and then I will be able to start working with the whole data set! I’m trying to fully understand the best way to build this script, as it should run on the English Wikipedia, and that is not a negligible amount of data (unlike the Hebrew one). So this is where I’m at.
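To give an idea of the direction I am going in, here is a minimal sketch of such a streaming reader using only Python’s standard library. It is not the actual script: the `wanted_titles` set stands in for the output of the filtering queries, and the dump is assumed to be a bz2-compressed XML export.

```python
import bz2
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def stream_pages(dump_path, wanted_titles):
    """Yield (title, revisions) only for the pages we care about, without
    loading the whole dump into memory. `wanted_titles` is an assumed set of
    page titles produced by the earlier filtering queries."""
    with bz2.open(dump_path, "rb") as dump_file:
        for _, elem in ET.iterparse(dump_file, events=("end",)):
            if local(elem.tag) != "page":
                continue
            title = next(c.text for c in elem if local(c.tag) == "title")
            if title in wanted_titles:
                revisions = [
                    {local(field.tag): field.text for field in child}
                    for child in elem if local(child.tag) == "revision"
                ]
                yield title, revisions
            elem.clear()  # free the page element before moving on
```

I may also end up using a helper library such as mwxml instead of rolling my own parser, but the streaming idea stays the same.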

I hope I will continue writing and posting my progress here as well as on GitHub.

This summary post is also a great opportunity to say a few words about the internship. I had a great experience overall, and I think being part of the GNOME program is an amazing way to get to know the open source world. Specifically, having a mentor whom I could regularly nudge and talk to, and who directed me and taught me how to navigate this rich and complex world, was the most important part of the experience. It was also helpful to know that the learning process I am going through, and the feelings of frustration that sometimes arose, were not exclusively mine. Nevertheless, in hindsight, I think I could have made more use of the supportive network the platform offers: the emails, the IRC channel, the Phabricator page and other mediums. Of course, this is just another thing that I still need to learn. I also feel really lucky to have been given the opportunity to work on a project that allowed me to do things that were, at first, beyond my expertise, in order to learn and improve my skills.

As I wrote previously, this is just the middle of the project, and I hope that in a month I will have a new post updating you on our new and exciting developments.

More information and updates can be found here:


All about Language Links in Wikipedia – Part 2


In part 1, I talked about how language links are organized in Wikipedia (as part of the article’s code) and in Wikidata. Here I’m going to talk about how the data is stored, and get some frustration out of my system. In the next post, I will talk about my first task and its results.

Imagine that we want to use the data about the connections between pages, or any other data in Wikipedia for that matter. It would be a real pain to go over the source of every article to get that information, and in any case these sources have to be saved and backed up somehow (and somewhere). To make the information easier to store and use, Wikipedia (and Wikidata) use structured databases. There are tables for page information, revisions, users, language links, text, and other functions (here is the full list of tables). However, there isn’t one database that combines the information from the different Wikipedias; each wiki has its own database. There are three ways to reach this information: downloading the dumps (XML or SQL); working with the database over ssh (which requires access to Wikimedia Tool Labs) or querying it directly with the Quarry web tool (which only requires a Wikipedia user account); or using the API, either directly or through a designated package like pywikibot. Choosing the right tool depends on the specific question and on the amount of data needed to answer it.
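As a small example of the third option, this is roughly how the API can be asked for the language links of a single page. I am using the requests package here, and the Hebrew title is just a placeholder:

```python
import requests

API = "https://he.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "langlinks",
    "titles": "פייתון",   # placeholder page title
    "lllang": "en",        # only ask for the English counterpart
    "format": "json",
    "formatversion": 2,
}
data = requests.get(API, params=params).json()
page = data["query"]["pages"][0]
for link in page.get("langlinks", []):
    print(link["lang"], link["title"])
```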

Although this is not the main focus of this post, I was really frustrated to learn that there are only two, far from perfect, ways to extract text from pages, so I want to go over the options. If you need a relatively small number of texts, the best way is to call the API of the relevant language edition and request the text. If, on the other hand, you need many more samples (or, taken to the extreme, the text of every Wikipedia page), then your only option is to use the XML dumps. I think the biggest difference between these methods is that the API is permitted to go into the database, take what is needed and leave, while the dumps work in an all-or-nothing way. To find one page, you need to go over all the pages in that dump, which is usually every page in Wikipedia or, at a minimum, all of the new pages from that day or month. Either way, it’s a lot. Fortunately, this limitation only applies to text.
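For the first option, fetching the text of a handful of pages through the API looks roughly like this (again with a placeholder title); for anything bigger, the dumps are the only realistic route:

```python
import requests

API = "https://he.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "titles": "פייתון",   # placeholder page title
    "format": "json",
    "formatversion": 2,
}
data = requests.get(API, params=params).json()
page = data["query"]["pages"][0]
wikitext = page["revisions"][0]["slots"]["main"]["content"]
print(wikitext[:300])
```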

Back to language links. Luckily, there is a table named langlinks that stores the links that typically appear in the left margin. The table is quite simple, having three columns: the page id (ll_from), the language of the link (ll_lang), and the title of the corresponding page in the other language (ll_title). It is therefore perfect for querying. The only downside, which I have already mentioned, is that every wiki has its own langlinks table. This is not too bad, as we can address different databases in the same query, but it’s still annoying. If you remember, I talked about Wikidata, which is supposed to have a table that unites all of the language links of the connected pages. I thought it would be Wikidata’s langlinks table. I was wrong. It turns out Wikidata has its own special tables, in addition to the regular wiki tables, that store its data. The language links are stored in the table “wb_items_per_site”, which is also a simple table. However, there are several problems with it (here comes another frustration). First, and that’s not the table’s fault, it was impossible to find: the Wikidata documentation doesn’t mention these special tables because, as it happens, they are part of Wikibase, the software behind Wikidata (in hindsight, I guess that makes sense). Second, the documentation of the tables is just terrible; I managed to understand it only after looking at the table’s content. Last, it wasn’t organized the right way for the problem I wanted to solve.
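For reference, these are roughly the two queries I have in mind, written here as Python strings so they can be passed to a database connection, though they can just as well be pasted into Quarry. The title is a placeholder, and the wb_items_per_site column names are the ones I pieced together from the table’s content:

```python
# On the hewiki replica: the English counterpart of one Hebrew page.
LANGLINKS_QUERY = """
SELECT page_title, ll_title
FROM page
JOIN langlinks ON ll_from = page_id
WHERE ll_lang = 'en'
  AND page_namespace = 0
  AND page_title = 'פייתון';  -- placeholder title
"""

# On the wikidatawiki replica: all sitelinks of the item that has the given
# Hebrew sitelink, taken from Wikibase's wb_items_per_site table.
WIKIDATA_QUERY = """
SELECT ips_site_id, ips_site_page
FROM wb_items_per_site
WHERE ips_item_id = (
    SELECT ips_item_id
    FROM wb_items_per_site
    WHERE ips_site_id = 'hewiki'
      AND ips_site_page = 'פייתון'  -- placeholder title
);
"""
```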

OK, I think I finally have nothing more to add about the structure, so I can move on to talking about why language links are so important to my task (which is why I have already spent endless words on the subject).