It turns out that the official period of the internship has passed! Actually, the internship ended a month ago, but I have kept working on my project, and I plan to continue doing so for the next couple of months.
So far, I have succeeded in building a prototype for my project. That is, I designed a few functions that read a specific revision’s text, discussion page or edit summaries and search for characteristics (mostly using regular expressions) that indicate that the page was translated from another language. At the moment, I’m still working on detecting Hebrew pages that were translated from English. I’m planning to build a few more functions for better detection; the most important one will compare the interlinks within the Hebrew revision with those in its English counterpart. The interlinks are themselves connected across languages in the langlinks table, so I can use this information to “understand” the text without translating the words.
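To make the two ideas above concrete, here is a minimal sketch in Python. It is not the project's actual code: the regular expressions, the function names, and the plain dictionary standing in for the langlinks table are all assumptions chosen for illustration. The link comparison uses Jaccard overlap, which is one reasonable similarity measure among several.

```python
import re

# Hypothetical marker patterns; the real project may search for different ones.
TRANSLATION_PATTERNS = [
    re.compile(r"translated\s+from", re.IGNORECASE),
    re.compile(r"תורגם\s+מ"),  # Hebrew: "was translated from"
    re.compile(r"תרגום\s+מ"),  # Hebrew: "translation from"
]


def mentions_translation(text):
    """Return True if an edit summary or talk-page text matches any marker."""
    return any(pattern.search(text) for pattern in TRANSLATION_PATTERNS)


def link_overlap(hebrew_links, english_links, langlinks):
    """Jaccard overlap between the two pages' link sets.

    `langlinks` is assumed to be a dict from Hebrew titles to English titles,
    standing in for rows of the langlinks table. Hebrew links without an
    English counterpart are ignored.
    """
    mapped = {langlinks[title] for title in hebrew_links if title in langlinks}
    english = set(english_links)
    if not mapped or not english:
        return 0.0
    return len(mapped & english) / len(mapped | english)
```

A high overlap combined with a matching edit summary would be a stronger signal than either one alone, although any threshold would have to be tuned on real pages.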
Before focusing on these functions, I built a few tools that filter for pages that are potentially translations. By querying the database, I extracted pages (by their title and id) whose Hebrew version was created after the English page. Then, using another query, I found the last English version that could have been used as a basis for the translation (i.e., the last English version before the Hebrew page was created). I’m using two different methods to find the Hebrew version. The “dirty” method, which also relies on queries, returns the last version made by the user who created the page before any other editor (ignoring bots) edited it. I consider this dirty because sometimes the query returns a version from a few months after the page’s creation date, and sometimes it returns a version that is too immature, where the author is still working to bring the page to a first stable version. The “clean” and more complicated method will use the data from the dumps (that is, all of it). We wish to “compress” the data by treating a series of edits by the same editor as one large edit, saving the start time, the end time, all of the edit ids that underlie the new edit, and the number of bytes at the start and end of the edit. The dumps are organized serially, so this is straightforward to do. That way, it will be easy to filter out bots and small, meaningless edits, and also to detect the right edit version for the translation comparison. The new edit version is not yet perfect, as we still need to add a timing variable to make sure that edits far apart in time are not merged into the same edit version. As you can understand, this is not an easy task, at least not for me.
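The “compression” step could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Revision` and `EditSession` classes, the `compress_edits` name, and the 24-hour default for the timing variable are all hypothetical, and “bytes at the start” is interpreted here as the page size just before the session's first edit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Revision:
    rev_id: int
    user: str
    timestamp: datetime
    size: int  # page size in bytes after this revision


@dataclass
class EditSession:
    user: str
    start: datetime
    end: datetime
    rev_ids: list
    bytes_start: int  # page size before the first edit in the session
    bytes_end: int    # page size after the last edit in the session


def compress_edits(revisions, max_gap=timedelta(hours=24)):
    """Merge consecutive edits by the same user into one session.

    `revisions` must be in chronological order, as in the dumps. An edit that
    comes more than `max_gap` after the previous one starts a new session,
    even if the editor is the same.
    """
    sessions = []
    previous_size = 0
    for rev in revisions:
        last = sessions[-1] if sessions else None
        if (last is not None and last.user == rev.user
                and rev.timestamp - last.end <= max_gap):
            last.end = rev.timestamp
            last.rev_ids.append(rev.rev_id)
            last.bytes_end = rev.size
        else:
            sessions.append(EditSession(rev.user, rev.timestamp, rev.timestamp,
                                        [rev.rev_id], previous_size, rev.size))
        previous_size = rev.size
    return sessions
```

With sessions in hand, bots can be dropped by user name and trivial edits by the difference between `bytes_end` and `bytes_start`.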
My biggest bottleneck, which I have been working on for a while now, is building the infrastructure that can support the data structure and the Wikipedia text. Specifically, I need a script that reads all of the information from the dumps and saves only the relevant revisions and discussion versions in an easy-to-use way. I hope to have something ready within the next week (no promises), and then I will be able to start working with the whole dataset! I’m still working out the best way to build this script, as it should also run on the English Wikipedia, and that is not a negligible amount of data (unlike the Hebrew one). So this is where I’m at.
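One possible shape for such a script is a streaming parser that never holds the whole dump in memory. The sketch below uses `iterparse` from Python's standard library on a tiny hand-written sample in the MediaWiki XML export layout. The `iter_revisions` name, the fields kept, and the sample itself are assumptions for illustration; a real dump is compressed and far larger, so it would be opened with something like `bz2.open(path, "rb")` instead of an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (hypothetical content).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <comment>translated from English</comment>
      <text>Hello</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield one dict per revision while streaming through the XML."""
    title, ns = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = int(elem.text)  # 0 = article, 1 = talk (discussion) page
        elif name == "revision":
            row = {"title": title, "ns": ns, "user": None,
                   "comment": "", "text": ""}
            for child in elem:
                key = local(child.tag)
                if key == "id":
                    row["rev_id"] = int(child.text)
                elif key == "timestamp":
                    row["timestamp"] = child.text
                elif key in ("comment", "text"):
                    row[key] = child.text or ""
                elif key == "contributor":
                    for sub in child:
                        if local(sub.tag) in ("username", "ip"):
                            row["user"] = sub.text
            yield row
            elem.clear()  # free the revision text once it has been handled
        elif name == "page":
            elem.clear()


for row in iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(row["title"], row["rev_id"], row["user"], row["comment"])
```

Because the function is a generator, the caller decides what counts as “relevant” and writes only those rows to disk, which keeps the filtering logic separate from the parsing.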
I hope to keep writing and posting my progress here as well as on GitHub.
This summary post is also a great opportunity to say a few words about the internship. I had a great experience overall, and I think being part of the GNOME program is an amazing way to get to know the open source world. Specifically, having a mentor whom I could regularly nudge and talk to, and who also directed me and taught me how to navigate this rich and complex world, was the most important part of the experience. It was also helpful to know that the learning process I am going through, and the feelings of frustration that sometimes arose, were not exclusively mine. Nevertheless, in hindsight, I think that I could have used the supportive network the platform offers (emails, the IRC channel, the Phabricator page and other channels) more than I did. Of course, this is just another thing that I still need to learn. I also feel really lucky to have been given the opportunity to work on a project that allowed me to do things that were, in the beginning, beyond my expertise, in order to learn and improve my skills.
As I wrote previously, this is just the middle of the project, and I hope that in a month I will have a new post with updates on our new and exciting developments.
More information and updates can be found here:
- Thoughts and practical directions about how to detect translated articles.
- My weekly progress report.
- The project’s GitHub where you can also find further documentation and basic results.
- Results for calculating similarity of Wikipedia languages.