You know how you have your hypothetical amazing version of yourself? For me, that version always has time for everything: classes, school chores, research, friends, my partner, family and, well, life. I counted: I would need to be studying or working at least 11 hours a day just to make ends meet, not counting all the other things I want to do. So this is my excuse for not writing about how my internship is going: I just didn’t get around to it.
So I’m Neta and I’m never on time. Next.
I’m doing a really cool project with the international team of the Wikimedia Foundation. We want to estimate what fraction of the articles in a given Wikipedia were translated rather than originally written. It turns out that this number is unknown, because when someone starts a new article, they usually don’t indicate whether it is an original work or a translation from another wiki. Moreover, even if they do, nothing in the database records it.
Why is it interesting? First and foremost, it gives information about the editors’ behavior and how they create new pages. For example, finding that 25% of the Hebrew Wikipedia is translated from English means something totally different from finding that only 5% is, because the higher number would highlight how dependent Hebrew is on English. Moreover, it could show the Wikimedia Foundation that they should put more emphasis on the content translation tool from English to Hebrew in order to increase the number of articles in Hebrew (try it here – it’s new and shiny!).
Another reason why it is interesting lies in the fact that editors on Wikipedia represent, though not fully, their culture. Whether an article is translated, and from which language, tells us something about the dependencies and relationships between these cultures. Furthermore, we can try to divide our results into categories and see whether different classes of articles are drawn from different cultures. In fact, it may well be the case that different cultures “specialize” in different content.
Unlike most projects in the program, mine is a research-based one, where the task itself, the method I should use to solve it and the end point are all unknown. I don’t even know if it is feasible or whether we will get meaningful results … but that’s research. Exciting!
As more than half of the internship has passed, I think it is a good time to go back to the start. The beginning of my quest was not easy, and it involved a lot of searching and reading and not much code. My first goal was to learn the wiki world from different angles, with an emphasis on content related to translation:
- Communication: Learning to communicate with other editors on the village pump and on my talk page (it’s a weird concept, having my own billboard where people can write me messages and where everyone can read them, though only hypothetically, because my talk page is not interesting).
- Structure (data): Understanding the schema and tables of the Wikipedia data.
- Content: Getting lost, in a good way, in Wikipedia in order to stumble upon relevant and interesting things related to translation. For example, did you know that the Hebrew Wikipedia has a page specifically for translation dos and don’ts?
- Accessing the data: Getting to know the mysterious ways of the dumps, the Wikipedia API, SSH and the query tool that are part of Wikimedia Tool Labs. Here is where I got drawn into trying to understand Wikidata, which is the central storage for the structured data of all the wiki projects. I failed in that… (Amir, my mentor, asked me to blog about my frustrations, so I will expand on this in a different post).
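To make the "Wikipedia API" item a bit more concrete, here is a minimal sketch of asking the API which other language editions have a version of a given article. It is only an illustration under a few assumptions: the function names are made up for this example, the default wiki (`he.wikipedia.org`) and the User-Agent string are placeholders, and the parsing assumes the JSON shape returned with `formatversion=2`. It also ignores result continuation, so treat it as a starting point rather than a finished tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def langlinks_url(title, wiki="he.wikipedia.org"):
    """Build a MediaWiki API URL asking for the language links of one page."""
    params = {
        "action": "query",
        "prop": "langlinks",   # interlanguage links of the page
        "titles": title,
        "lllimit": "max",      # as many links as one request allows
        "format": "json",
        "formatversion": "2",  # assumed: pages come back as a list
    }
    return f"https://{wiki}/w/api.php?" + urlencode(params)


def parse_langlinks(response):
    """Turn a decoded API response into {language code: article title}."""
    links = {}
    for page in response.get("query", {}).get("pages", []):
        # Pages without any language links simply have no "langlinks" key.
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["title"]
    return links


def fetch_langlinks(title, wiki="he.wikipedia.org"):
    """Fetch and parse language links (needs network access)."""
    request = Request(
        langlinks_url(title, wiki),
        # Placeholder User-Agent; a real script should identify itself properly.
        headers={"User-Agent": "langlinks-sketch/0.1 (example)"},
    )
    with urlopen(request) as reply:
        return parse_langlinks(json.load(reply))
```

Calling `fetch_langlinks` with an article title should give back a dictionary keyed by language code, and `parse_langlinks` can be tried offline on a saved response.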
All in all, I think that the decision to learn before starting to code was a good one. I actually feel that I now know what I’m talking about and what I need to do. That is not to say that I never feel lost or that I have nothing left to learn (SSH and dumps – I’m looking at you), but it does mean that when I wake up in the morning with a list of tasks for the day, I actually know how to start working on them, at least most of the time.
Next time I will tell you all about working with language links, which became my first programming assignment.