In part 1 I talked about how language links are organized in Wikipedia (as part of the article’s code) and in Wikidata. Here I’m going to talk about how the data is stored and get some frustration out. In the next post, I will talk about my first task and its results.
Imagine that we want to use the data about the connections between pages, or any other data in Wikipedia for that matter. It would be a real pain to check the code of every article to get that information. And anyway, that code has to be saved and backed up somehow (and somewhere). To make the information easier to store and use, Wikipedia (and Wikidata) use structured databases. There are tables for page information, revisions, users, language links, text and other functions (here is the full list of tables). However, there isn’t one database that combines the information from all the different Wikipedias; each wiki has its own database. There are three ways to reach this information: downloading the dumps (XML or SQL); querying the database with SQL, either over ssh (which requires access to Wikimedia Tool Labs) or through the Quarry web tool (which only needs a Wikipedia user account); or using the API, either directly or through a dedicated package like pywikibot. Choosing the right tool depends on the specific question and the amount of data needed to answer it.
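To make the API route a bit more concrete, here is a minimal sketch of what asking one Wikipedia for a page's language links could look like. It only builds the request URL and parses a reply; it does not go over the network. The function names and the sample reply are my own inventions for illustration, and the reply shape assumes the API's default JSON format.

```python
from urllib.parse import urlencode


def langlinks_api_url(lang, title, limit="max"):
    """Build an API URL asking the `lang` Wikipedia for a page's language links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def parse_langlinks(response):
    """Flatten a decoded JSON reply into {language_code: title}."""
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links


# Hand-written reply in the shape I assume the API returns (not real data).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "title": "Example",
                "langlinks": [
                    {"lang": "de", "*": "Beispiel"},
                    {"lang": "fr", "*": "Exemple"},
                ],
            }
        }
    }
}

print(langlinks_api_url("en", "Albert Einstein"))
print(parse_langlinks(SAMPLE_RESPONSE))
```

For a handful of pages this kind of request is enough. For anything bigger, a package like pywikibot or a direct SQL query is probably the saner choice.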
Although this is not the main focus of this post, I was really frustrated to learn that there are only two, far from perfect, ways to extract text from pages, so I want to lay out the options. If you need a relatively small number of texts, the best way is to call the API of the relevant language's Wikipedia and request the text. If, on the other hand, you need many more samples (or, taken to the extreme, the text of every Wikipedia page), then your only option is to use the XML dumps. I think the biggest difference between these methods is that the API can go into the database, take what is needed and then leave, while the dumps work in an all-or-nothing way. To find one page, you need to go over all the pages in that dump, which usually means every page in that Wikipedia or, at minimum, all the new pages from that day or month. Either way, it’s a lot. Fortunately, this limitation only applies to text.
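The all-or-nothing nature of the dumps is easier to see in code. The sketch below streams through a dump page by page until it finds the title it wants. The tiny two-page "dump" is made up, and I am assuming the usual layout of a page element containing a title and a revision with the text; real dumps are far larger and the namespace version in the root element may differ.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def find_page_text(dump_file, wanted_title):
    """Scan the dump sequentially and return the text of one page, or None."""
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        if title == wanted_title:
            return text
        elem.clear()  # drop the content of pages we have already passed
    return None


# Made-up miniature dump, only for illustration.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>First page.</text></revision></page>
  <page><title>Beta</title><revision><text>Second page.</text></revision></page>
</mediawiki>"""

print(find_page_text(io.BytesIO(SAMPLE_DUMP), "Beta"))
```

There is no index to jump to: if the page you want is the last one in the file, every page before it gets parsed first. For a compressed dump, a file object from something like `bz2.open(path, "rb")` should be usable in place of the `BytesIO` object, though I have only tested the in-memory version here.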
Back to language links. Luckily, there is a table named langlinks that stores the links that typically appear in the left margin. The table is quite simple, with three columns: page id (ll_from), language of the link (ll_lang), and title of the corresponding page in the other language (ll_title). That makes it perfect for querying. The only downside, which I have already mentioned, is that every wiki has its own langlinks table. This structure is not too bad, as we can call different databases in the same query. Yet, it’s still annoying. If you remember, I talked about Wikidata, which is supposed to have a table that unites all of the language links for the connected pages. I thought it would be Wikidata’s langlinks table. I was wrong. It turns out that Wikidata has its own special tables, in addition to the regular wiki tables, that store its data. The language links are stored in the table “wb_items_per_site”, which is also a simple table. However, there are several problems with it (here comes another frustration). First, and that’s not the table’s fault, it was impossible to find. The Wikidata documentation doesn’t talk about these special tables because, as it happens, they are part of Wikibase, the software behind Wikidata (in hindsight, I guess it makes sense). Second, the documentation of the tables is just terrible. I managed to understand it only after looking at the table’s content. Last, it wasn’t organized in the right way for the problem I wanted to solve.
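Here is a rough sketch of how the two tables differ when you query them. It uses an in-memory SQLite database as a stand-in for the real MySQL replicas, and every row in it is made up. The langlinks columns are the three described above; the wb_items_per_site columns (ips_item_id, ips_site_id, ips_site_page) are what I understand the Wikibase schema to be, so treat them as an assumption and check them against the live table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT);
CREATE TABLE wb_items_per_site (
    ips_row_id    INTEGER PRIMARY KEY,
    ips_item_id   INTEGER,  -- numeric part of the Wikidata item id (assumed)
    ips_site_id   TEXT,     -- e.g. 'enwiki', 'dewiki' (assumed)
    ips_site_page TEXT      -- page title on that wiki (assumed)
);
""")

# Made-up rows: one wiki's langlinks table, keyed by local page id.
conn.executemany(
    "INSERT INTO langlinks VALUES (?, ?, ?)",
    [(1, "de", "Beispiel"), (1, "fr", "Exemple"), (2, "de", "Anderes")],
)
# Made-up rows: Wikidata's table, keyed by item, one row per wiki.
conn.executemany(
    "INSERT INTO wb_items_per_site (ips_item_id, ips_site_id, ips_site_page) "
    "VALUES (?, ?, ?)",
    [(100, "enwiki", "Example"), (100, "dewiki", "Beispiel"),
     (100, "frwiki", "Exemple"), (200, "enwiki", "Other")],
)


def links_from_page(conn, page_id):
    """Language links of one local page: {language: title}."""
    rows = conn.execute(
        "SELECT ll_lang, ll_title FROM langlinks WHERE ll_from = ?", (page_id,)
    )
    return dict(rows.fetchall())


def sitelinks_for_item(conn, item_id):
    """All pages connected to one Wikidata item: {site: title}."""
    rows = conn.execute(
        "SELECT ips_site_id, ips_site_page FROM wb_items_per_site "
        "WHERE ips_item_id = ?", (item_id,)
    )
    return dict(rows.fetchall())


print(links_from_page(conn, 1))
print(sitelinks_for_item(conn, 100))
```

The difference in shape is the point: langlinks answers "where does this page on this wiki link to?", while wb_items_per_site answers "which pages on all wikis belong to this item?".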
OK, I think I finally have nothing more to add about the structure, so I can move on to why language links are so important to my task (and why I have already spent so many words on the subject).