July 30, 2013

Paging Large Datasets in Semantic MediaWiki

I’ve been working with Yaron Korens new Miga project to provide an offline, mobile optimized way of accessing the nutrition data in Wikinosh. It works pretty well! However, we hit a couple of walls trying to get the 20,000+ food items in Wikinosh exported for use in Miga. If you are trying to page thru large data sets in Semantic MediaWiki read on.

Large Offsets

The best way to do this is to use queries of 500 items at a time and use incrementing offsets to retrieve each set of rows. This works really well and is easy to code. However, when you run this the first time if you have over 5,000 records you will notice that your while loop will never finish. Very odd. When you look at the data you will also see that after a few thousand rows, the first set of data just repeats over and over and you never reach the end.

A quick look in SMW_QueryProcessor.php on line 611 shows this:

$params['offset'] = array(
    'type' => 'integer',
    'default' => 0,
    'negatives' => false,
    'upperbound' => 5000 // TODO: make setting

Now that makes sense. It turns out that if your offset value is greater than 5,000 it won’t be used. To make matters worse, this doesn’t generate an error it simply ignores the offset. Ugh! The TODO is still a to do, there is no setting for this. If your working with large data, you will want to increase the offset to something large enough to accommodate your largest set. Since Wikinosh has 21,000+ items, I set it to 30000.

Query Max

Now that you have your offset changed you can go ahead and run your export again. It will chug along and then hit 10,000 records and stop. What? Sure, you increased your offset however you probably still have the default $smwgQMaxLimit value of 10000. So, if your offset is greater than the query max, you get 0 rows and your done. This setting does have a way to override, set $smwgQMaxLimit = 30000; in your LocalSettings.php and you will be ready to go.

With these two changes in place, I can now easily page through 500 records at a time and get everything out exactly as I expect.

There are currently two bugs open on these behaviors. Check those for current status of these settings: SMW Ask query offset has a hardcoded limit and SMW Ask query offset should error if maximum offset is exceeded.

Semantic MediaWiki Techie

Previous post
Don’t Ruin the Past with Reality I really enjoyed this article and photo for The 40-Year-Old Photo That Gives Us A Reason To Smile. Great story behind the photo, and it’s a
Next post
Password v. Passphrase Today’s xkcd highlights what might be one of the biggest security mistakes of our digital age: using passwords over passphrases. The comic does a