This week we are closing a number of items we have been pursuing for a while. I’m writing the last post as part of a lift exercise to ‘blog more’, and have added the last few visitors to our store crawlers. Let’s quickly list what we have done.
The Vomar visitor
With store details on individual pages, the Vomar website follows the same usage pattern as the Jumbo visitor. The main page contains a master list of pages to visit, so we do not need to resort to brute-force searching.
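The master-list pattern can be sketched roughly as follows. This is a minimal illustration, not the actual visitor: the `/winkels/` path prefix, the sample markup, and the function name `extractStoreLinks` are all assumptions, and a plain regular expression stands in for a real HTML parser to keep the example free of dependencies.

```typescript
// Hypothetical sketch: collect store detail links from a master list page.
// The path prefix and the sample markup below are assumptions for illustration.
function extractStoreLinks(indexHtml: string, pathPrefix: string): string[] {
  const hrefPattern = /href="([^"]+)"/g;
  const seen: Record<string, boolean> = {};
  const links: string[] = [];
  let match = hrefPattern.exec(indexHtml);
  while (match !== null) {
    const href = match[1] || "";
    // Keep only links below the store section, and skip duplicates.
    if (href.indexOf(pathPrefix) === 0 && !seen[href]) {
      seen[href] = true;
      links.push(href);
    }
    match = hrefPattern.exec(indexHtml);
  }
  return links;
}

// Made-up master list with one duplicate and one unrelated link.
const sampleIndex =
  "<ul>" +
  '<li><a href="/winkels/store-a">Store A</a></li>' +
  '<li><a href="/winkels/store-b">Store B</a></li>' +
  '<li><a href="/winkels/store-a">Store A again</a></li>' +
  '<li><a href="/contact">Contact</a></li>' +
  "</ul>";

const storeLinks = extractStoreLinks(sampleIndex, "/winkels/");
```

Each link in `storeLinks` would then be fetched and parsed as an individual store page, which is what keeps us away from brute-force searching.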
The MCD visitor
With all stores nicely listed on a single page, the MCD site is easy to crawl with Cheerio. Simple one-liners do the job. The only thing missing is the latitude and longitude of the store locations in a parsable format. We will have to resort to geocoding based on the address as part of a later quality improvement effort.
The Agrimarkt visitor
The Agrimarkt visitor follows the same pattern as the Vomar visitor. Nothing special to note, and no deviation from the pattern. Easy peasy.
The Jan Linders visitor
Jan Linders is mostly found in the south-eastern part of the Netherlands. It provides a clean JSON data stream for its store locations, with enough detail that we do not have to crawl the actual website.
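Consuming a JSON feed like this comes down to a small mapping step. The sketch below shows the idea under stated assumptions: the feed field names (`name`, `street`, `city`, `lat`, `lng`), the `StoreRecord` shape, and the sample data are all hypothetical, since neither the real feed nor our database schema is shown in this post.

```typescript
// Hypothetical shape of our own store record (the real schema is not shown here).
interface StoreRecord {
  brand: string;
  name: string;
  address: string;
  latitude: number | null;
  longitude: number | null;
}

// Accept numbers or numeric strings; anything else becomes null.
function toNumberOrNull(value: unknown): number | null {
  const n = typeof value === "string" ? parseFloat(value) : value;
  return typeof n === "number" && isFinite(n) ? n : null;
}

// Parse a JSON store feed and keep only the fields we need.
// The feed field names used below are assumptions for illustration.
function parseStoreFeed(json: string, brand: string): StoreRecord[] {
  const raw: unknown = JSON.parse(json);
  if (!Array.isArray(raw)) {
    throw new Error("expected a JSON array of stores");
  }
  return raw.map((item: any): StoreRecord => ({
    brand: brand,
    name: String(item.name),
    address: String(item.street) + ", " + String(item.city),
    latitude: toNumberOrNull(item.lat),
    longitude: toNumberOrNull(item.lng),
  }));
}

// Made-up sample feed: one store with coordinates, one without.
const sampleFeed =
  '[{"name":"Store A","street":"Example Street 1","city":"Town A","lat":"51.5","lng":5.97},' +
  '{"name":"Store B","street":"Example Street 2","city":"Town B"}]';

const feedRecords = parseStoreFeed(sampleFeed, "Jan Linders");
```

Stores without usable coordinates end up with `null` values, so they can be picked up later by a geocoding pass instead of being dropped.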
The Nettorama visitor
Nettorama seems to position itself as a discounter on the Dutch market, like the German-originated Aldi and Lidl. The store locations are embedded in data islands in a single page, which makes it really easy to index.
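Reading data islands can look roughly like this. I am assuming here that the islands are `<script type="application/json">` blocks, which may not match how the Nettorama page actually embeds its data; the function name `extractDataIslands` and the sample page are made up for illustration.

```typescript
// Hypothetical sketch: pull JSON "data islands" out of a page.
// Assumes islands are <script type="application/json"> elements.
function extractDataIslands(html: string): unknown[] {
  const islandPattern =
    /<script[^>]*type="application\/json"[^>]*>([\s\S]*?)<\/script>/g;
  const islands: unknown[] = [];
  let match = islandPattern.exec(html);
  while (match !== null) {
    try {
      islands.push(JSON.parse(match[1] || ""));
    } catch (e) {
      // Skip islands that do not contain valid JSON.
    }
    match = islandPattern.exec(html);
  }
  return islands;
}

// Made-up page: two valid islands, one regular script, one broken island.
const samplePage =
  '<script type="application/json">{"name":"Store A","city":"Town A"}</script>' +
  '<script type="text/javascript">var x = 1;</script>' +
  '<script type="application/json">{"name":"Store B","city":"Town B"}</script>' +
  '<script type="application/json">{not valid json</script>';

const islands = extractDataIslands(samplePage);
```

Because every island already holds structured data, one pass over a single page is enough and no per-store requests are needed.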
The Poiesz visitor
I had never heard of it before, but the Poiesz brand (I don’t really know how to pronounce it correctly) seems to be a regionally focused brand in the northern part of the Netherlands. Like the Nettorama store locator, the store webpage contains all the information we need to create entries in our database.
The Spar visitor
Last, but effort-wise not least, is the Spar brand. Spar is a concept in which local store owners collaborate under a common name, called a franchise. The site is server generated and restricts its results by location. I’m not sure how many results are returned per query, or whether the cut-off is based on a maximum number of results or a maximum distance. This is a good candidate to review for quality later on as well.
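One way to deal with location-restricted results is to query from several locations and merge the batches, dropping stores that show up more than once. The sketch below is not the actual Spar visitor: the `id` field, the sample batches, and the idea of a fixed result cap are assumptions, since I do not know how the site limits its results.

```typescript
// Hypothetical result shape; the real site may not expose a stable id.
interface LocatedStore {
  id: string;
  name: string;
}

// Merge result batches from several query locations, keeping the
// first occurrence of every store id.
function mergeLocationResults(batches: LocatedStore[][]): LocatedStore[] {
  const seen: Record<string, boolean> = {};
  const merged: LocatedStore[] = [];
  for (const batch of batches) {
    for (const store of batch) {
      if (!Object.prototype.hasOwnProperty.call(seen, store.id)) {
        seen[store.id] = true;
        merged.push(store);
      }
    }
  }
  return merged;
}

// A batch that reaches an assumed cap may have been truncated,
// which would suggest querying that area more densely.
function isLikelyTruncated(batch: LocatedStore[], assumedCap: number): boolean {
  return batch.length >= assumedCap;
}

// Made-up batches from two overlapping query locations.
const batchNorth: LocatedStore[] = [
  { id: "s1", name: "Store 1" },
  { id: "s2", name: "Store 2" },
];
const batchSouth: LocatedStore[] = [
  { id: "s2", name: "Store 2" },
  { id: "s3", name: "Store 3" },
];

const mergedStores = mergeLocationResults([batchNorth, batchSouth]);
```

Comparing how often batches hit the assumed cap against how far apart the query locations are would be one way to find out whether the site limits by count or by distance.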
And with that last visitor created we finish our initial effort of indexing store locations. Here are some numbers:
- Total number of stores in database: 4106
- Total number of brands indexed: 25
- Total number of visitors created: 22
From the initial list of brands in scope, Sanders was removed because it was incorporated into the EMTÉ brand in 2011. Dagwinkel was added instead, since it was automatically brought into the list by the Attent visitor. I also corrected the Wikipedia page I used as a reference when creating my original list.
The list of stores per brand (concept) indexed as of March 9, 2014:
We will also provide this summary view through our API shortly, where it will be updated automatically when things change.
With the above, I will now leave you for a while. I am done with my lift exercise and need to spend some more time on another interesting topic. I will blog about that as well, and will be back with work on the Smarter Grocery App in the future, so stay tuned!