Greetings from Germany

This post discusses the two main German supermarket brands with stores in the Netherlands, and the peculiar way each of them publishes its store locations. It has taken a whole week to index them; here's why.

The Aldi visitor

So far, the Aldi site has been the most difficult one to crawl and index. There seem to be a few reasons for this:

  • They do not handle their store locations themselves; a third party hosts their store locator.
  • The store locator only allows searching for stores within a radius around a given geographical point, such as a city or postal code. There is no way to get all stores in one go.
  • The store results are returned through a paging mechanism.
  • Stores are returned regardless of country borders, so results also come in from Belgium and Germany.
  • Both the third party and the retailer itself operate mainly from Germany, and the mechanisms and content seem largely focused on that area, with the Netherlands being poorly supported in an attempt to reuse the same mechanism.

This is what the main store locator page looks like:

[Screenshot: the Aldi store locator page (aldi-winkels)]

As mentioned, our crawler needs to be quite smart to index all store locations. It needs to search around a set of geographical locations with a certain range, and repeat this until the entire area of the Netherlands is covered. For each area, it needs to handle the paging mechanism, which returns 5 locations at a time. And finally, it needs to check that a store is actually located in the Netherlands before adding it to the index.
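That last check can be sketched as a small helper. This is a hypothetical function, not the actual visitor code; it relies on the fact that Dutch postal codes consist of four digits (no leading zero) followed by two letters, which neither German nor Belgian codes have:

```javascript
// Hypothetical helper, not taken from the actual visitor: decide whether a
// store returned by the locator lies in the Netherlands before indexing it.
// Dutch postal codes are four digits (1000-9999) plus two letters.
function isDutchPostalCode(code) {
  return /^[1-9][0-9]{3}\s?[A-Z]{2}$/.test(String(code || '').trim().toUpperCase());
}
```

A Dutch code such as '3011 AD' passes, while a German five-digit code such as '50667' or a Belgian four-digit code such as '2000' is rejected.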

[Image: coverage map of the Netherlands]

To start off with, this is the coverage map of the Netherlands used to index the stores, arrived at through trial and error. The topmost area somehow allows a radius of 100 km to be indexed around Leeuwarden. For the remainder, there are 50 km circles located around Amsterdam, Utrecht, Apeldoorn, Rotterdam, Goes, Breda, Eindhoven and Maastricht. This gives full coverage of the Netherlands, but we will need to check the border areas at locations where circles intersect or come close to the border; we might be missing stores in those edge cases.

The visitor implementation looks like this:

We can see the midline postal codes in the array; these will be iterated. For each postal code, the search area is set and the results are iterated. The last page is detected by the next button missing a link to the next page, at which point the number of iterations is counted. Once the last postal code has been processed, the store index is synchronized with the Azure tables. Note that the phone number and the store location coordinates are not indexed: they are not available, since the map is drawn from non-GPS coordinates. We will need to obtain these later by other means.
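Since the actual implementation was only shown as a screenshot, here is a rough sketch of that loop with hypothetical names (the postal codes, the CSS class on the next button, and the injected functions are all assumptions, not the real code):

```javascript
// Rough sketch of the Aldi visitor loop (hypothetical names; the real
// visitor differs). Illustrative centre postal codes, one per coverage circle.
const midlinePostalCodes = ['8911', '1012', '3511', '7311', '3011',
                            '4461', '4811', '5611', '6211'];

// On the last results page the "next" button is rendered without a link,
// so paging ends as soon as no href is present on it.
function hasNextPage(html) {
  return /class="next-page"[^>]*\bhref=/.test(html);
}

// fetchPage(postalCode, page) and parseStores(html) are injected so the
// loop itself stays independent of the HTTP and scraping details.
async function indexAllStores(fetchPage, parseStores, onStore) {
  for (const postalCode of midlinePostalCodes) {
    let page = 1;
    let html;
    do {
      html = await fetchPage(postalCode, page); // locator returns 5 stores per page
      for (const store of parseStores(html)) onStore(store);
      page += 1;
    } while (hasNextPage(html));
  }
  // After the last postal code: synchronize the collected stores
  // with the Azure tables (omitted here).
}
```

Injecting the fetch and parse functions keeps the paging logic testable without hitting the live locator.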

The Lidl visitor

Lidl is the second brand from Germany and, coincidentally or not, its store locator has exactly the same characteristics as Aldi's, some of them even worse. This is what their page looks like:

[Screenshot: the Lidl store locator page (lidl-winkels)]

Luckily, there is an easy workaround for this site. On another page, it lets you download Point Of Interest (POI) files for different car navigation systems. These files contain all the information fields we need, in a structured format. And because Lidl generates and hosts these files themselves, they comply with our criteria for high-quality data: coming from the retailer directly.

However, dealing with these files brings new logic to our playground: the file unzipper! This is what the Lidl visitor looks like:

The requestor of the file uses the more low-level http library to read the response body into a buffer. This buffer is then processed as a zip file by the NodeJS adm-zip package. For each file found in the zip, the visitor tries to identify the one whose entries comply with the Dutch postal code format, using a regular expression. The phone number is missing here too, so it needs to be added at a later time as well. But for now, this last visitor brings our total of indexed shops close to our target of 3000 for this week.

This week's stats

We've been online with this list for a week now through ProgrammableWeb. It took a few days to get the API approved, and then I needed to describe the API in more detail before it went online at the end of the week. Initial results show that the API itself is not being queried much yet, but the blog post about how it was conceived is getting a lot of new hits, because it has been designated as the primary blog for the API. Fun, isn't it…
[Screenshot: New Relic traces (newrelic-traces)]

The throughput of the NodeJS-powered Azure Mobile Services is quite compelling. And we're still on the free tier of these services; we haven't needed to pay a single dollar to get it up and running so far. Performance is good and the sizing seems sufficient for now. I do expect to fit all 4000 of our stores under the limit using the information model provided.

That's it again for this post. Next week we'll see how far we get with indexing stores, and we need to put some additional effort into improving the completeness of the information for the German stores. 'Deutsche Gründlichkeit' (German thoroughness) may be a reality in car manufacturing, but I didn't find the Aldi and Lidl store data to have the level of quality and completeness that I need. But hey, that makes our data feed more valuable if we add that to the equation, doesn't it?
