Job done!

This week we are closing a number of items we have been pursuing for a while. I’m writing the last post as part of a LIFT exercise to ‘blog more’, and have added the last few visitors to our store crawlers. Let’s quickly list what we have done.

The Vomar visitor

[Screenshot: Vomar store locator]

With store details on individual pages, the Vomar website follows the same pattern as the Jumbo visitor. The main page contains a master list of pages to visit, so we don’t need to resort to brute-force searching.
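
Although I stopped listing full visitor code a while ago, a minimal sketch of this master-list-plus-detail-pages pattern could look like the snippet below. The base URL and the selectors are made up for illustration; the real page structure differs.

    var request = require('request');
    var cheerio = require('cheerio');

    var baseUrl = 'https://www.vomar.nl';                    // assumed base URL, illustration only

    request(baseUrl + '/winkels', function (error, response, body) {
        if (error || response.statusCode !== 200) return console.log('could not load master list');
        var $ = cheerio.load(body);
        // Collect the links to the individual store pages from the master list.
        var links = [];
        $('a.winkel-link').each(function () {                // hypothetical selector
            links.push(baseUrl + $(this).attr('href'));
        });
        // Visit each detail page and pick up the store properties there.
        links.forEach(function (link) {
            request(link, function (err, res, page) {
                if (err) return;
                var $$ = cheerio.load(page);
                console.log({
                    name: $$('h1').first().text().trim(),
                    address: $$('.adres').text().trim(),     // hypothetical selector
                    city: $$('.plaats').text().trim()        // hypothetical selector
                });
            });
        });
    });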

The MCD visitor

[Screenshot: MCD store locator]

With all stores nicely listed on a single page, the MCD site is easy to crawl with Cheerio. Simple one-liners do the job. The only thing missing is the latitude and longitude of the store locations in a parsable format; we will have to resort to geocoding based on the address as part of a later quality improvement effort.
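
To give a flavour of those one-liners, a sketch along these lines would do it. The URL and selectors are assumptions; the real MCD markup is not reproduced here.

    var request = require('request');
    var cheerio = require('cheerio');

    request('https://www.mcd-supermarkt.nl/winkels', function (error, response, body) {  // assumed URL
        if (error) return console.log(error);
        var $ = cheerio.load(body);
        $('div.winkel').each(function () {                       // one element per store (hypothetical)
            console.log({
                name: $(this).find('h3').text().trim(),          // one-liner per field
                address: $(this).find('.adres').text().trim(),
                city: $(this).find('.plaats').text().trim(),
                latitude: null,                                  // not exposed; geocode later
                longitude: null
            });
        });
    });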

The Agrimarkt visitor

[Screenshot: Agrimarkt store locator]

The Agrimarkt visitor follows the same pattern again as the Vomar visitor. Nothing special to note, and no deviation from the pattern. Easy peasy.

The Jan Linders visitor

[Screenshot: Jan Linders store locator]

Jan Linders is mostly found in the south-eastern part of the Netherlands. It provides a clean JSON data stream for its store locations, with enough detail that we don’t have to crawl the actual website.

The Nettorama visitor

[Screenshot: Nettorama store locator]

Nettorama seems to position itself as a discounter on the Dutch market, like the German-originated Aldi and Lidl. The store locations are embedded in data islands in a single page, which makes it really easy to index.

The Poiesz visitor

[Screenshot: Poiesz store locator]

I had never heard of it before, but the Poiesz brand (I don’t really know how to pronounce it correctly) seems to be a regionally focused brand for the northern part of the Netherlands. Like the Nettorama store locator, the store webpage contains all the information we need to create entries in our database.

The Spar visitor

[Screenshot: Spar store locator]

Last, but effort-wise not least, is the Spar brand. Spar is a concept in which local store owners collaborate under a common name; call it a franchise. The site is server-generated and restricts the results by location. I’m not sure how many results are returned, and whether the cut-off is a fixed count or a maximum distance. This is a good candidate to review for quality later on as well.

Some statistics

And with that last visitor created we finish our initial effort of indexing store locations. Here are some numbers:

  • Total number of stores in database: 4106
  • Total number of brands indexed: 25
  • Total number of visitors created: 22

From the initial list of brands in scope, Sanders was removed as it was incorporated into the EMTÉ brand in 2011. Instead, Dagwinkel was added, since it was automatically brought into the list by the Attent visitor. I made a correction to the Wikipedia page I used as a reference to create my original list.

The list of stores per brand (concept) indexed as of March 9, 2014:

Brand           Stores
AH                 816
Aldi               478
Jumbo              409
Lidl               382
Plus               254
C1000              251
Spar               235
Coop               143
EMTÉ               129
Troefmarkt         125
Attent             104
DekaMarkt           68
Deen                66
Poiesz              64
Hoogvliet           63
Vomar               60
Jan Linders         57
Dirk                55
CoopCompact         53
AH TOGO             46
Boni                39
MCD                 35
SuperCoop           34
AH XL               34
Nettorama           31
Bas                 30
Digros              17
AH DNTG             13
Dagwinkel           10
Agrimarkt            5

We will also provide this summary view through our API shortly; it will then be updated automatically when things change.

With the above content I will now leave you for a while. I am done with my LIFT exercise and need to spend some more time on another interesting topic. I will blog about that as well, and will be back with work on the Smarter Grocery App in the future, so stay tuned!

[Image: LIFT]

Continuous Improvement

A continual improvement process, also often called a continuous improvement process (abbreviated as CIP or CI), is an ongoing effort to improve products, services, or processes. These efforts can seek “incremental” improvement over time or “breakthrough” improvement all at once. Delivery (customer valued) processes are constantly evaluated and improved in the light of their efficiency, effectiveness and flexibility. – en.wikipedia.org

Got you thinking there! “What the h**l does that have to do with your app?!” Well, now that we are closing in on the end of our list of stores to index, less attention is going into scripting the actual store visitors, and more focus is expected to go into quality improvement of the data. You will learn about the why and how later in this blog.

I’ll stop listing the visitor code in the blog posts, because the visitors are starting to become quite repetitive in the way they work. Typically, I repeat the following sequence of events:

  1. Visit the website for the retailer, and open the page with the store locator
  2. In Fiddler, look at how the location data is loaded into the page
  3. Copy a completed visitor that closest matches the data pattern used
  4. Adjust the visitor to match the specific attributes for the brand
  5. Commit and push the new code to the Azure Mobile platform and execute the script to add the stores using the new visitor

Below I will just list the visitors added this week, with their specifics.

The Attent visitor (completed)

As started last week, the Attent visitor has been a pain-in-the-*** from the beginning. After two more days of fine-tuning the 20 km circles on the map to cover the whole area of the Netherlands using accepted postal codes, I got the damn thing running. The trick with the Google Maps API was which interface to use. I first started off with the client-side JavaScript Maps API, which was not appropriate for my heavy usage. After switching to the Server Maps API with an API key, things started to work more smoothly.
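
For reference, the server-side call boils down to something like the sketch below, using Google’s documented Geocoding endpoint. The key handling and response parsing are simplified, and the helper name is my own.

    var request = require('request');

    function postalCodeFor(lat, lng, apiKey, callback) {
        var url = 'https://maps.googleapis.com/maps/api/geocode/json' +
                  '?latlng=' + lat + ',' + lng + '&key=' + apiKey;
        request({ url: url, json: true }, function (error, response, body) {
            if (error || !body || body.status !== 'OK') return callback(null);
            // Pick the postal_code component out of the first result, if present.
            var postalCode = null;
            body.results[0].address_components.forEach(function (component) {
                if (component.types.indexOf('postal_code') !== -1) postalCode = component.long_name;
            });
            callback(postalCode);
        });
    }

    postalCodeFor(52.37, 4.89, 'MY-API-KEY', function (code) { console.log(code); });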

The Troefmarkt visitor

Troefmarkt uses an external site that brings together a number of store brands that I suspect are all supplied by the same retailer. The advantage was that our system had already been prepared for these situations by the introduction of the brand attribute on the Store entity in the information model. A single visitor to the external site can easily identify and index multiple brands, which it now does for both Troefmarkt and Dagwinkel at the same time.

This is what the lekkermakkelijk.nl site looks like:

[Screenshot: lekkermakkelijk.nl store locator]

Indexing the site itself was a similar nightmare to the one above, because it only returns a list of the 9 stores closest to a postal code at a time. For this, we reuse the pattern of the Attent visitor created last week. Nice, but we need to add some validation logic that warns us if the most distant store in the list of 9 is less than 20 km from the center. In that case, there might be more stores in the circle than we got returned. Thoughts for later …
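
That validation could be as small as the sketch below: a standard haversine distance plus a warning when a full page of 9 results still sits entirely inside the 20 km circle. The store objects are assumed to carry lat/lng values.

    // Standard haversine great-circle distance between two coordinates, in kilometres.
    function distanceKm(lat1, lng1, lat2, lng2) {
        var toRad = function (d) { return d * Math.PI / 180; };
        var R = 6371; // earth radius in km
        var dLat = toRad(lat2 - lat1), dLng = toRad(lng2 - lng1);
        var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
                Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
                Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * R * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    function warnIfPossiblyIncomplete(centre, stores, radiusKm) {
        if (stores.length < 9) return; // fewer than a full page: nothing was cut off
        var farthest = Math.max.apply(null, stores.map(function (s) {
            return distanceKm(centre.lat, centre.lng, s.lat, s.lng);
        }));
        if (farthest < radiusKm) {
            console.log('WARNING: result list may be truncated for centre', centre);
        }
    }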

The DekaMarkt visitor

The DekaMarkt visitor is almost a one-to-one copy of the Hoogvliet visitor described in this post. Quite easy to index once you get the JSON structure from the body of the webpage:

[Screenshot: DekaMarkt store locator]

The Coop visitor

Coop has three store recipes that are easily identifiable in the JSON stream: Coop (regular), Super Coop (big) and Coop Compact (small). The brand attribute is used to discriminate between them, because we can expect price differences to apply.

[Screenshot: Coop store locator]

The Deen visitor

Deen is a relatively small brand, with a simple JSON stream feeding all the store locations. Pattern-wise it is comparable to the C1000 visitor described in this post. Nothing more to say; here is the site:

[Screenshot: Deen store locator]

The Boni Visitor

Last for this week, Boni is a twin sister of the Jumbo store locator in the way its data pattern looks. The visitor was therefore made within 10 minutes, without much thinking, and as you can see from the long list of visitors this week, counting up to a total of 3619 stores indexed, it is becoming a repetitive, quick job.

This is what the store locator on their site looks like:

[Screenshot: Boni store locator]

 

Data Quality

As I go through the effort of creating the visitors, I encounter different levels of quality in the way the retailers maintain and expose their store location data. As one of our targets is to provide a high quality data stream, there are a number of points that I want to start working on after next week, when I expect the majority of store indexing to be completed:

  • Fill in the blanks
    A number of sites do not expose all their data in the stream. Some sites are missing postal code or phone numbers. Some are missing the coordinates. We will use the appropriate additional data sources to start filling in the values for these gaps.
  • Duplicate check
    Some of the retailers have duplicate entries for the same store. The name of the store is sometimes subtly different, but the address is exactly the same. We will search our database for such instances and prompt for resolution (a minimal check is sketched after this list).
  • Check position
    Some latitude and longitude data retrieved from the websites feels incorrect. Positions may be rough estimates, or just plain wrong. We will use reverse geocoding to identify suspect instances and correct them.
  • Address validity
    The addresses can be validated using other sources than the retailer. We will check if the address, zip code and city are found in national databases and prompt for further investigation if not found.
  • Store completeness
    As indicated in this post for the Troefmarkt visitor, some of our mechanisms might miss a store or two in their indexing effort. Appropriate warnings need to be fired in case certain boundary conditions are met exposing a risk of missing essential data.
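
As a teaser, the duplicate check mentioned above could start out as simple as this sketch, grouping stores on a normalized address key. The field names follow our information model; the real check will need to be smarter about formatting differences.

    function findDuplicates(stores) {
        var byAddress = {};
        stores.forEach(function (store) {
            // Normalize the address into a grouping key.
            var key = (store.address + '|' + store.city).toLowerCase().replace(/\s+/g, ' ');
            (byAddress[key] = byAddress[key] || []).push(store);
        });
        Object.keys(byAddress).forEach(function (key) {
            if (byAddress[key].length > 1) {
                console.log('Possible duplicate at', key, '->',
                    byAddress[key].map(function (s) { return s.name; }).join(' / '));
            }
        });
    }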

It is my intent to make retailers aware of mistakes in their data, so they can correct them at the source and everybody can benefit. Furthermore, in the near future, we will investigate contributing to the OpenStreetMap initiative, which I feel is a visual counterpart of the Programmable Web that we already use to give away our data freely.

So much to do, so little time. See you next post!

 

 

 

The Pareto Principle

One of the variants of the Pareto Principle is that “Finishing the last 20% of the job is likely to take 80% of your time”. This principle seems to apply to the last stores to index, and the complexity of getting their locations. Last week was a battle, and while I added the Plus retail brand like a breeze this week, the Attent franchise chain is posing new increased challenges again. As the number of stores brought in per visitor gets lower, the effort increases.

The Plus visitor

Plus is one of the bigger brands in the Netherlands, and you can easily identify that by visiting its retail website, which looks quite professional. The store locator for their website looks like this:

[Screenshot: Plus store locator]

Luckily, there is a very easy JSON data feed coming into their search page that allows us to quickly gather their store locations into our database. The visitor is one of the simplest we have created so far:

As can be seen, a simple JSON feed is available that has all the information embedded in it. The big supermarket brands understand the value of easy access to store information. Because: the easier shared, the more copied, the better found, the more customers.

Applying the JavaScript “eval” function demonstrates the power of JavaScript objects as Data Transfer Objects (DTOs) in web applications, because each DTO can simply be brought to life as needed. Note that using the “eval” function is often regarded as “evil” and an anti-pattern, because ANY code injected into our application will just be executed without escaping. But I did want to showcase at least once how easy and powerful it is to do it the wrong way …
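
For completeness, the difference in a nutshell; the payload string here is made up.

    var payload = '{"name":"PLUS Example","city":"Utrecht"}'; // pretend this came off the wire

    var viaEval  = eval('(' + payload + ')');  // executes whatever is in the string - do not do this
    var viaParse = JSON.parse(payload);        // only accepts data, throws on anything else

    console.log(viaEval.name, viaParse.city);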

The Attent visitor (Work-In-Progress)

The challenge with the Attent store locator is directly visible when you take a look at their store locator functionality:

[Screenshot: Attent store locator]

The site only allows searching for stores within a radius of 20 km around a city or postal code. And because only around 100 stores are expected, how do we find them all safely? Certainly, not every position searched will give a hit.

This needs a form of “intelligent brute force”, a pattern that we have already used for the Aldi visitor, but this time on a more local scale, because the search range is very limited. In preparation, we need to find out which cities or postal codes to enter in a query to get 100% coverage of the Netherlands. For this, I wrote a quick JavaScript that does the following (a rough sketch of the circle step follows the list):

  1. Calculate the smallest rectangular bounding box around the Netherlands,
  2. Within the bounding box, iterate over the latitude and longitude in such a way that the areas of circles with a radius of 20 km cover all of the Netherlands,
  3. For each circle’s central latitude and longitude, get the closest matching postal code,
  4. For each valid Dutch postal code returned, perform a scripted search on the site.
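
The circle step could look roughly like the sketch below. The bounding box numbers are approximate, and the grid spacing of radius × √2 is what makes neighbouring 20 km circles overlap enough to leave no gaps.

    var RADIUS_KM = 20;
    var box = { latMin: 50.75, latMax: 53.55, lngMin: 3.35, lngMax: 7.25 }; // rough NL bounding box

    function circleCentres(box, radiusKm) {
        var latStep = (radiusKm * Math.SQRT2) / 111;                        // ~111 km per degree of latitude
        var centres = [];
        for (var lat = box.latMin; lat <= box.latMax; lat += latStep) {
            // Longitude degrees shrink with latitude, so recompute the step per row.
            var lngStep = (radiusKm * Math.SQRT2) / (111 * Math.cos(lat * Math.PI / 180));
            for (var lng = box.lngMin; lng <= box.lngMax; lng += lngStep) {
                centres.push({ lat: lat, lng: lng });
            }
        }
        return centres;
    }

    console.log(circleCentres(box, RADIUS_KM).length + ' circle centres to geocode');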

I did finish step 2, which gives a picture like this:

[Image: 20 km coverage circles over the Netherlands]

Now I need an automatic mechanism to give me back the postal codes for the circle centers … at least once. I tried to use Google Maps for that, but I seem to be overrunning my quota earlier than expected. Maybe too many requests at the same time; I need to figure out why it only returns information for some of the queries I perform.

At least one lesson learned already this week is that, while NodeJS is completely asynchronous and we’re supposed to write our code to leverage that to the fullest, the websites we’re visiting do not like to be stressed that much to return the information. Hence, a good timeout inside a closure is an inevitable, even valuable, asset in your toolkit when applying NodeJS to other, non-NodeJS web interfaces.

Next post we will likely be finishing the Attent visitor, and hopefully one or two more, which puts us behind schedule by a week … because I forgot about the golden 80/20 rule to start with …

Grüße aus Deutschland

This post will discuss the two main German supermarket brands with stores in the Netherlands and their specific ways of locating their stores. It took a whole week to index their store locations; here’s why.

The Aldi visitor

Until now, the Aldi site has been the most difficult to crawl and index. There seem to be a few reasons for this:

  • They do not handle their store locations themselves. Instead they use a third party to host their store locator.
  • The store locator only allows finding a number of stores within a radius around a certain geographical point, like a city or postal code. There is no way to get all stores in one go.
  • The found stores are provided in a paged system.
  • Stores are returned regardless of country borders, so also from Belgium and Germany.
  • The third party and the retailer themselves mainly work from Germany, and the mechanisms and content seem to be largely focused on that area, with the Netherlands being poorly supported in an attempt to reuse the same mechanism.

This is what the main store locator page looks like:

[Screenshot: Aldi store locator]

As said, our crawler needs to be quite smart to be able to index all store locations. It needs to use geographical locations with a certain range around them to index a region, and do that until the entire area of the Netherlands is covered. For each area, it needs to know how to deal with the paging mechanism that provides 5 locations at a time. And finally, it needs to check that a store is actually located in the Netherlands before it is added to the index.

[Image: coverage map]

To start off with, through trial and error, this is the coverage map of the Netherlands used to index the stores. The topmost area somehow allows a radius of 100 km to be indexed around Leeuwarden. For the remainder, there are 50 km circles located around Amsterdam, Utrecht, Apeldoorn, Rotterdam, Goes, Breda, Eindhoven and Maastricht. This gives full coverage of the Netherlands, but we will need to check the border areas at locations where circle intersections or regions lie close to the border. We might be missing stores in those edge cases.

The visitor implementation looks like this:
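
The embedded code is not reproduced in this text, so here is a rough sketch of the flow the next paragraph describes. The third-party URL, query parameters, selectors and postal codes are placeholders, not the real ones.

    var request = require('request');
    var cheerio = require('cheerio');

    // Placeholder city-centre postal codes for the coverage circles.
    var postalCodes = ['8911', '1012', '3511', '7311', '3011', '4461', '4811', '5611', '6211'];

    function crawlArea(postalCode, page) {
        var url = 'https://storelocator.example.com/search?postcode=' + postalCode +
                  '&country=NL&page=' + page;                   // placeholder URL
        request(url, function (error, response, body) {
            if (error) return;
            var $ = cheerio.load(body);
            $('div.store').each(function () {                   // placeholder selector
                var country = $(this).find('.country').text().trim();
                if (country !== 'Nederland') return;            // skip Belgian and German stores
                console.log($(this).find('.name').text().trim());
            });
            // The "next" button loses its link on the last page; only keep paging while it is there.
            if ($('a.next[href]').length > 0) crawlArea(postalCode, page + 1);
        });
    }

    postalCodes.forEach(function (code) { crawlArea(code, 1); });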

We can see the midline postal codes in the array. These will be iterated. For each postal code, the search area is set and the result pages are iterated. When at the last page (detected by the next button missing a link to the next page), the number of iterations is counted. Once the last postal code has been processed, the store index is synchronized with the Azure tables. Note that the phone number and the store location coordinates are not indexed. They are not available, since the map drawn is based on non-GPS coordinates. We need to fill these in later on by other means.

The Lidl visitor

Lidl is the second brand from Germany, and coincidentally or not, it has exactly the same characteristics as the Aldi store locator, with some of them even worse. This is what their page looks like:

[Screenshot: Lidl store locator]

Luckily, there is an easy workaround for this site. On another page, it allows you to download Point Of Interest (POI) files for different car navigation systems. These files contain all the information fields we need in a structured format. And because they generate and host these files themselves, they comply with our criteria for high quality data, coming from the retailer directly.

However, dealing with these files brings in new logic to our playground; the file unzipper! This is what the Lidl visitor looks like:
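
The embedded gist is not included in this text either; the sketch below illustrates the approach that the next paragraph describes, with a placeholder download URL.

    var http = require('http');
    var AdmZip = require('adm-zip');

    var dutchPostcode = /\b\d{4}\s?[A-Z]{2}\b/;           // e.g. "1234 AB"

    http.get('http://example.com/lidl-poi.zip', function (res) {   // placeholder URL
        var chunks = [];
        res.on('data', function (chunk) { chunks.push(chunk); });
        res.on('end', function () {
            var zip = new AdmZip(Buffer.concat(chunks));  // treat the buffered body as a zip file
            zip.getEntries().forEach(function (entry) {
                var text = entry.getData().toString('utf8');
                if (!dutchPostcode.test(text)) return;    // not the file with Dutch entries
                console.log('Dutch POI file:', entry.entryName, '-', text.split('\n').length, 'lines');
            });
        });
    });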

The requestor of the file uses the more low-level http library to get the resulting body into a buffer. This buffer is processed as a zip file by the NodeJS adm-zip package. For each file found in the zip file, it tries to identify a file with entries complying with the Dutch postal code format, using a regular expression. The phone number is missing, so that needs to be added at a later time as well. But for now, this last visitor brings our total of indexed shops close to our targeted 3000 for this week.

This week’s stats

We’ve been online with this list for a week now through the Programmable Web. It took a few days to get the API approved, and then I needed to extend the API description in a bit more detail before it came online at the end of the week. Initial results show that the API itself is not interrogated that much yet, but the blog about how it is conceived gets a lot of new hits, because it has been indicated as the primary blog on the API. Fun, isn’t it …
[Screenshot: New Relic traces]

The throughput of the NodeJS-powered Azure Mobile Services is quite compelling. And we’re still on the free variant of these services; we haven’t needed to pay a single dollar to get it up and running so far. Performance is good and sizing seems sufficient for now. I do expect to fit all our 4000 stores in under the limit using the information model provided.

That’s it again for this post. Next week we’ll see how far we get with indexing stores, and we need to put some additional effort into improving the completeness of information for the German stores. ‘Deutsche Gründlichkeit’ may be a reality for car manufacturing, but I didn’t find the data for the Aldi and Lidl stores to have the level of quality and completeness that I need. But hey, that makes our data feed more valuable if we add that to the equation, doesn’t it?

Gentlemen, start your engines!

This week we doubled the number of stores available, published the API at the Programmable Web, and hooked up New Relic to do some measurements on usage.

The Albert Heijn visitor

Albert Heijn is probably the largest retail supermarket brand in the Netherlands. They have several store recipes depending on location and size. This is what their store locations page looks like:

[Screenshot: Albert Heijn store locator]

The page has a single feed that gives access to a list of almost 1000 shops. The list is embedded in simple readable HTML format, and can be easily parsed using Cheerio:

Nothing much to say, just some regular selectors, and string manipulation routines that give us a data representation consistent with previously created visitors.

Publishing the API

Now that the data model is stabilized and we are gaining some volume (we’re at almost 2000 entries, half of what we expect it to be in the end), it’s time to start publishing our service interface and let the world know we’re here. Well, for testing purposes at least.

The de-facto standard for publishing and finding public service APIs on the net is the Programmable Web site. Registering a new API is a matter of filling in a single-page form:

[Screenshot: Programmable Web registration form]

After filling the form, the API needs to be ‘approved’, which means it is not available directly after registration. I’m not sure what happens during the review, but I guess some person will try to connect to the service using the URL and instructions provided. No problem, we’re ready for it, and we’ve opened up access to the data in Azure Mobile services:

[Screenshot: Stores table permissions]

I will keep the application key secret for private use. This allows my scripts to be the only ones able to make changes to the tables. Public access is read only.

Getting some metrics

An important part of publishing the service API for public consumption is knowing how popular its usage is. Certainly in a cloud-based environment where, in the end, I will need to cough up the money if consumption limits are exceeded and usage goes through the roof. I don’t expect that to happen that quickly, but better safe than sorry.

My favorite ‘light-weight’ metric and performance tool for services is New Relic. It provides a nice analytics dashboard in the cloud. Support for Azure is built in, and activation is clearly described in this article.

[Screenshot: New Relic dashboard]

Now we can follow the number of requests made to the published service APIs and the duration each request takes. Quite elementary, and the ease of activation makes it a low investment to get started. We’ll look at some more detailed metrics acquired in the coming week in the next post.

My plan for next week is to increase the number of available store locations in our database with another 1000 stores. And if we set the same target for the week to follow, we are done with this exercise in two weeks from now, after which we can start to operationalize our store finder app as a first step in achieving our Smarter Groceries experience.

Facts of Life

Due to circumstances, progress has been a bit slow this week. I did improve the NodeJS implementation stack considerably, and added the Digros / Bas / Dirk visitor as promised. But there was not enough time to create the log email sender.

The Digros / Bas / Dirk visitor

In a previous post we already created a site crawler for the Digros / Bas / Dirk retail brand. But this was in C#. I moved it over to the NodeJS part of the universe, like this:

The structure of the visitors is a little simpler because of modifications to the StoreClient. Essentially, the visitor is more and more becoming a plugin to a crawler framework, only implementing overridable functions.

Furthermore, we see the first application of the Cheerio library, which helps us parse and interpret the DOM of webpages with jQuery-like selectors. The page we’re indexing here is more structured, and lends itself to applying Cheerio effectively.

This exercise provides a nice opportunity to compare the C# code in the earlier post that piloted this visitor, with the new JavaScript code. The port demonstrates that the constructs can be kept quite comparable, and only language details need to be massaged a bit.

Azure Mobile Data and Logging

I thought it would be nice to share some of the status we have currently achieved on the store crawlers. Let’s do that with some screenshots, easy and factual.

The database currently contains the location information for 555 stores. Compared to our estimated end total, this is about 15% of the total data capacity. Just a quick glance at part of the data:

[Screenshot: database contents]

The logging for the nightly job that is crawling the sites is shown right here:

[Screenshot: nightly job logging]

We can see what changes to the database are made for each visitor syncing its content, both in summary format and in detailed format for the changed stores. Just enough not to clutter the logs. We just want to push this info to the admins as an email, and then we’re done.

Now that the scripts are starting to stabilize and become more maintainable, and the effort of adding a new visitor decreases, we will index some of the larger retail brands in the Netherlands to increase the volume of our database. Also planned is to investigate how we can register our data at the Programmable Web site. I hope to spend some time on that real soon, and share our effort with the world!

 

Been there, done that, got the T-shirt …

So, how was your week? Added a few visitors, improved the way the storage is accessed, and as a bonus fooled around with WebJobs. Let’s dive in!

My Store Client

One of the things I was complaining about was that the actual storage of data could only be performed if my script was running in the Windows Azure cloud. This is because the scripts were using the global ‘tables’ object to access the data store, and that one is only available on the server side, not on my development machine. This means a nasty round trip in order to test and troubleshoot the actual storage part of the solution. But not anymore …

The Windows Azure Storage is accessible through the Mobile Services API; that’s the whole point of Windows Azure Mobile Services, duhhh. So why not have our own visitor scripts also use that API to work with the data? For that purpose, I created a ‘store client’. It’s a piece of JavaScript code that performs the CRUD operations on our ‘Stores’ table through the Mobile Services API. Out of the box, the API supports the OData protocol, so our client only needs to be a thin wrapper around a NodeJS package that does OData. This is what it looks like:

So … what are we looking at? Starting with the ‘odata-cli’ NodeJS package, which gives us a beauty of an OData client. This is what our CRUD operations AddStore, GetStore, UpdateStore and DeleteStore use to communicate. The ‘onBeforeRequest’ function is injected as the request handler for the purpose of assigning my super-secret application key to each request, by setting the X-ZUMO-APPLICATION value in the header.
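
The gist itself is not reproduced in this text, and I won’t guess at the exact ‘odata-cli’ surface here. Purely as an illustration of the same idea, a thin wrapper talking straight to the Mobile Services table endpoint (which is what the OData client addresses under the hood) could look like the sketch below; the service URL, the key handling and the function bodies are assumptions.

    var request = require('request');

    var TABLE_URL = 'https://myservice.azure-mobile.net/tables/Stores';  // placeholder service name
    var APP_KEY = process.env.ZUMO_APPLICATION_KEY;                      // keep the secret out of the code

    // The 'onBeforeRequest' idea: every call gets the application key in its header.
    function zumo(options, callback) {
        options.headers = { 'X-ZUMO-APPLICATION': APP_KEY };
        options.json = true;
        request(options, callback);
    }

    exports.AddStore    = function (store, cb) { zumo({ method: 'POST',   url: TABLE_URL, body: store }, cb); };
    exports.GetStore    = function (id, cb)    { zumo({ method: 'GET',    url: TABLE_URL + '/' + id }, cb); };
    exports.UpdateStore = function (store, cb) { zumo({ method: 'PATCH',  url: TABLE_URL + '/' + store.id, body: store }, cb); };
    exports.DeleteStore = function (id, cb)    { zumo({ method: 'DELETE', url: TABLE_URL + '/' + id }, cb); };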

Also part of this new client is my store definition and the code to synchronize the stores found during crawling with the stores already existing in the database. The code for comparison is already a bit more condensed, and the public objects and functions are exposed through a nice module interface. Not yet as good as it can get, but we’re improving. This module provides a central reusable artifact for any other code that wants to interact with the ‘store’ entity in our information model. Nice!

The Hoogvliet visitor

The first one to benefit from the Store Client is the brand new Hoogvliet visitor. This is what their page looks like:

[Screenshot: Hoogvliet store locator]

And this is the code that crawls it, nice and compact:

Nothing much to say, or excitement going on. Just some JavaScript object embedded in the page that contains the data we need to fill our store object. Been there, done that, got the T-shirt …

A WebJob Jumbo crawler

[Screenshot: Jumbo store locator]

Based on a nice post that Scott Hanselman published on his blog on Windows Azure WebJobs, I just wanted to take a look at what it is compared to the Mobile Services Scheduler and NodeJS. An amazing experience! Why? The simplicity of it! All you have to do is create a command-line application, push it to a Windows Azure Website in a zip file and it is ready to rock and roll. So productive to be able to use C# and .NET again, after experiencing the newbie pain with asynchronous JavaScript in NodeJS. It was really a breeze. Here is the code:

While simple, a new crawl pattern is encountered. This site has some of the store details exclusively available through dedicated pages for each store. So after having indexed the number of stores and some location information, each individual page for each store needs to be retrieved and crawled to get the proper information to fill the store object. Well, something we would have had to deal with sooner or later. But not really a showstopper …

Again, WebJobs are a very simple and productive way to extend websites with scheduled jobs executing on the backend. But for our little pet project, it also brings in a dependency on Azure Websites again, because that’s where the scripts are hosted. And we’ve just put in a nice piece of effort to make the storage scripts support off-server development. So we’ll consolidate the Jumbo crawler, and our previously created Digros / Bas / Dirk C# crawler, onto the NodeJS platform as well.

NodeJS tools for Visual Studio

Just before we ‘call it a week’, I wanted to extend my thanks to Scott Hanselman again, also for another post on the NodeJS tools for Visual Studio. This tool set allows you to debug NodeJS scripts on a development system as if they were native code, with all the functionality of setting breakpoints, step-by-step execution, variable value evaluation and all the other good stuff we are used to in Visual Studio. Great job guys! Now I’ve got no good reasons left anymore NOT to continue on this platform 🙂

Next post, we’ll see the C# crawlers ported to JavaScript and attached to the store client, and have some better consolidated logging to improve the monitoring in the Windows Azure console. And if there is time left, we’ll throw in some email sending capabilities to notify us when changes are happening during synchronization. Go, go, go!

A Parallel Universe

Just finished the postal code look-up using Google Maps Geocoding. I learned a lot about NodeJS and asynchronous programming, added another shop visitor and modularized the code a bit to make parts reusable.

Adding another visitor

The visitor added this week is for a retailer called C1000. It has 264 store locations to index, and the page crawled looks like this:

[Screenshot: C1000 store locator]

 

Well, actually looking at the page source code, it seems to be fetching the store locations asynchronously using AJAX calls to a JSON stream. So after fiddling around a bit, these are the only two interfaces to visit:

  • http://www.c1000.nl/webservices/c1000/wswinkelservice.asmx/GetGeoDataForAllWinkelsInJson
  • http://www.c1000.nl/webservices/c1000/wswinkelservice.asmx/GetNawForSpecifiedWinkelsInJson

The first stream delivers an id and location data for all available store locations. The second stream provides more specific info for a list of store ids. Since there does not seem to be a limit on the number of stores in the list passed to the second call, we just need to call the first stream, get the ids of all stores into a single array, and then retrieve the detailed information for all stores in one call using the second stream. A bit of JSON parsing, and we’re good to go!

For storing the new visitor’s data we want to use the same table as our first visitor, and we can re-use the script made for the first visitor for the same purposes. Since Windows Azure Mobile Services uses NodeJS as its core server scripting environment and has the RequireJS library preloaded, we can use this module dependency loading mechanism to modularize the code. I did this by moving the table storage part of the code into a separate JavaScript file (representing a module), like so:
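
The gist is not included in this text; based on the notes below, a condensed sketch of the module’s shape might look like this (the Exec parameters are my own guess):

    // On Azure the global 'tables' object exists; on a local development run it does not.
    var storeTable = typeof tables !== 'undefined' ? tables.getTable('Stores') : null;

    exports.Exec = function (visitorName, foundStores, log) {
        if (storeTable === null) {
            log('No table available (local run) - skipping synchronization');
            return;
        }
        // Only compare against the rows that belong to this visitor's brands.
        storeTable.where({ visitor: visitorName }).read({
            success: function (existing) {
                // ... compare 'existing' with 'foundStores' and insert/update/delete accordingly,
                //     leaving the zipcode untouched when the freshly crawled value is null.
            }
        });
    };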

Few specific things to mention here that differ from the previous posts’ version of the same functionality:

  • The ‘exports.Exec’ function is the signature used for the public module functions that are exposed. That way we are discriminating the internal from the external functions in the way it is supported by RequireJS.
  • The script checks whether the storeTable variable is other than null. In case I run my script locally on my development system, the database is not available, though I still want to be able to debug the script. This is a poor man’s solution to keep the script from breaking locally when debugging.
  • The console logging statements have disappeared. This has to do with some weird symptoms I was seeing due to the asynchronous nature of JavaScript parsing by NodeJS. The script was executing unstably, sometimes doing (logging) stuff, sometimes not, which I thought was caused by the logging itself, but it was not … It was just STUPID me not understanding in enough detail how NodeJS takes asynchronous JavaScript processing to the next level. I will bring back appropriate logging in the next version.
  • The read statement on the table is prepended with an additional where clause that filters on a new column called ‘visitor’. This new column is needed to understand which visitor is indexing which store brands. As told before, some retailers use different combinations of end-consumer visible store concepts, while their websites are shared. For example, the DetailResults group has a single website for Bas van der Heijden, Dirk van den Broek and Digros. While syncing the visitor results with the table, it needs to know which rows to sync with.
  • The zipcode check has an additional criterion of not being null. Because the retailer’s website does not list the zipcode, the value is null during the sync exercise. Now, if we correct the zipcode by other means in the table, we don’t want it to be overwritten by the visitor with a null value again.

The C1000 visitor itself looks like this:
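
Again the gist itself is not reproduced here. A sketch of the two-call flow, including the quirks the notes below mention (stringified JSON, fixed precision, the message array), could look like this; the field names and the parameter shape of the second call are assumptions.

    var request = require('request');

    var base = 'http://www.c1000.nl/webservices/c1000/wswinkelservice.asmx/';
    var message = [];                                    // joined and sent back to the browser at the end

    request(base + 'GetGeoDataForAllWinkelsInJson', function (error, response, body) {
        if (error) return;
        var geoData = JSON.parse(body);                  // the stream is stringified JSON
        if (typeof geoData === 'string') geoData = JSON.parse(geoData);  // some ASMX endpoints double-encode
        var stores = geoData.map(function (winkel) {     // field names assumed
            return {
                id: winkel.id,
                latitude: parseFloat(winkel.lat).toFixed(5),   // fixed precision, so syncs don't flag changes
                longitude: parseFloat(winkel.lng).toFixed(5)
            };
        });
        message.push('Found ' + stores.length + ' C1000 stores');
        var ids = stores.map(function (s) { return s.id; });
        // Second call: details for all ids in one go (parameter shape is a placeholder).
        request({ url: base + 'GetNawForSpecifiedWinkelsInJson', qs: { winkelIds: ids.join(',') } },
            function (err, res, detailBody) {
                if (err) return;
                var details = JSON.parse(detailBody);
                message.push('Retrieved details for ' + details.length + ' stores');
                // ... merge details into 'stores', synchronize with the table,
                //     then send message.join('\n') back via the response.
            });
    });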

Just a few notes of attention here as well:

  • The stream received from the website is not properly encoded, that is to say, it is JSON pushed over the line in a stringified format. You often see this behavior with old-fashioned ASMX web services pushing out JSON. But it is no real problem for the JSON parser preloaded in NodeJS, if the parse function is used.
  • The position information of the locations (latitude and longitude) is fixed to 5 decimals behind the separator. This is done to overcome the problem that the visitor gets a ‘real’ floating-point value, which is not stored as such by the database. The resulting effect is that every sync job will mark the location as changed, because the precision of the number in the database differs from the freshly retrieved values. Fixing the precision ‘fixes’ this behavior.
  • The ‘message’ variable is an array into which all the log statements are pushed. In the end, I want my logging to be pushed to the browser that invoked my resource instead of to the logging console of the Azure platform. The intention is to join the array elements at the last point before the preloaded Express library’s send command is called, in order to get the final output stream to the browser. Note that Express does not support calling the ‘write’ or ‘end’ functions on the response object. We can only call send one time to construct the final output page, instead of rendering it progressively using the write and end functions.

Give me some proper postal codes … PLEASE!

I did a small inventory of Dutch services that would enable me to get a zipcode returned for an address, in order to fill in the missing values for the EMTÉ visitor. I found some dubious ‘free’ sites that would allow me to download such information after registration, but in the end it was not free at all, only a kind of preview. Luckily, our friends at Google Maps came to the rescue again.

Google Maps has a Geocoding API that does sort of what we want. By providing a partial address, Google returns the full address (including postal code) it has found. There is a restriction that this service is only allowed to be used for the purpose of displaying the data points on a Google Map. In fact, it is my intention to do that with the data, but not directly; I will first cache it in my database (I feel comfortable enough that this complies with the rules). Furthermore, the daily limit is about 2500 searches, and there is a rate limit in place as well. And of course during development using NodeJS I encountered and was blocked by both, because the number of develop, run, test cycles with JavaScript is large, and the async nature of NodeJS actually pushed out all queries in parallel, making it feel to Google like a DOS attack at some point in time, which was certainly not my intention ;-).

See the postal code visitor below:
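
The gist is not included in this text; a sketch of the throttled, recursive loop described in the notes below could look like this. The entries list and the geocodeAddress helper are assumed to exist, the latter wrapping a Geocoding call like the one shown in a later post.

    // 'entriesWithoutZipcode' is assumed to hold the stores with a null zipcode, read beforehand;
    // 'geocodeAddress' is a hypothetical helper around the Google Geocoding API.
    function ValidateEntries(entries, index) {
        if (index >= entries.length) return;              // done: every entry has been processed
        var entry = entries[index];
        geocodeAddress(entry.address + ', ' + entry.city, function (zipcode) {
            if (zipcode) {
                entry.zipcode = zipcode;                  // ... and update the row in the table here
            }
            // Reschedule ourselves: one call per 200 ms keeps the rate limit happy.
            setTimeout(function () { ValidateEntries(entries, index + 1); }, 200);
        });
    }

    exports.get = function (request, response) {
        // Kick the loop off via a timeout and answer the browser straight away.
        setTimeout(function () { ValidateEntries(entriesWithoutZipcode, 0); }, 0);
        response.send(200, 'Zipcode validation started');
    };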

And for this section the specifics are:

  • ‘ValidateEntries’ is a recursive function called for every entry that has a null value for zipcode in the database. The script is initially triggered by the exports.get function, which then kicks off the loop by means of a timeout. The closure concept may seem a bit superfluous here, but trust me, without it your code will go haywire when trying to use the setTimeout function in the appropriate way.
  • setTimeout is used to relieve the stress from the Google Maps API by lowering the frequency to a call every 200 milliseconds. Because ValidateEntries is recursive, and uses the same mechanism to reschedule itself, use of setInterval is not needed and the single-shot setTimeout is enough.

After thoughts

The more general lesson learned this week, and it was a hard lesson, is that the impact of an asynchronous programming model on your developer skill-set is comparable to that of moving from functional programming to object-oriented programming. The learning curve and training effort that such a paradigm shift requires are not to be underestimated. I typically like to do that as part of my work, because the context brings in realistic scenarios, and I can focus on resolving those with ad-hoc learning spikes. But for this one … a bit of training would have enabled me to do more functional stuff last week.

Async programming is like a parallel universe; quite the same, but quite different …

Let’s bring in another visitor as a next exercise and see if we can tidy up the modules a bit further … in the next post!

Sneak Preview

It’s been a week since the last post on crawling retailers’ websites to index their store locations, and what a week it has been! I’ve been spending 2 to 3 hours almost every day, falling in love with Azure Mobile Services. Well, sort of. Initially, we planned to only store the data in Azure. But after having read about the Azure Mobile scheduler service that is currently in preview, I couldn’t resist the temptation to play around with it. In this post, I’ll tell you what I learned last week.

The Azure Mobile Scheduler service

When we are going to create a crawler to synchronize and validate our high quality store locations, we need to have a way of automatically performing that process on a regular basis. We’re not going to keep 4000 data items up-to-date manually. So if we’ve got a script that does the indexing and synchronization, it needs to be executed once in a while. That is exactly what the Azure Mobile Scheduler service is intended for.

We created our script in ASP.NET and can expose its interface through a website. The scheduler service can be configured to retrieve that website page (read: kick off the crawling exercise) on a regular basis. But the scheduler can also be given the entire script to index store locations itself. The benefit of doing that is that all service operations can be executed autonomously in Azure Mobile Services. We don’t need to host the ASP.NET website in order to get our stores indexed. The downside is that we cannot reuse our C# code, since Azure Mobile Services scripts are essentially NodeJS JavaScript.

“No problem, I need to dive into NodeJS anyway to understand what it is and find out how to utilize it best. And I’ve sort of got the hang of JavaScript programming already, so what the heck … Let’s go for it” (famous last words).

The first evening was spent on installing NodeJS on my development system, getting some essential packages downloaded for the task at hand (like request and cheerio), getting the Azure Mobile Services to start using a Git instance for versioning the scripts, hooking it up to my Visual Studio instance, and testing the complete development round trip. Works like a charm, only we’re one day further, and no value added to our solution.

The next evening I essentially made another visitor for another retailer brand, called EMTÉ. This is what their store location page looks like:

[Screenshot: EMTÉ store locator]

This page contains all the information we need according to our information model’s definition of store properties, except for the zip code of the location. We’ll ignore that one for now, because we can easily retrieve it later on using the address and city of the location.

The EMTÉ visitor

We essentially want to do the same as the previous visitor we made, but take it a step further:

  • Retrieve the webpage
  • Index store locations
  • Synchronize the found locations with a database

The first part was rather easy. Defining our Store object and reading the webpage in NodeJS JavaScript goes something like this:
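
The embedded code is not reproduced in this text; a minimal sketch of what is described in the next paragraph (with an assumed URL) would be:

    var request = require('request');

    // A fully qualified Store constructor (not strictly needed, as noted below).
    function Store(name, address, zipcode, city, latitude, longitude, phone) {
        this.name = name;
        this.address = address;
        this.zipcode = zipcode;
        this.city = city;
        this.latitude = latitude;
        this.longitude = longitude;
        this.phone = phone;
    }

    request('http://www.emte.nl/winkels', function (error, response, body) {   // assumed URL
        if (error || response.statusCode !== 200) {
            console.log('Could not retrieve the store locator page');
            return;
        }
        console.log('Page retrieved, ' + body.length + ' characters to parse');
    });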

We indicate that we require the request package in our script at the top. Then we create a Store object in JavaScript with a fully qualified constructor (which we are actually not using in this example anymore at all, waste of time :-|), and finally we retrieve the webpage and check if all went fine.

Next up is indexing the store locations from the website. At first I was planning on using cheerio to parse the DOM, like we did with the previous visitor in ASP.NET. But the webpage with store information uses so much unstructured JavaScript to actually hold the store locations themselves that it did not make any sense to try to get that working effectively, so we resorted to old-fashioned string parsing habits. The draft copy-paste code that I’ve got working looks something like this:

Yes yes, I know. It hurts my eyes too. Variables are not properly scoped, there is dumb usage of string manipulation, and it can probably be done with one-liners if we use the correct regular expressions. But I just wanted it to work! Polishing can be done later, when we’re entering the production phase. Essentially, the routine is eating the page content and spitting out the information we are interested in. It gives us 129 store locations in the end.

Finally, storage. Or better said, store synchronization. We want the script to update existing entries if data has changed, insert new stores not yet found in the database, and remove stores from the database that are no longer found on the webpage. And all with some proper change logging going on, to understand what changed and why. This is the purpose of the following piece of script:
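
The actual script is not included in this text; a compact sketch of the synchronization steps walked through below could look like this (the key on name plus address and the property list are assumptions):

    function synchronize(existingStores, foundStores, log) {
        var keyOf = function (s) { return s.name + '|' + s.address; };
        var foundByKey = {};
        foundStores.forEach(function (s) { foundByKey[keyOf(s)] = s; });

        existingStores.forEach(function (db) {
            var fresh = foundByKey[keyOf(db)];
            if (!fresh) {
                log('Removed: ' + db.name);              // no longer on the page -> delete
                // storeTable.del(db.id, ...)
            } else {
                ['zipcode', 'city', 'latitude', 'longitude', 'phone'].forEach(function (prop) {
                    if (fresh[prop] !== null && fresh[prop] !== db[prop]) {
                        log('Changed ' + prop + ' for ' + db.name);
                        db[prop] = fresh[prop];          // storeTable.update(db, ...)
                    }
                });
                delete foundByKey[keyOf(db)];            // mark as handled
            }
        });
        // Anything left over was found on the page but not in the database: new stores.
        Object.keys(foundByKey).forEach(function (key) {
            log('Added: ' + foundByKey[key].name);       // storeTable.insert(foundByKey[key], ...)
        });
    }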

After cleanup, most of this code will be moved to a re-usable function that will be delivering one-liners in the main script … but not right now. For each entry in the database, the script checks if the equivalent entry is still found on the webpage, and if, compares each of its properties. If anything changed, it gets updated in the database. If the database entry is not found on the page, it is removed from the database. Any entries on the page that have not been used in checking the database entries are new stores, and therefore need to be added.

The Sneaky part

Creating the complete script as such took about 2 evenings. The other three evenings were spent on troubleshooting and working around Azure Mobile Services “issues”. Well, not really issues, maybe, if you know how to do all of this in the appropriate way from the start. But learning while using takes longer. For example, the Preview status of the Scheduler service is not just there because it looks pretty on the new capability. Some things are just not so stable yet. If the script is screwed up, any further updates to the script might not actually execute anymore, and one needs to “reset” the Mobile Services by toggling the Dynamic Schema flag. It takes a few hours to figure out that one, with some help from Jeff Sanders!

And the logging does not always give clear hints about what is actually wrong with a script. It took me some time to figure out that you cannot use a JavaScript object as a dictionary if it is based on an Array type; using an Object-based dictionary was the solution. Finally, the script had to be edited and checked on the Azure Management portal itself, since the way the script gets and uses the tables for storage cannot be used on a development machine with Visual Studio at hand. That means: edit, save, run (since JavaScript is run-time interpreted), go to the main page, open the log, check the result, find the problem, go to the editor again, etc. That round trip is just awful. I will change the direct table access into service API data storage to be able to run the script on my local system, while the data is persisted in the cloud.

Rather long post, but a lot has happened over the week. Next week I will attempt to get the missing zip codes in place by searching for an online service that can provide me that information. And I’ll also try to make the script a bit more re-usable and maintainable, so that one or two more retail website visitors can be indexed using the same code as well. That should keep me busy!

Sweet dreams …

KISS the DOM interpreter

This is the twelfth blog post, and the last in a rapid fire of LIFT supported blogging exercises as prescribed in the first post. I will continue to blog on achieving the Smarter Grocery Store App, but as the Christmas holiday is coming to an end, and my full-time day-job is about to start again, the frequency of publication will be lower. Less quantity, more quality is the intention, so stay tuned!

Crawling store locations

In an attempt to get a high quality list of store locations available, we’ve planned to create some site visitors that will index store locations from the retailers’ websites. I’ve created one visitor as an example of how to do such a thing. We’ll look at the ASP.NET C# code that does the job nicely. We’re still in the pilot stage, so the code quality is not brilliant, but it makes us aware of what we need and how easy it is to achieve. We have targeted three brands, which share the same retailer organization and a common website. These are Digros, Dirk van den Broek and Bas van der Heijden. The page we will index is https://www.lekkerdoen.nl/winkels. Screenshot below.

[Screenshot: Digros / Dirk / Bas store locator]
The list at the bottom is 101 stores long. It contains all the information our information model’s Store entity properties require. I fetched a local copy of the HTML file for development purposes. You can fetch it yourself if you want to review the page source in detail, using the link above. I will only highlight the sections relevant to us. Our code uses two libraries, imported using NuGet into our Visual Studio project:

  • HtmlAgilityPack – Purposed to do just what we want, interpret the Document Object Model (DOM) of an external HTML page.
  • JSON.NET – Work with JavaScript Object Notation (JSON) objects in C# code.

Our code begins with loading the page into a document using the HtmlAgilityPack HtmlWeb object:

Once loaded we find and deserialize the JSON data embedded in the page that contains store details, including store location in latitude, longitude format and the specific brand the store is, e.g. Digros, Dirk or Bas. In serialized format the string looks like this:

We find the node with the following piece of code:

Then we extract and deserialize the JSON object like:

There is no need to perform defensive programming. If the page changes structure, I want my script to break and throw an exception. This way I am putting the least amount of effort into the visitor itself and will still be aware of pages that have changed. It doesn’t yet need to be smarter than that! This gives us store details in an accessible format. Next up: the summary information visible on the page to the customer. We will use this as the main means of iteration and indexing, and inject detailed information as required from the prepared JSON array. First let’s see how a store summary result is formatted on the page:

Pretty straightforward, isn’t it? There are 101 entries like this on the page. Because of this repeated and consistent structure, parsing is super easy. First we find all DIVs with a class attribute of result, and we prepare a list of store objects whose properties we will define as we go.

Then we loop all found result nodes and parse their content to fit our formatting rules for the store properties:

At this point, we incorporate reading the detailed information in the loop:

And that essentially does the trick! Our store entity used essentially became:

Very simple, yet pretty complete already. Again, no effort spent on getting a high quality display in place; this just proves that it is easy peasy. And we can also do this for the other retailers’ websites exposing store location information. The next blog post will describe how we can publish this data using Azure Mobile Services. Fun fun fun!