Sneak Preview

It’s been a week since the last post on crawling a retailer’s website to index their store locations, and what a week it has been! I’ve been spending 2 to 3 hours, almost every day, falling in love with Azure Mobile Services. Well, sort of. Initially, we planned to only store the data in Azure. But after having read about the Azure Mobile scheduler service that is currently in preview, I couldn’t resist the temptation to play around with it. In this post, I’ll tell you what I learned last week.

The Azure Mobile Scheduler service

When we create a crawler to synchronize and validate our high-quality store locations, we need a way of automatically performing that process on a regular basis. We’re not going to keep 4000 data items up-to-date manually. So if we’ve got a script that does the indexing and synchronization, it needs to be executed once in a while. That is exactly what the Azure Mobile Scheduler service is intended for.

We created our script in ASP.NET and can expose its interface through a website. The scheduler service can be configured to retrieve that website page (read: kick off the crawling exercise) on a regular basis. But the scheduler can also be given the entire script to index store locations itself. The benefit of doing that is that all service operations can be executed autonomously in Azure Mobile Services. We don’t need to host the ASP.NET website in order to get our stores indexed. The downside is that we cannot reuse our C# code, since Azure Mobile Services scripts are essentially NodeJS JavaScript.
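As a sketch of the first variant (the job name crawlStores and the URL are made-up placeholders): a scheduled job in Azure Mobile Services is simply a JavaScript function named after the job, so kicking off our ASP.NET page could look something like this:

    // Scheduled job "crawlStores": in Azure Mobile Services the function
    // name matches the job name. The URL is a placeholder for the page
    // that exposes our ASP.NET crawler.
    function crawlStores() {
        var request = require('request');
        request('http://example.com/crawler/index-stores', function (error, response, body) {
            if (!error && response.statusCode === 200) {
                console.log('Crawl kicked off successfully.');
            } else {
                console.error('Crawl request failed: ' + error);
            }
        });
    }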

“No problem, I need to dive into NodeJS anyway to understand what it is and find out how to utilize it best. And I’ve sort of got the hang of JavaScript programming already, so what the heck … let’s go for it” (famous last words).

The first evening was spent on installing NodeJS on my development system, getting some essential packages downloaded for the task at hand (like request and cheerio), getting Azure Mobile Services to start using a Git instance for versioning the scripts, hooking it up to my Visual Studio instance, and testing the complete development round-trip. Works like a charm, only we’re one day further, and no value added to our solution.
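For completeness, getting those two packages in place is a single command on the development machine (assuming NodeJS and npm are already installed):

    npm install request cheerio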

The next evening I essentially made another visitor, this time for a retailer brand called EMTÉ. This is what their store location page looks like:

[Screenshot: the EMTÉ store locations page]

This page contains all information we need according to our information model’s definition of store properties, except for the zip code of the location. We’ll ignore that one for now, because we can easily retrieve it later on using the address and city of the location.

The EMTÉ visitor

We essentially want to do the same as with the previous visitor we made, but take it a step further:

  • Retrieve the webpage
  • Index store locations
  • Synchronize the found locations with a database

The first part was rather easy. Defining our Store object and reading the webpage in NodeJS JavaScript goes something like this:
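(A minimal sketch; the EMTÉ page URL and the exact Store properties are illustrative assumptions on my side.)

    // We require the request package at the top of the script.
    var request = require('request');

    // A fully qualified Store constructor; the property names are
    // assumptions based on our information model.
    function Store(name, address, city, phone) {
        this.name = name;
        this.address = address;
        this.city = city;
        this.phone = phone;
    }

    // Retrieve the store locations page (placeholder URL) and
    // check that all went fine.
    request('http://www.emte.nl/winkels', function (error, response, body) {
        if (error || response.statusCode !== 200) {
            console.error('Failed to retrieve the page: ' + error);
            return;
        }
        // body now holds the raw HTML, ready for parsing.
        console.log('Retrieved ' + body.length + ' characters of HTML.');
    });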

At the top, we indicate that we require the request package in our script. Then we create a Store object in JavaScript with a fully qualified constructor (which we are actually not using in this example anymore at all, waste of time :-|), and finally we retrieve the webpage and check that all went fine.

Next up is indexing the store locations from the website. At first I was planning on using cheerio to parse the DOM, like we did with the previous visitor in ASP.NET. But the webpage keeps the store locations themselves in so much unstructured JavaScript that it did not make any sense to try to get that working effectively, and we resorted to old-fashioned string-parsing habits. The draft copy-paste code that I’ve got working looks something like this:
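(A rough sketch of the idea only: the addMarker marker in the page source is an assumption, and body is the HTML retrieved in the previous snippet.)

    // Old-fashioned string parsing: walk the page source and cut out
    // the pieces between assumed markers. Not pretty, but it works.
    var stores = [];
    var pos = body.indexOf('addMarker(');
    while (pos !== -1) {
        var end = body.indexOf(')', pos);
        var args = body.substring(pos + 'addMarker('.length, end).split(',');
        // Plain object literals; the Store constructor goes unused here.
        stores.push({
            name: args[0].replace(/['"]/g, '').trim(),
            address: args[1].replace(/['"]/g, '').trim(),
            city: args[2].replace(/['"]/g, '').trim(),
            phone: args[3].replace(/['"]/g, '').trim()
        });
        pos = body.indexOf('addMarker(', end);
    }
    console.log('Indexed ' + stores.length + ' store locations.');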

Yes yes, I know. Hurts my eyes too. Variables not properly scoped, dumb usage of string manipulation, and it can probably be done with one-liners if we use the correct regular expressions. But I just wanted it to work! Polishing can be done later, when we’re entering the production phase. Essentially, the routine eats the page content and spits out the information we are interested in. It gives us 129 store locations in the end.

Finally, storage. Or better said, store synchronization. We want the script to update existing entries if data has changed, insert new stores not yet found in the database, and remove stores from the database that are no longer found on the webpage. And all with some proper change logging going on, to understand what changed and why. This is the purpose of the following piece of script:
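(Again a sketch, not the polished thing: the table name and property names are assumptions, the stores array comes from the parsing step, and tables.getTable is the Azure Mobile Services way of reaching the storage.)

    var storesTable = tables.getTable('stores');

    storesTable.read({
        success: function (dbStores) {
            var matched = {};  // an Object used as dictionary, keyed by store name
            dbStores.forEach(function (dbStore) {
                var found = stores.filter(function (s) {
                    return s.name === dbStore.name;
                })[0];
                if (!found) {
                    // No longer on the webpage: remove it from the database.
                    console.log('Removing store: ' + dbStore.name);
                    storesTable.del(dbStore.id);
                } else {
                    matched[found.name] = true;
                    // Compare each property and update on any change.
                    if (found.address !== dbStore.address ||
                        found.city !== dbStore.city ||
                        found.phone !== dbStore.phone) {
                        console.log('Updating store: ' + dbStore.name);
                        dbStore.address = found.address;
                        dbStore.city = found.city;
                        dbStore.phone = found.phone;
                        storesTable.update(dbStore);
                    }
                }
            });
            // Entries on the page that were not matched are new stores.
            stores.forEach(function (s) {
                if (!matched[s.name]) {
                    console.log('Inserting new store: ' + s.name);
                    storesTable.insert(s);
                }
            });
        }
    });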

After cleanup, most of this code will be moved to a re-usable function that will deliver one-liners in the main script … but not right now. For each entry in the database, the script checks whether the equivalent entry is still found on the webpage, and if so, compares each of its properties. If anything changed, it gets updated in the database. If the database entry is not found on the page, it is removed from the database. Any entries on the page that have not been used in checking the database entries are new stores, and therefore need to be added.

The Sneaky part

Creating the complete script as such took about two evenings. The other three evenings were spent on troubleshooting and working around Azure Mobile Services “issues”. Well, maybe not really issues, if you know how to do all this in the appropriate way from the start. But learning while using takes longer. For example, the Preview status of the Scheduler service is not just there because it looks pretty and new on the capability. Some things are just not so stable yet. If the script is screwed up, any further updates to the script might not actually execute anymore, and one needs to “reset” the Mobile Service by toggling the Dynamic Schema flag. It took a few hours to figure that one out, with some help from Jeff Sanders!

And the logging does not always give clear hints on what is actually wrong with a script. It took me some time to figure out that you cannot use a JavaScript object as the value of a dictionary if that dictionary is based on an Array type; using an Object-based dictionary was the solution (see the little sketch below). Finally, the script had to be edited and checked on the Azure Management portal itself, since the way the script gets and uses the tables for storage cannot be replicated on a development machine with Visual Studio at hand. That means: edit, save, run (since JavaScript is interpreted at run-time), go to the main page, open the log, check the result, find the problem, go to the editor again, etc. That round-trip is just awful. I will change the direct table access into service API data storage, to be able to run the script on my local system while the data is persisted in the cloud.
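The Array-versus-Object pitfall in a nutshell (the key and the store value are made up): JSON serialization silently drops string-keyed entries on an Array.

    var store = { name: 'EMTÉ Utrecht' };

    var byName = [];                  // Array misused as a dictionary
    byName['EMTÉ Utrecht'] = store;
    JSON.stringify(byName);           // gives "[]": the entry silently vanishes

    var byNameOk = {};                // plain Object: works as expected
    byNameOk['EMTÉ Utrecht'] = store;
    JSON.stringify(byNameOk);         // gives {"EMTÉ Utrecht":{"name":...}}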

Rather long post, but a lot has happened over the week. Next week I will attempt to get the missing zip code in place by searching for an online service that can provide me with that information. And I’ll also try to make the script a bit more re-usable and maintainable, so that one or two more retail website visitors can be indexed using the same approach as well. That should keep me busy!

Sweet dreams …

