A Parallel Universe

Just finished the postal code look-up using the Google Maps Geocoding API. Learned a lot about NodeJS and asynchronous programming, added another shop visitor, and modularized the code a bit to make parts reusable.

Adding another visitor

The visitor added this week is for a retailer called C1000. It has 264 store locations to index, and the page crawled looks like this:

[Screenshot: the C1000 store locator page (c1000_winkels)]

Well, actually, looking at the page source code, it turns out the page fetches the store locations asynchronously using AJAX calls to a JSON stream. So after fiddling around a bit, these are the only two interfaces to visit:

  • http://www.c1000.nl/webservices/c1000/wswinkelservice.asmx/GetGeoDataForAllWinkelsInJson
  • http://www.c1000.nl/webservices/c1000/wswinkelservice.asmx/GetNawForSpecifiedWinkelsInJson

The first stream delivers an id and location data for all available store locations. The second stream provides more specific info for a list of store ids. Since the second call does not seem to limit the number of stores in the list, we just need to call the first stream, collect the ids of all stores into a single array, and then retrieve the detailed information for all stores in one call to the second stream. A bit of JSON parsing, and we’re good to go!

For storing the new visitor’s data we want to use the same table as used for our first visitor, and we can re-use the script made for the first visitor for the same purposes. Since Windows Azure Mobile Services uses NodeJS as its core server scripting environment and has the RequireJS library preloaded, we can use this module dependency loading mechanism to modularize the code. I did this by moving the table storage part of the code into a separate JavaScript file (representing a module), like so:

A few specific things to mention here that differ from the previous post’s version of the same functionality:

  • The ‘exports.Exec’ function is the signature used for the public module functions that are exposed. That way we discriminate the internal from the external functions in the way supported by RequireJS.
  • The script checks whether the storeTable variable is null. When I run my script locally on my development system, the database is not available, yet I still want to be able to debug the script. This is a poor man’s solution to keep the script from breaking when debugging locally.
  • The console logging statements have disappeared. This has to do with some weird symptoms I was seeing due to the asynchronous nature of JavaScript processing by NodeJS. The script was executing unstably, sometimes doing (logging) stuff, sometimes not, which I thought was caused by the logging itself, but it was not … It was just STUPID me not understanding in enough detail how NodeJS takes asynchronous JavaScript processing to the next level. I will bring back appropriate logging in the next version.
  • The read statement on the table is prepended with an additional where clause that filters on a new column called ‘visitor’. This new column is needed to understand which visitor is indexing which store brands. As told before, some retailers use different combinations of end-consumer-visible store concepts, while their websites are shared. For example, the DetailResults group has a single website for Bas van der Heijden, Dirk van den Broek and Digros. While syncing the visitor results with the table, it needs to know which rows to sync with.
  • The zipcode check has an additional criterion: the value must not be null. Because the retailer’s website does not list the zipcode, the value is null during the sync. If we later correct the zipcode by other means in the table, we don’t want the visitor to overwrite it with a null value again.

The C1000 visitor itself looks like this:

Just a few notes of attention here as well:

  • The stream received from the website is not properly encoded; that is, it is JSON pushed over the line in a string-ified format. You often see this behavior with old-fashioned ASMX web services pushing out JSON. But it is no real problem for the JSON parser preloaded in NodeJS, as long as the parse function is applied.
  • The position information of the locations (latitude and longitude) is fixed to 5 decimals behind the separator. This is done to overcome the problem that the visitor gets a ‘real’ floating value, which is not stored as such by the database. The resulting effect is that every sync job will mark the location as changed, because the precision of the number in the database differs from the freshly retrieved values. Fixing the precision ‘fixes’ this behavior.
  • The ‘message’ variable is an array that gets all the log statements pushed into it. In the end, I want my logging pushed to the browser that invoked my resource instead of to the logging console of the Azure platform. The intention is to join the array elements just before the preloaded Express library’s send command is called, in order to get the final output stream to the browser. Note that Express does not support calling the ‘write’ or ‘end’ functions on the response object. We can only call send once to construct the final output page, instead of rendering it progressively using write and end.

Give me some proper postal codes … PLEASE!

I did a small inventory of Dutch services that would enable me to get a zipcode returned for an address, in order to fill in the missing values for the EMTE visitor. I found some dubious ‘free’ sites that would let me download such information after registration, but in the end it was not free at all, only a kind of preview. Luckily, our friends at Google Maps came to the rescue again.

Google Maps has a Geocoding API that does roughly what we want. Given a partial address, Google returns the full address (including postal code) it has found. The service comes with the restriction that it may only be used for displaying the data points on a Google Map. That is in fact my intention for the data, just not directly: I will first cache it in my database, which I feel comfortable enough with as far as complying with the rules goes. Furthermore, the daily limit is about 2500 searches, and there is a rate limit in place as well. And of course during development with NodeJS I ran into (and was blocked by) both, because the number of develop-run-test cycles with JavaScript is large, and the async nature of NodeJS actually pushed out all queries in parallel, making it look to Google like a DoS attack at some point, which was certainly not my intention ;-).

See the postal code visitor below:
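In outline it works like the sketch below. The Geocoding URL is Google's documented JSON endpoint of that era; the `geocode` callback shape and the entry fields are simplified assumptions:

```javascript
// Legacy Google Geocoding endpoint (append a URL-encoded address).
var GEOCODE_URL =
  'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=';

// Recursive worker: handle one entry, then reschedule itself 200 ms later
// so the Google rate limit is never hit by a burst of parallel calls.
function ValidateEntries(entries, index, geocode, done) {
  if (index >= entries.length) return done();
  geocode(entries[index], function (zipcode) {
    entries[index].zipcode = zipcode;
    // The closure captures entries/index, so each timeout resumes exactly
    // where the previous call left off -- no setInterval bookkeeping needed.
    setTimeout(function () {
      ValidateEntries(entries, index + 1, geocode, done);
    }, 200);
  });
}
```

The real `geocode` function would fetch `GEOCODE_URL` plus the partial address and pull the postal code out of the response's address components.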

And for this section the specifics are:

  • ‘ValidateEntries’ is a recursive function called for every entry that has a null value for zipcode in the database. The script is initially triggered by the exports.get function, which then kicks off the loop by means of a timeout. Embedding the closure may seem a bit superfluous, but trust me, without it your code will go haywire when trying to use the setTimeout function in the appropriate way.
  • setTimeout is used to relieve the stress on the Google Maps API by lowering the frequency to one call every 200 milliseconds. Because ValidateEntries is recursive and uses the same mechanism to reschedule itself, setInterval is not needed; the single-shot setTimeout is enough.

After thoughts

The more general lesson learned this week, and it was a hard one, is that the impact of an asynchronous programming model on your developer skill-set is comparable to that of moving from functional programming to object-oriented programming. The learning curve and training effort this paradigm shift requires are not to be underestimated. I typically like to do that learning as part of my work, because the context brings in realistic scenarios, and I can focus on resolving those with ad-hoc learning spikes. But for this one … a bit of up-front training would have enabled me to get more functional stuff done last week.

Async programming is like a parallel universe; quite the same, but quite different …

Let’s bring in another visitor as a next exercise and see if we can tidy up the modules a bit further … in the next post!
