Saturday, April 07, 2007

You Can Take this State And Shove It (Where the Cache Don't Shine)

So REST implies statelessness. But very few of the "standard" GIS interfaces are stateless... and if you pretend they are, the performance is tragic. So can they even be considered, or do they need to be papered over? Those persistent or bored enough to read to the conclusion of our little series will eventually learn this is a bit of a red herring. Indulge me. We'll talk about statelessness and its kissing cousin cacheability (and even a little URI construction) today as we tour eligible GIS protocols on our search for the applicability of REST to the GIS world.

The Files Are Okay...?

Open file-based formats (SHP, TIF, etc.) can be transported by any number of stateless (enough) protocols: http, ftp, etc. Well, except that they're so big. Implementing raster formats, for instance, has little to do with figuring out what to do with the bits and bytes or even compressing them, and much more to do with efficient access to smaller portions of these rasters. The tiles, pyramids, etc. all serve to allow someone to make use of a 3TB dataset a few MB at a time. And most professional-grade applications (like software updaters) that have to reliably download extremely large datasets use or implement a restartable protocol. So simply performing a GET on the whole file, while simple and stateless, is not very practical. In addition, updates to these resources typically require random access to the file; you're not going to want to POST and update the entire dataset. So we want read and write operations to be more straightforward than streaming masses of dumb files. File protocols are merely okay: we can read and write them. But they're not great, because those reads and updates are likely to be extremely inefficient.
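
To make "a few MB at a time" concrete, here is a minimal sketch of how a client might ask for just one window of a raster using HTTP byte ranges. It assumes a hypothetical uncompressed, row-major file with a fixed-size header; real formats with tiling and compression need their own index lookup first, so treat the layout and the function names as illustrative only.

```python
def window_byte_ranges(width, bytes_per_pixel, header_bytes,
                       col0, row0, cols, rows):
    """Return one (first_byte, last_byte) pair per row of the window.

    Assumes an uncompressed, row-major raster (a hypothetical layout).
    """
    ranges = []
    for row in range(row0, row0 + rows):
        start = header_bytes + (row * width + col0) * bytes_per_pixel
        end = start + cols * bytes_per_pixel - 1  # HTTP ranges are inclusive
        ranges.append((start, end))
    return ranges


def range_header(ranges):
    """Format byte pairs as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{first}-{last}" for first, last in ranges)


# A 4x2 pixel window out of a 1000-pixel-wide, 1-byte-per-pixel raster.
tile = window_byte_ranges(width=1000, bytes_per_pixel=1, header_bytes=0,
                          col0=10, row0=2, cols=4, rows=2)
print(range_header(tile))  # bytes=2010-2013,3010-3013
```

Each such request is still stateless and restartable, but the client has to know the file's internal layout, which is exactly the kind of coupling a resource-per-subset design would avoid.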

This means that subsets of the data should be easily addressable as resources to allow efficient, stateless access. It is certainly possible to throw every polygon you've got into its own shapefile, and allow their enumeration through a simple directory of hyperlinks to those files. But this style of data management is completely counter to how all the GIS libraries operate, and is indeed running against the grain of systems engineering in general. So I don't think simple path-name and file-based access will cut it. It's time to head to the world of databases.

Protocols of the ESRI stack

Without any hard stats whatsoever to back me up, I'll claim that half of the GIS datasets with legal or economic standing are generated with ESRI software. My careful wording excludes the (probably much more massive volumes of) data produced by scientists in the academic or war-waging worlds. Anyone know the real stats for how much of our world's real estate parcels, public land records, oil company studies, atlases on bookshelves, etc. are generated or edited with ESRI software? Anyway, the world of ESRI is the world of the "geodatabase".

The geodatabase is anything but stateless. It lives on top of a series of increasingly more serious databases. Databases are at the bottom of essentially every useful pile of software on the planet, but everyone seems to agree that no one at the other end of a RESTy (read: Internet) connection should talk directly to one. You can't trust people, you can't have all those connections open at once, the locking semantics are a nightmare, etc. It's just not a stateless world.

Recognizing this, ESRI has now proclaimed ArcGIS Server (and its older, now-red-headed stepchild ArcIMS) as the protocol of choice. ArcIMS can talk via the well-thought-out AXL protocol, which is pretty stateless and pretty sensible. They can dress both ArcGIS Server and ArcIMS in a nice OGC WMS wrapper, which makes pretty pictures easily readable.

But reading isn't the really fun part. I want to write, to publish. Even if we're willing to drink the Kool-Aid and use ArcGIS Server to publish feature classes and rasters, are we going to open up our universe to the intricate, definitely very stateful DCOM? ESRI has made it very clear: they're willing to support little toys like open specs so long as someone else does the work. But they are adamant that all data creation take place on the highly expensive desktop monopoly known as ArcGIS Desktop. The new mantra is "author... publish... use!", where the "author" part is most emphatically to be done with a $15,000 dongle attached to your PC! (See slides 26 & 28 from the ESRI 2007 Developer's Summit Keynote. Now there's a fun example of the files URL and statelessness conundrum. That's a big heavy slide deck, and I only wanted to call out one or two slides. Keep that in mind as you download the whole gratuitously beautiful thing and scroll to that slide. It'll be important later. What if you wanted slide #27 later?)

In summary: I'll happily write off the whole ESRI software stack as the surface of a RESTful GIS architecture for these two reasons.

  • Publishing Uses the Wrong Protocols. Publishing in the ESRI world consists of nothing more than configuring extremely expensive proprietary software to serve read-only copies of data held in geodatabases (if you're lucky) or files. There's little facility for an end user to publish new data to an ArcGIS server, or tie together data from arbitrary (stateless, cacheable!) sources. That's the dirty secret: ArcGIS Server is really a SOAP and/or DCOM-happy catalog for other ESRI data sources. If you didn't like the unorganized and hard-to-scale model of "throwing your MXDs on a shared drive", you're not likely to like the SOAPy version of the same thing: "throw your MXDs on a shared SOAP server and serve them up using a thin patina of WMS". No, it's the wrong set of capabilities: there's no ability to truly post self-referencing features, rasters, attributes, etc. in a coherent data-managed whole. It's still MXDs pointing to geodatabases, seas of XML notwithstanding.
  • Not the Right Economic Incentives. It's hard to fault ESRI for their position that all GIS data should be authored on ArcGIS Desktop. For certain tasks, it's not a bad platform, and it has lots of support from a large vendor community. But it's simply too profitable for them to cast aside in the name of a more machine-friendly, Internet-friendly authoring and publishing paradigm. (Any punters out there know whether ArcGIS Server is profitable?) There may be developers at ESRI just itching to open up spatial data management to REST-like protocols, but a few thrown chairs in meetings and occasional visits to Mr. Dangermond's office will have the suitable chilling effects. After all, someone's got to pay the bills. Seriously, I can't blame them, wish I had Jack's bank account, etc., but it does mean it's hard to pin much hope on this sort of innovation coming from Redlands, in the same way it's hard to imagine Microsoft will release the next truly revolutionary Internet or browser innovation, even if someone in Microsoft Research invents it.

The OGC contributions... or the tyranny of the FID

We've covered WMS several times in this series tangentially. Summary: it's cool. It's stateless, it serves up content that can be used to find other content, etc. But how can I publish data to a WMS server? Basically, by getting on the phone to the site administrator. Not stateless, not self-describing, not machine-friendly, not cool.

That's where WFS comes in. It's stateless, and its responses are generally self-describing. I approve. Let's skip straight ahead to protocol URL particulars: there is some good stuff in here, too. For instance, section 7.1 of the spec states "every feature instance that a particular WFS implementation can operate on is uniquely identifiable". 7.1.1 even talks about globally unique identifiers. But reality strikes. ESRI doesn't support the insert/update functionality of WFS (see above cynicism!). One big open source WFS effort (GeoServer) basically passes underlying datasource FID fields through as IDs, and I don't view this as very self-describing. If I want to pass around the URL of a record describing the parcel of land my house sits on, no one's going to be impressed if I tell them it involves knowing that it was the 87,113th record of a SHP file. (And that was before I updated it, giving it a new FID, which I'm sure will happen despite the requirement WFS imposes that the id not change after an update...). Wouldn't the parcel number or address be more learnable and interactable?
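
To see the difference in URL terms, here is a small sketch that builds both kinds of address. The first uses the WFS 1.1.0 key-value GetFeature request with a feature id; the second is a purely hypothetical natural-key URL scheme of the sort the paragraph above wishes for. The host, the "parcels" type name, the "parcels.87113" id format, and the county/parcel path layout are all assumptions made for illustration.

```python
from urllib.parse import quote, urlencode

WFS_ENDPOINT = "http://example.com/wfs"  # hypothetical server


def wfs_get_feature_url(type_name, feature_id):
    """Build a WFS 1.1.0 KVP GetFeature request for one feature id."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "featureID": feature_id,
    }
    return WFS_ENDPOINT + "?" + urlencode(params)


def parcel_url(county, parcel_number):
    """Hypothetical natural-key URL: stable even if the storage row moves."""
    return "http://example.com/parcels/{}/{}".format(
        quote(county, safe=""), quote(parcel_number, safe=""))


print(wfs_get_feature_url("parcels", "parcels.87113"))
print(parcel_url("travis", "02-1234-0567"))
```

The first URL only means something to someone who knows the storage order of the underlying file; the second can be guessed, bookmarked, and survives a reload of the dataset, provided the server maintains the mapping from parcel number to record.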

In other words WFS gives us a common language, but lousy guarantees that the resources we talk to are findable in a natural way. And this might be splitting hairs, but the model is that the server does work on the features on your behalf (including transactionality), not that you are interacting directly with the resources (features). So I see a good start with WFS, but there's something missing. I'll come back to what it is — and it's ultimately the issue lying beneath all this general dissatisfaction with the GIS software stack.

All Hail GDAL/OGR

And now a very brief aside into a tiny technical issue. GDAL (for rasters) and OGR make no claims to implement client-server protocols, but I bring them up because they must form the foundation of any realistic large-scale implementation of public/open GIS servers. (I'm happy to be proven wrong by GeoTools or hear about other options.) If you want stateless, cacheable architectures you've got to scale on Internet servers. Which means you have to be multithreaded. GDAL came of age when this was not a consideration, so it's hard to blame them, but the lack of thread-safety is a practical technical obstacle to Internet-scale services. The problem is being addressed, but as with all open-source projects, it's probably a matter of a large corporate sponsor needing the work done or a brilliant programmer getting suddenly bored or laid off before it happens. You never know, try sending Frank a Rush t-shirt and see what happens.
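
Until a library is thread-safe, one common workaround (a general pattern, not something GDAL or OGR prescribes) is to give each server thread its own handle rather than sharing one. The sketch below uses a stand-in class, FakeDataset, which is hypothetical and deliberately does not imitate any real GDAL API; it only checks that a handle is never used from a thread other than the one that opened it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDataset:
    """Stand-in for a handle that is not safe to share between threads."""

    def __init__(self, path):
        self.path = path
        self.owner = threading.get_ident()

    def read_tile(self, tile_id):
        # A real handle might corrupt its state here; we just check ownership.
        assert threading.get_ident() == self.owner, "handle crossed threads"
        return f"{self.path}#tile{tile_id}"


_local = threading.local()


def dataset_for_this_thread(path):
    """Open the dataset at most once per thread and reuse it afterwards."""
    handles = getattr(_local, "handles", None)
    if handles is None:
        handles = _local.handles = {}
    if path not in handles:
        handles[path] = FakeDataset(path)
    return handles[path]


def serve_tile(tile_id):
    return dataset_for_this_thread("parcels.tif").read_tile(tile_id)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(serve_tile, range(8)))
print(results[0])  # parcels.tif#tile0
```

The cost is one open handle (and its memory) per thread per dataset, which is the sort of overhead a genuinely thread-safe library would let you avoid.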

And So What Now?

I've covered the "traditional" GIS-centric protocols with which I am most familiar. I know there is a world of CAD and other GIS vendors (including open source stuff) out there that is getting swept under the proverbial rug. But this survey paints a somewhat patchy picture. Some aspects of the GIS stack are good: WMS, GML, WFS, shapefiles, and some are not so good: ESRI geodatabases, WFS transactions. (And some need roof repairs but are still attractive: GDAL/OGR.)

Next time we'll look at KML, GeoRSS (and its distant relative APP) — the kids that seem to be hanging out at all the popular bars these days — and what they might teach us. All on our way to the conclusion that may have been not-so-subtly coming through in my commentary: there is no spoon.

Friday, April 06, 2007

Rosa Parks Rides the Enterprise Service Bus

We're talking about REST these days and left off musing about enterprise protocols more exotic than HTTP. What does it even mean to be RESTful without HTTP? To the great striving middle class of programmers, it seems like the programming universe is being rent asunder by incredibly boring arguments about HTTP verbs, tortured analogies with SQL, and general ennui.

In the midst of this mayhem, the marketing drones interject their RESTful sales pitches for "Enterprise Service Buses" blissfully unaware that they are only getting paid because the world accidentally invented the World's Biggest Enterprise Service Bus (hint: it's called The Internet). They are able to walk the path from "the Internet is good", to "the Internet is mostly REST", to "we should make money off REST by convincing people they need expensive message passing software" without getting dizzy -- probably thanks to fancy SkyMall devices not available to the REST of us.

And honestly, if REST is an architectural state of mind (the journey, not the destination), why do I have to read endlessly about whether browsers support PUT and DELETE? So let's ignore HTTP for a while and talk about REST. The rest of us should be allowed to ride this bus after all. This will take some time, so this article merely lays out the rules of the ride: and we are not giving up our seat.

According to Roy Fielding according to Wikipedia (and when has that ever been wrong?)

REST enables intermediate processing by constraining messages to be self-descriptive: interaction is stateless between requests, standard methods and media types are used to indicate semantics and exchange information, and responses explicitly indicate cacheability.
  • Self-Descriptive. I covered this last time. The GIS world is in decent shape here with broad adoption of standards from vectors to rasters to projections. (I'll have to save my "so why does ESRI think they need to invent another raster and vector format?" rant for another time.)
  • Stateless. The cool kids like the word idempotent because it almost sounds naughty. Statelessness is tough. In the projects I work on, most of our GIS data comes from Oracle via SDE. Now there's a connection which is far from stateless -- the whole thing is predicated on transactionality and limited concurrency. To add insult to injury, the entire ESRI software stack is spectacularly un-resilient to connection failure with a geodatabase. I work in oil and gas, so OpenSpirit is everywhere. This is a shared-memory and RPC protocol fresh off the design shelves of the 1980s, complete with leg warmers and stubble.
  • Cacheability. This matters a lot for GIS with such staggering volumes of data. Google Maps didn't take off because the map was slidey -- it took off because it was fast, and that happened because of caching. But the public internet relies on HTTP caching proxies for the magic of tiled maps and blog feeds. What do we do in the enterprise? You're going to cache an Oracle response? Caching also matters for editing if we're talking about seriously complex data like is found in oil and gas, where even main memory never seems fast enough.
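
As a rough illustration of what "responses explicitly indicate cacheability" buys you, the sketch below imitates HTTP's ETag / If-None-Match revalidation entirely in memory. The Origin class stands in for an expensive backend such as a spatial database; the class names, the "/wells/42" path, and the sample geometry are all hypothetical.

```python
import hashlib


class Origin:
    """Stand-in for an expensive backend that tags each response."""

    def __init__(self):
        self.features = {"/wells/42": b"POINT(-95.36 29.76)"}
        self.bytes_sent = 0

    def get(self, path, if_none_match=None):
        body = self.features[path]
        etag = hashlib.sha256(body).hexdigest()[:16]
        if if_none_match == etag:
            return 304, etag, b""  # "Not Modified": no body is resent
        self.bytes_sent += len(body)
        return 200, etag, body


class CachingClient:
    """Keeps the last body per path and revalidates it with its ETag."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}  # path -> (etag, body)

    def get(self, path):
        cached = self.store.get(path)
        known_etag = cached[0] if cached else None
        status, etag, body = self.origin.get(path, known_etag)
        if status == 304:
            return cached[1]
        self.store[path] = (etag, body)
        return body


origin = Origin()
client = CachingClient(origin)
first = client.get("/wells/42")
second = client.get("/wells/42")  # served from the cache after a 304
print(first == second, origin.bytes_sent == len(first))  # True True
```

Nothing here depends on HTTP itself; any enterprise protocol that can return a cheap "unchanged" answer for a known version tag gets the same benefit.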

And we haven't even started talking about giving every blessed polygon a name -- a URI. Since URIs start with protocol specifiers, I guess we have more to do. We'll start with stateless next time.

Wednesday, April 04, 2007

God REST Ye Merry Polygons

Since REST is all the rage these days, it must be about time to ask what it means for GIS. A lot of the striving middle class of programmers believes that REST is about not using SOAP. Questions of hygiene aside, it's certainly about a lot more than that. Is GIS ready for REST interfaces that are more than warmed-over POST interfaces?

REpresentational

REST is about exchanging representations of data. You've got to speak a common language. Here, surprisingly, GIS is in good shape. The Open Geospatial Consortium (OGC) guys have had a good amount of luck getting some useful standards into widespread use. It's true that WMS and GML are pretty big nasty specs, but they have real traction with a lot of vendors. WMS was turned on its head by Google's tiling, but the open source community[*] is back with a tiling idea. Google launched KML on an unsuspecting universe, but if you grok GML you can definitely grok KML. And we can even thank Big Oil for funding the only sensible world-wide collection of projection systems, the erstwhile EPSG database. GIS data can be transported around via XML and everyone can agree on what it means. Good deal.
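
For readers who haven't met tiling, the idea is that every map image gets a fixed address derived from zoom level and position, instead of an arbitrary bounding box per request. A minimal sketch of the widely used Web Mercator "XYZ" tile arithmetic follows; whether any particular vendor lays out its URLs this way is an assumption, and the URL template is made up.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge still land in the last tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_url(lon_deg, lat_deg, zoom):
    """Hypothetical URL template: one stable address per tile."""
    x, y = lonlat_to_tile(lon_deg, lat_deg, zoom)
    return f"http://example.com/tiles/{zoom}/{x}/{y}.png"


print(tile_url(0.0, 0.0, 1))  # http://example.com/tiles/1/1/1.png
```

Because every client asking about the same patch of ground computes the same URL, the responses can be served from any ordinary HTTP cache.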

State Transfer

Okay, here OGC gets less important. OGC talks about transactional feature operations in their Web Feature Service (WFS) spec. It seems like people are slowly supporting this spec, but it is firmly in the older school of handling transactions and data transfer in an RPC style, rather than in a resource-focused style. Of course most heavy lifting is still done with your choice of GDAL and OGR drivers, spatial database extensions, and CAD & GIS packages operating on files. State Transfer in the public standards-based GIS world (where REST really starts to achieve critical mass) is still all about the read-only flow of data from publishers to browsers.

The exception to all of this, and arguably the only end-user publishing game that matters, is Google Maps/Earth world, where users can post data using GeoRSS and KML. Real world resources (at random URIs) can be tagged with KML and found by Google. So for actually sharing data, not just pictures, KML is the name of the game.

So What About Enterprise GIS Data?

Indeed. Companies are less interested in the exact GPS locations of random jogger X's morning workout routine than in having a cohesive architecture for all their money-producing geolocated data, like oil wells, supermarkets and garbage trucks. Just because those resources have lat/longs doesn't mean that GIS should suddenly be the center of the universe. Why do I have to use KML for resources that probably have extremely rich alternate representations already? And what URLs do I give them? On the Internet, everything is usually an http:// or an ftp:// away -- there are only so many protocols in common use. But on an enterprise network there are RDBMS servers, file shares, SDE, WMS, random shared-process based middleware, and so on. Do these get their own protocols? To (finally) get to the topic this article's title suggests: what does a polygon's URL look like? Does the HTTP-centric REST even matter in a world of non-HTTP protocols -- protocols which are largely stateFUL, not stateLESS? Questions I will be grappling with over the coming articles. Stay tuned.

[*] This article originally stated that tiling was an OGC effort, which is not correct. A poster pointed out that the tiling effort I referenced was not OGC. I suspect OGC will not be far behind.

Monday, April 02, 2007

Hey ASP.NET: File it in the Circular References File

Quick quiz: What can make your ASP.NET 2.0 web application compile about 50% of the time? You know, now-it-compiles, now-it-doesn't? Circular References.

The designers of ASP.NET 2.0 were extremely worried about performance, especially for large websites. (Where large is a lot of controls, not a lot of visitors.) The ASPX 1.1 behavior of compiling an entire website's ASCX/ASPX markup was simply crippling to developers. So the ASPX 2.0 guys figured, hey, let's just compile pieces of the web application when they're needed. To this end they cordoned off the App_Code folder which effectively becomes its own assembly. Fine -- once you know the trick.

But their next choice was a little staggering. They decided to compile non-deterministic subsets of the application. If I hit /fred.aspx, the application might decide to also compile /ethel.ascx and /lucille.ascx... or it might decide that /ricardo.aspx "feels right" today. Again, if that floats the ASPX team's collective boats, why not. But unfortunately this non-deterministic model will occasionally choose subsets of the web site that do not correctly compile! The very semantics of program correctness are now random!

The issue is with controls which engage in what the ASPX folks (incorrectly, I believe) term circular references. Brendan Tompkins summarizes the issue and its solution quite well. If you have a true circular reference (fred.ascx requires ethel.ascx requires lucille.ascx requires fred.ascx), ASPX says you're hosed and must break the dependency by introducing an abstract base class. Bummer, but okay. However, because the compilation algorithm chosen by the ASPX pre-compiler is non-deterministic but usually folder-based, it is possible for the compiler to believe that controls participate in a circular reference when in fact only their folders do. You will drive yourself nuts trying to find the actual circular reference!
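
The folder-versus-control distinction is easy to show with a toy dependency graph. The sketch below is not the ASP.NET batching algorithm, only an illustration of the symptom: the controls themselves have no cycle, but collapsing them to their folders creates one. The file names and the two folders "a" and "b" are made up.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a {node: [dependencies]} graph."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: we looped onto our own path
                return True
            if state == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))


def by_folder(graph):
    """Collapse each control to its folder, keeping cross-folder edges."""
    folders = {}
    for node, deps in graph.items():
        src = node.split("/")[0]
        folders.setdefault(src, set())
        for dep in deps:
            dst = dep.split("/")[0]
            if dst != src:
                folders[src].add(dst)
    return folders


controls = {
    "a/fred.ascx": ["b/ethel.ascx"],
    "b/lucille.ascx": ["a/ricardo.ascx"],
    "b/ethel.ascx": [],
    "a/ricardo.ascx": [],
}
print(has_cycle(controls))             # False: no control depends on itself
print(has_cycle(by_folder(controls)))  # True: folder a needs b, b needs a
```

Moving lucille.ascx into folder "a" removes the folder-level cycle without touching a line of code in any control, which is why the "keep everything in one folder" advice works.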

You can keep all your controls in a single folder, or at least only a few folders. That's a shame. But the real shame is that if you actually use Reflector to read the compiler in ASPX, you'll see it does a lot of homework to figure out who depends on what and how to chunk the compile up.

But surely, if the keen eager boys and girls from Redmond are already doing so much work as to determine the dependency graph of controls in the application, couldn't they apply any of a bazillion graph connectivity algorithms to compile subsets that are guaranteed to compile?! If you feel like hiding your extremely fancy compilation algorithms from me, please do so. Don't expose non-deterministic semantics into the framework!
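
As one example of the kind of algorithm meant here, Tarjan's strongly-connected-components algorithm groups controls that genuinely depend on each other into a single batch and emits the batches dependencies-first. This is a sketch of the general technique, not a claim about what the ASP.NET compiler does or could easily do; the file names are invented.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm on a {node: [dependencies]} graph.

    Components are returned dependencies-first, so compiling them in
    order never needs a batch that has not been built yet.
    """
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            result.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return result


deps = {
    "page.aspx": ["menu.ascx", "cart.ascx"],
    "menu.ascx": ["cart.ascx"],
    "cart.ascx": ["menu.ascx"],  # a true mutual dependency
    "footer.ascx": [],
}
for batch in strongly_connected_components(deps):
    print(batch)
```

Here menu.ascx and cart.ascx land in one batch because they really do need each other, page.aspx follows in its own batch, and the result is the same on every run regardless of which folder each file lives in.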