You Can Take this State And Shove It (Where the Cache Don't Shine)
So REST implies statelessness. But very few of the "standard" GIS interfaces are stateless... and if you pretend they are, the performance is tragic. So can they even be considered, or do they need to be papered over? Those persistent or bored enough to read to the conclusion of our little series will eventually learn this is a bit of a red herring. Indulge me. We'll talk about statelessness and its kissing cousin cacheability (and even a little URI construction) today as we tour eligible GIS protocols on our search for the applicability of REST to the GIS world.
The Files Are Okay...?
Open file-based formats (SHP, TIF, etc.) can be transported by any number of stateless (enough) protocols: http, ftp, etc. Well, except that they're so big. Implementing raster formats, for instance, has little to do with figuring out what to do with the bits and bytes or even compressing them; it has to do with efficient access to smaller portions of these rasters. The tiles, pyramids, etc. all serve to allow someone to make use of a 3TB dataset a few MB at a time. And most professional-grade applications (like software updaters) that have to reliably download extremely large datasets use or implement a restartable protocol. So simply performing a GET on the whole file, while simple and stateless, is not very practical. In addition, updates to these resources typically require random access to the file: you're not going to want to POST and update the entire dataset. So we want read and write operations to be more straightforward than streaming masses of dumb files. File protocols are merely okay: we can read and write them. But they're not great, because those reads and updates are likely to be extremely inefficient.
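To make the "restartable" and "a few MB at a time" ideas concrete, here is a minimal Python sketch of the headers a client would send to resume a download or fetch one block of a big file over HTTP. It assumes the server honors byte-range requests; the function names are mine, made up for illustration, and not part of any GIS library.

```python
def resume_range_header(bytes_on_disk, total_size=None):
    """Header for picking up a download where it left off.

    Returns None when the file is already complete (only knowable if
    the caller supplies total_size).
    """
    if bytes_on_disk < 0:
        raise ValueError("bytes_on_disk must be non-negative")
    if total_size is not None and bytes_on_disk >= total_size:
        return None
    # An open-ended range: "from this byte to the end of the file".
    return {"Range": f"bytes={bytes_on_disk}-"}


def block_range_header(offset, length):
    """Header for reading `length` bytes starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}
```

Each request carries everything the server needs, so this stays stateless. What it does not give you is any notion of which bytes hold which part of the map; the client still has to understand the file's internal layout.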
This means that subsets of the data should be easily addressable as resources to allow efficient, stateless access. It is certainly possible to throw every polygon you've got into its own shapefile, and allow their enumeration through a simple directory of hyperlinks to those files. But this style of data management is completely counter to how all the GIS libraries operate, and is indeed running against the grain of systems engineering in general. So I don't think simple path-name and file-based access will cut it. It's time to head to the world of databases.
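As a rough illustration of what "addressable subsets" could look like for a raster, here is a Python sketch that maps a pixel to a tile in a pyramid and gives that tile its own URI. Everything here is an assumption for the sake of the example: the 256-pixel tile size, the rule that each pyramid level halves the resolution, the path layout, and the example.org host.

```python
def tile_for_pixel(x, y, level, tile_size=256):
    """Return (col, row) of the tile holding full-resolution pixel (x, y).

    Assumes level 0 is full resolution and each higher level halves it.
    """
    scale = 2 ** level
    return (x // scale) // tile_size, (y // scale) // tile_size


def tile_uri(base, dataset, level, col, row):
    """Build a URI for one tile, using a made-up path layout."""
    return f"{base}/{dataset}/{level}/{col}/{row}.tif"


col, row = tile_for_pixel(1000, 600, level=1)
print(tile_uri("http://example.org/rasters", "dem", 1, col, row))
```

The point is only that every tile becomes a small resource that can be fetched and cached on its own, without the server remembering anything about the client.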
Protocols of the ESRI stack
Without any hard stats whatsoever to back me up, I'll claim that half of the GIS datasets with legal or economic standing are generated with ESRI software. My careful wording excludes the (probably much more massive volumes of) data produced by scientists in the academic or war-waging worlds. Anyone know the real stats for how much of our world's real estate parcels, public land records, oil company studies, atlases on bookshelves, etc. are generated or edited with ESRI software? Anyway, the world of ESRI is the world of the "geodatabase".
The geodatabase is anything but stateless. It lives on top of a series of increasingly more serious databases. Databases are at the bottom of essentially every useful pile of software on the planet, but everyone seems to agree that no one at the other end of a RESTy (read: Internet) connection should talk directly to one. You can't trust people, you can't have all those connections open at once, the locking semantics are a nightmare, etc. It's just not a stateless world.
Recognizing this, ESRI has now proclaimed ArcGIS Server (and its older, now-red-headed stepchild ArcIMS) as the protocol of choice. ArcIMS can talk via the well-thought-out AXL protocol, which is pretty stateless and pretty sensible. They can dress both ArcGIS Server and ArcIMS in a nice OGC WMS wrapper, which makes pretty pictures easily readable.
But reading isn't the really fun part. I want to write, to publish. Even if we're willing to drink the Kool-Aid and use ArcGIS Server to publish feature classes and rasters, are we going to open up our universe to the intricate, definitely very stateful DCOM? ESRI has made it very clear: they're willing to support little toys like open specs so long as someone else does the work. But they are adamant that all data creation take place on the highly expensive desktop monopoly known as ArcGIS Desktop. The new mantra is "author... publish... use!", where the "author" part is most emphatically to be done with a $15,000 dongle attached to your PC! (See slides 26 & 28 from the ESRI 2007 Developer's Summit Keynote. Now there's a fun example of the files, URLs, and statelessness conundrum. That's a big heavy slide deck, and I only wanted to call out one or two slides. Keep that in mind as you download the whole gratuitously beautiful thing and scroll to that slide. It'll be important later. What if you wanted slide #27 later?)
In summary: I'll happily write off the whole ESRI software stack as the surface of a RESTful GIS architecture for these two reasons.
- Publishing Uses the Wrong Protocols. Publishing in the ESRI world consists of nothing more than configuring extremely expensive proprietary software to serve read-only copies of data held in geodatabases (if you're lucky) or files. There's little facility for an end user to publish new data to an ArcGIS Server, or tie together data from arbitrary (stateless, cacheable!) sources. That's the dirty secret: ArcGIS Server is really a SOAP and/or DCOM-happy catalog for other ESRI data sources. If you didn't like the unorganized and hard-to-scale model of "throwing your MXDs on a shared drive", you're not likely to like the SOAPy version of the same thing: "throw your MXDs on a shared SOAP server and serve them up using a thin patina of WMS". No, it's the wrong set of capabilities: there's no ability to truly post self-referencing features, rasters, attributes, etc. in a coherent data-managed whole. It's still MXDs pointing to geodatabases, seas of XML notwithstanding.
- Not the Right Economic Incentives. It's hard to fault ESRI for their position that all GIS data should be authored on ArcGIS Desktop. For certain tasks, it's not a bad platform, and it has lots of support from a large vendor community. But it's simply too profitable for them to cast aside in the name of a more machine-friendly, Internet-friendly authoring and publishing paradigm. (Any punters out there know whether ArcGIS Server is profitable?) There may be developers at ESRI just itching to open up spatial data management to REST-like protocols, but a few thrown chairs in meetings and occasional visits to Mr. Dangermond's office will have the suitable chilling effects. After all, someone's got to pay the bills. Seriously, I can't blame them, wish I had Jack's bank account, etc., but it does mean it's hard to pin many hopes on this sort of innovation coming from Redlands, in the same way it's hard to imagine Microsoft will release the next truly revolutionary Internet or browser innovation, even if someone in Microsoft Research invents it.
The OGC contributions... or the tyranny of the FID
We've covered WMS several times in this series tangentially. Summary: it's cool. It's stateless, it serves up content that can be used to find other content, etc. But how can I publish data to a WMS server? Basically, by getting on the phone to the site administrator. Not stateless, not self-describing, not machine-friendly, not cool.
That's where WFS comes in. It's stateless, and its responses are generally self-describing. I approve. Let's skip straight ahead to protocol URL particulars: there is some good stuff in here, too. For instance, section 7.1 of the spec states "every feature instance that a particular WFS implementation can operate on is uniquely identifiable". 7.1.1 even talks about globally unique identifiers. But reality strikes. ESRI doesn't support the insert/update functionality of WFS (see above cynicism!). One big open source WFS effort (GeoServer) basically passes underlying datasource FID fields through as IDs, which I don't view as very self-describing. If I want to pass around the URL of a record describing the parcel of land my house sits on, no one's going to be impressed if I tell them it involves knowing that it was the 87,113th record of a SHP file. (And that was before I updated it, giving it a new FID, which I'm sure will happen despite the requirement WFS imposes that the id not change after an update...). Wouldn't the parcel number or address be more learnable and interactable?
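To show the contrast, here is a small Python sketch that builds both kinds of address: a WFS key-value GetFeature request keyed on a feature ID, and a path-style URI keyed on something a human might actually know. The host, layer name, FID value, parcel number, and the path layout of the second URI are all made up for illustration; only the WFS parameter names come from the spec as I understand it.

```python
from urllib.parse import quote, urlencode


def wfs_getfeature_url(endpoint, feature_id):
    """Build a WFS 1.1.0 key-value GetFeature request for one feature ID."""
    params = {
        "SERVICE": "WFS",
        "VERSION": "1.1.0",
        "REQUEST": "GetFeature",
        "FEATUREID": feature_id,
    }
    return endpoint + "?" + urlencode(params)


def natural_feature_uri(base, layer, key):
    """Build a path-style URI from a natural key (hypothetical layout)."""
    return f"{base}/{quote(layer, safe='')}/{quote(key, safe='')}"


# The record-number flavor: opaque unless you know the storage order.
print(wfs_getfeature_url("http://example.org/wfs", "parcels.87113"))
# The natural-key flavor: guessable from a parcel number on a tax bill.
print(natural_feature_uri("http://example.org/features", "parcels", "123-45-678"))
```

Both are stateless and both are perfectly cacheable. The difference is whether the identifier survives a reload of the underlying file and whether anyone could ever guess it.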
In other words WFS gives us a common language, but lousy guarantees that the resources we talk to are findable in a natural way. And this might be splitting hairs, but the model is that the server does work on the features on your behalf (including transactionality), not that you are interacting directly with the resources (features). So I see a good start with WFS, but there's something missing. I'll come back to what it is — and it's ultimately the issue lying beneath all this general dissatisfaction with the GIS software stack.
All Hail GDAL/OGR
And now a very brief aside into a tiny technical issue. GDAL (for rasters) and OGR (for vectors) make no claims to implement client-server protocols, but I bring them up because they must form the foundation of any realistic large-scale implementation of public/open GIS servers. (I'm happy to be proven wrong by GeoTools or hear about other options.) If you want stateless, cacheable architectures you've got to scale on Internet servers. Which means you have to be multithreaded. GDAL came of age when this was not a consideration, so it's hard to blame them, but the lack of thread-safety is a practical technical obstacle to Internet-scale services. The problem is being addressed, but as with all open-source projects, it's probably a matter of a large corporate sponsor needing the work done or a brilliant programmer getting suddenly bored or laid off before it happens. You never know, try sending Frank a Rush t-shirt and see what happens.
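One common way to live with a library whose handles can't be shared between threads is to give every worker thread its own handle. Here is a minimal Python sketch of that pattern. It is a generic workaround, not something GDAL ships, and the class name and the `opener` argument are my own inventions.

```python
import threading


class PerThreadHandles:
    """Hand each thread its own private handle for a given path."""

    def __init__(self, opener):
        # `opener` is any callable that takes a path and returns a handle.
        self._opener = opener
        self._local = threading.local()

    def get(self, path):
        # threading.local gives every thread a separate attribute namespace,
        # so this dictionary is never seen by more than one thread.
        cache = getattr(self._local, "handles", None)
        if cache is None:
            cache = {}
            self._local.handles = cache
        if path not in cache:
            cache[path] = self._opener(path)
        return cache[path]
```

With the GDAL Python bindings installed, something like `PerThreadHandles(gdal.Open)` would be the obvious way to wire it up, on the assumption that separate handles in separate threads are safe for reading. The cost is one open file per thread per dataset, which is exactly the kind of overhead a thread-safe library would spare you.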
And So What Now?
I've covered the "traditional" GIS-centric protocols with which I am most familiar. I know there is a world of CAD and other GIS vendors (including open source stuff) out there that is getting swept under the proverbial rug. But this survey paints a somewhat patchy picture. Some aspects of the GIS stack are good: WMS, GML, WFS, shapefiles, and some are not so good: ESRI geodatabases, WFS transactions. (And some need roof repairs but are still attractive: GDAL/OGR.)
Next time we'll look at KML, GeoRSS (and its distant relative APP) — the kids that seem to be hanging out at all the popular bars these days — and what they might teach us. All on our way to the conclusion that may have been not-so-subtly coming through in my commentary: there is no spoon.