Wednesday, June 25, 2008

ArcGIS Explorer Build 480: Threading!

ArcGIS Explorer (AGX to its friends, apparently) has a new release out, with a great number of promised new features. I'd watched previous builds of AGX through an HTTP proxy to see what all it got up to while fetching images from the web. Investigation showed that it was brutally single-threaded and not terribly bright about what order to retrieve tiles in.

Build 480 changes those things a great deal for the better. Even when looking at a single layer, such as the default satellite/aerial imagery provided, you can see AGX running 2 to 4 simultaneous connections to fetch tiles. Their documentation promises multi-threaded downloads only on a dual core machine, which I've got, so your mileage may vary. I would have thought that even a single core machine would benefit from having several HTTP requests in flight, as they involve so much waiting around, but one suspects ESRI did some performance testing.
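The point about waiting can be shown with a minimal sketch. It assumes nothing about how AGX is built: `fetch_tile` is a hypothetical stand-in that just sleeps the way a thread blocked on the network would, and the latency and worker counts are made-up numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(tile_id, latency=0.05):
    """Hypothetical stand-in for an HTTP tile request: the thread only waits."""
    time.sleep(latency)
    return tile_id

def fetch_all(tile_ids, workers):
    """Fetch every tile with `workers` threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch_tile, tile_ids))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    tiles = list(range(8))
    _, one = fetch_all(tiles, workers=1)
    _, four = fetch_all(tiles, workers=4)
    print(f"1 connection: {one:.2f}s, 4 connections: {four:.2f}s")
```

Because the threads spend their time blocked rather than computing, the speedup in this toy does not depend on the number of cores, which is why the dual-core requirement looks surprising.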

When connecting to a very slow service, such as most of the WMS servers out there, I was able to see AGX with as many as 10 connections in flight at once. This is good! There are still some starvation issues (the crappy WMS service kept the fairly snappy ESRI service from showing up for a long time), but this is a great improvement.
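One common way to soften this kind of starvation is to let the services take turns instead of serving requests strictly in arrival order. The sketch below is only an illustration of that idea, not a description of what AGX does; the service names and the `round_robin` helper are hypothetical.

```python
from collections import deque

def round_robin(requests):
    """Reorder (service, tile) requests so that each service takes turns.

    `requests` is a list of (service_name, tile_id) pairs in arrival order.
    """
    queues = {}
    for service, tile in requests:
        queues.setdefault(service, deque()).append(tile)

    ordered = []
    while queues:
        # Iterate over a copy of the keys so that drained queues can be removed.
        for service in list(queues):
            ordered.append((service, queues[service].popleft()))
            if not queues[service]:
                del queues[service]
    return ordered

if __name__ == "__main__":
    arrivals = [("wms", 1), ("wms", 2), ("wms", 3), ("esri", 1), ("esri", 2)]
    print(round_robin(arrivals))
```

With this ordering the first request to the fast service is issued second rather than fourth, so it no longer sits behind every request to the slow one.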

The order in which tiles are retrieved still seems suspect to me, with the world view first fetching Asia, the north pole, and Antarctica before the western hemisphere view which is the default. Grabbing multiple images at once compensates for the imperfect tile fetching strategy. And like before, shutting down AGX while it has several open HTTP connections is not pretty: it waits until they are timed out to shut down completely (even continuing to fetch new tiles while it tries). These flaws mean that you still need to be careful using any service which is slow or broken -- other imagery will get stuck behind it.
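For comparison, a simple fetch order that favours what the user is looking at is to sort tiles by distance from the centre of the view. This is a minimal sketch under my own assumptions (tiles addressed by hypothetical `(col, row)` indices on a flat grid, no wrap-around at the date line), not AGX's actual strategy.

```python
import math

def order_tiles(tiles, center):
    """Sort (col, row) tile indices by distance from the tile under the view centre."""
    cx, cy = center
    return sorted(tiles, key=lambda t: math.hypot(t[0] - cx, t[1] - cy))

if __name__ == "__main__":
    grid = [(col, row) for row in range(3) for col in range(3)]
    print(order_tiles(grid, center=(1, 1)))
```

The centre tile comes first and the corners last, so whatever is on screen fills in before the far edges of the world.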

More news as investigations continue. So far, so encouraging!

Wednesday, June 18, 2008

Drilling for new oil in the USA

So the debate over drilling within the US has opened up again. There will be lots of debate about environmental impact, which I don't feel sufficiently informed to comment on. But given the success Bush has had redefining US foreign policy as a contest to see who can be the biggest asshole (read: tough on terrorists), I suspect the debate will revolve once again around that old canard: "energy independence". The idea is that if we drill more at home, we'll be less dependent on foreign oil.

Let me present a slightly different viewpoint I don't hear discussed much in the media. Allow me a couple of asides:

First, the notion that oil drilled in the United States will be consumed in the United States is of course not entirely true. Oil produced in Alaska may be much more cheaply consumed in Japan than in New York. That's just geometry. The market will make that decision based on shipping distances, quality of crude, availability of refining, and a thousand other factors. What domestic drilling may do is increase the world's supply of oil a bit (or at least slightly arrest the decline of US production), and thereby place downward pressure on the price of oil. This seems like a laudable goal.

Second, you will hear a few people say that it's not worth drilling for more oil because the amount is tiny and would not be felt for many years. Just because something won't have an effect for 5 years doesn't mean it shouldn't be done. But it's not often appreciated how much oil we consume and how much impact US exploration might have. Of course the numbers are always open to debate, but take the two extremely massive oil fields recently discovered off the coast of Brazil. These are game-changing fields, some of the largest on the planet. They total roughly 10-20 billion barrels of oil equivalent. That's a lot, right? What if we found those hiding in the Gulf of Mexico? Even though we're pretty certain there's nothing that big out there, even so... These fields, if drained completely dry, would cover the US's oil consumption for less than three years. Three years. The numbers are staggering. Drilling activists like to suggest that there is a ton of oil in the ground that the tree-huggers are just hiding from us. That's true, but the amount of oil is not particularly significant if you look at it from a multi-decade point of view.
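The "less than three years" figure is easy to check on the back of an envelope. The sketch below takes US consumption as roughly 20 million barrels a day, which is an assumed round number rather than a figure from the text above, and treats a barrel of oil equivalent as a barrel of oil.

```python
# Assumed round figure for US oil consumption; not taken from the text above.
US_CONSUMPTION_BBL_PER_DAY = 20_000_000

def years_of_supply(barrels, per_day=US_CONSUMPTION_BBL_PER_DAY):
    """How many years `barrels` would last at a constant daily consumption."""
    return barrels / (per_day * 365)

if __name__ == "__main__":
    for size in (10e9, 20e9):
        print(f"{size / 1e9:.0f} billion barrels: {years_of_supply(size):.1f} years")
```

Under that assumption both ends of the 10-20 billion barrel range come out below three years.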

Which brings me to the point I was really trying to make. The point is actually very simple.

There's not a lot of oil left at reasonable prices given current consumption and growth trends. It's very hard to tell whether this price spike is the beginning of the end, or just a head fake, but the end is coming in our lifetimes. As any petroleum geologist will tell you, "the end" will arrive with lots of oil left in the ground, just too expensive for our means. When that end comes, when oil is $1000 per barrel, gasoline is rationed for national security reasons, when poorer countries without access to alternative energy technology are going to war to secure the oil they need to fertilize their crops so they can eat and drink, what situation do you want the US to be in? With major reserves already tapped to secure a few extra years of $4/gallon gasoline? Or with major reserves available within our borders to provide the fuel the army and navy require to secure peace in this dangerous world? Do you want the strategic oil reserve to have been run down to keep the cost of Summer road trips low? Or in place to ensure the smooth functioning of the military when Mexico, Venezuela, and Nigeria won't have any oil left to export? Do you want to have burned all the oil to light up Starbucks signs at night, or have some left over to maintain the crop yields which allow our nation to produce something the rest of the world thinks is worth buying?

Oil is running out. Aside from the obvious implication that we should be working on alternative sources of energy (and oil at $140/bbl is doing that much better than any Congressional plan would), it's not obvious to very many people that we should keep what's left for ourselves. Every nation should be looking towards energy security, just as they look for food security. Why should we pawn our future for a few years of fun in the present?

Friday, June 13, 2008

Greenspun Tenth Rule Redux: Vista is Terrible

I just got a new laptop with Vista 64 on it. In lieu of flowers, please send donations to the ACLU. But I digress.

It turns out you can't install Visual Studio 2005 from a mount point in Vista. You know, I thought it would be convenient to move on from the bad-idea-when-it-started-14-years-ago of always agreeing to map the same drive letter on each of my computers to the same actual share I maintain of software, music, etc. I thought, hey, Vista supports symbolic links. I'll map c:/users/public/software to \\myserver\public\software. And I'll install from there. No drive mappings to remember to add, remove, wait for Explorer to hang from. I'm afraid not. The VS 2005 install just fails. Why? How can it even possibly tell the difference?

Here's the kicker. Even when I mapped my F: drive like we used to do in the gay nineties, the install failed. The amazing part? The install log still complained about not being able to read from c:/users/public/software. Deep down in the system, it knew that F: was also mounted to that point on my C: drive! How could it be? Neither F: nor c:\users\public\software is the 'real' location of that share. Good lord. How does Microsoft wait 20 years to implement links and then get them wrong?
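As to how a program can tell the difference at all: any process may ask the operating system to resolve a path to where it really points. The sketch below illustrates that general mechanism with a throwaway symbolic link in a temporary directory. It says nothing about what the VS 2005 installer or Vista actually do internally, the directory names are hypothetical, and creating symlinks on Windows may require elevated privileges.

```python
import os
import tempfile

def same_location(path_a, path_b):
    """True if both paths resolve to the same place once links are followed."""
    return os.path.realpath(path_a) == os.path.realpath(path_b)

def demo():
    """Create a directory and a symlink to it; compare them as text and as resolved paths."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "share")      # hypothetical 'real' location
        link = os.path.join(root, "software")     # hypothetical link pointing at it
        os.mkdir(target)
        os.symlink(target, link, target_is_directory=True)
        return link != target, same_location(link, target)

if __name__ == "__main__":
    differ_as_text, same_when_resolved = demo()
    print(differ_as_text, same_when_resolved)
```

The two paths differ as strings but resolve to the same place, so a link is only transparent to software that never asks.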

It reminds me of Greenspun's tenth rule of programming:

Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified bug-ridden slow implementation of half of Common Lisp.

To which I should add: Any sufficiently mature operating system contains an ad-hoc informally-specified bug-ridden implementation of half of Unix. And in the great spirit of Raymond Chen, pre-emptive snarky comment: yes, that probably applies to most Unix implementations.

Monday, May 05, 2008

SDE 9.2's ST_GEOMETRY: Part Two, The Empire Strikes Back

I've been investigating ESRI SDE's ST_GEOMETRY support and performance on some simple sample data sets. I'm embarrassed to say I don't recall where they came from, but they are all US counties (about 3,000) and all US ZIP codes (about 30,000). My initial reactions were excitement at the speed of spatial joins on small datasets, disappointment at slow performance on large datasets, followed by theorizing about the cost of out-of-process calls.

Well, in these things it's not the journey, it's the destination. I thought I'd better compare to Oracle Spatial's SDO_GEOMETRY to make sure I was comparing apples to apples, as it were. The results are interesting.

I'm trying two simple tests of throughput, one a spatial join which selects all the ZIP codes which overlap a given (semi-random) set of counties, the other a raw select which forces all geometries to be converted to WKB format for consumption by a putative 3rd party tool.

For Oracle

    select count(*)
      from ozip zi
      join ocounty co
        on (sdo_relate(zi.shape, co.shape, 'mask=ANYINTERACT') = 'TRUE')
     where co.objectid between 1200 and 1500;

    select sum(dbms_lob.getlength(sdo_util.to_wkbgeometry(shape)))
      from ozip;

For SDE

    select count(*)
      from zip zi
      join county co
        on (st_intersects(zi.shape, co.shape) = 1)
     where co.objectid between 1200 and 1500;

    select sum(dbms_lob.getlength(sde.st_asbinary(shape)))
      from zip;

What should the baseline be? I figure it's the existing geoprocessing and rendering tools from ESRI. So for the spatial join I just used the Intersect toolbox on the feature classes with the 'objectid between 1200 and 1500' as the definition query for counties, as above. For raw rendering, I simply averaged 3 successive refreshes of a map full of all 30,000 zip codes. This makes the baseline look even slower than it really is, as it also times rendering and context switches. SDE's fast, people; we knew that.

The somewhat surprising results are as follows (all times in seconds).

Operation   SDE/SDO   SDE/ST   ST_GEOMETRY   SDO_GEOMETRY
Join        25        25       6.0           ?
Scan        15        5        35            ∞

I got tired of waiting after 15 minutes for Oracle to convert its 30,000 zip codes to WKB, so I gave it a score of ∞. All queries appeared to be CPU bound, which makes sense as the entire data set fits into the memory of even this old laptop.

Update 2008-May-5: In an amusing and ironic twist, Paul Ramsey has notified me that Oracle doesn't let you release benchmark results without making sure they've been dolled up and faked by their sales engineers, er, excuse me, cleared by their legal department. I've removed some of the actual figures from the charts above. Infinity's hard to hide in a closet, even with slick marketing.

The lessons are certainly mixed.

Joins. For the kinds of in-SQL-on-the-fly joins that make geometry data types tempting, SDO_GEOMETRY might be a clear winner (better ask Oracle legal) but not by a massive amount (?? seconds versus 6.0 seconds, making ST_GEOMETRY ??% slower). Perhaps this is because of the process switches endured by implementing ST_GEOMETRY in st_shapelib, perhaps not. Perhaps it is the difference in indexing schemes. More complex and varied tests would need to be done. For back-of-the-envelope estimates, they're roughly equivalent. For most applications, if 10 seconds is acceptable, so is ??. ST_GEOMETRY and SDO_GEOMETRY are both certainly far smarter than the Intersect toolbox, which spends most of its time querying more data than it needs.

Scans. If you want to grab data from these systems and process them using open standards, you lose either way. With ST_GEOMETRY, you can do the work using ST_ASBINARY, but the performance is unimpressive: roughly 1000 shapes per second. (Compare that with 6000 per second for SDE querying ST_GEOMETRY itself and ArcMap rendering the results.) SDO_GEOMETRY goes completely to lunch; they're not taking WKB very seriously. My 3:1 showing for SDE rendering of ST_GEOMETRY versus SDO_GEOMETRY layers throws a lot of these numbers into question. Would ESRI really let their Oracle Spatial implementation be that much slower?
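The per-second rates quoted here follow from the scan timings. The sketch assumes the Scan row reads 15, 5 and 35 seconds (SDE over SDO_GEOMETRY, SDE over ST_GEOMETRY, and ST_ASBINARY respectively) and takes the ZIP layer as exactly 30,000 shapes, where the text only says "about".

```python
SHAPES = 30_000  # "about 30,000" ZIP codes; exact count assumed for the arithmetic

def shapes_per_second(seconds, shapes=SHAPES):
    """Throughput of a full scan that took `seconds` to complete."""
    return shapes / seconds

if __name__ == "__main__":
    print(f"ST_ASBINARY scan:   {shapes_per_second(35):.0f} shapes/s")
    print(f"SDE on ST_GEOMETRY: {shapes_per_second(5):.0f} shapes/s")
    print(f"SDE SDO:ST ratio:   {15 / 5:.0f}:1")
```

On those assumptions the WKB scan works out nearer 860 shapes per second than 1000, which is still "roughly 1000" for back-of-the-envelope purposes.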

My provisional conclusion is that ST_GEOMETRY holds promise for spatial SQL, as much as SDO_GEOMETRY, though it probably needs more tuning from ESRI's side. No one is going to be writing useful GIS tools which use the WKB/WKT forms of these geometries anytime soon. If you want fast scanning of data, you've got to get under the covers and read the data natively. That is, I suppose, the next experiment. I'd like to know how easy it will be to read ST_GEOMETRY data natively to .NET, where my particular bread is buttered. More to come.
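For a sense of what the WKB form involves, here is a minimal sketch that encodes and decodes the simplest case, a 2D point, following the OGC well-known binary layout: one byte-order byte, a 32-bit geometry type, then the coordinates as doubles. The ZIP code shapes are polygons, which nest rings of such coordinates, so this shows only the flavour of the format. The helper names are hypothetical, and none of this touches the native ST_GEOMETRY storage.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in OGC well-known binary

def encode_point(x, y):
    """Build little-endian WKB for a 2D point (byte-order flag 1)."""
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)

def decode_point(wkb):
    """Decode a 2D WKB point, honouring the byte-order flag in the first byte."""
    order = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack_from(order + "Idd", wkb, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB point: type {geom_type}")
    return x, y

if __name__ == "__main__":
    blob = encode_point(-122.4, 37.8)
    print(len(blob), decode_point(blob))
```

A point is 21 bytes, and decoding it is a single `struct` call, so in principle the format itself need not be the bottleneck in a scan.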