Here's an interesting post on design practices for building massively-scalable apps on database infrastructure such as Google BigTable.
The takeaway: this ain't your granpappy's old relational database system, so throw out everything he taught you. Denormalize. Prefer big fluffy things to small granular things. Don't bother with DB constraints - enforce the model in the application. Prefer small frequent updates to large page updates.
The good news (or bad, depending on how fed up you are with your local DBA) - don't bother with all this unless you intend to scale to millions of users.
Showing posts with label database. Show all posts
Showing posts with label database. Show all posts
Tuesday, June 24, 2008
Thursday, May 1, 2008
Database Architecture monograph
In the era of cloud computing, map/reduce, dynamic languages and the semantic web, the stalwart Relational Database Management System is looking a bit fusty. But RDBMSes were perhaps the earliest widely-deployed example of many of the techniques of distributed computing, concurrent programming, and query optimization that are still highly relevant.
Hellerstein, Stonebraker and Hamilton have published a monograph on Architecture of a Database System. With names like that involved you'd expect high quality and some deep insight, and the article delivers. It's a good, accesible summary of the state-of-the-art in RDBMS technology.
Hellerstein, Stonebraker and Hamilton have published a monograph on Architecture of a Database System. With names like that involved you'd expect high quality and some deep insight, and the article delivers. It's a good, accesible summary of the state-of-the-art in RDBMS technology.
Labels:
computation,
database
JAQL pegs the cool technology mashup meter
JAQL is a query language and engine which has XQuery-like syntax, SQL-like operators, JSON as a native data format, and runs using the Hadoop ma/preduce framework. (Although not mentioned explicitly, it's probably great for social networking as well...)

You might think that JAQL and JEQL were separated at birth, but they actually have no genetic material in common. But it's interesting to see the J*QL acronym space being rapidly populated. The best two vowels are now gone - who's going to be next to pile in?
You might think that JAQL and JEQL were separated at birth, but they actually have no genetic material in common. But it's interesting to see the J*QL acronym space being rapidly populated. The best two vowels are now gone - who's going to be next to pile in?
Labels:
computation,
database
Wednesday, January 30, 2008
The End of an Architectural Era?
Stonebraker et al have an interesting paper pointing out the antiquity and consequent limitations of the classic relational database architecture in today's world of massive disk/cycles/core.
If they are correct (and in spite of the recent MapReduce blunder Stonebraker has made a lot of great calls in the DB world), the world of data management is going to get awfully interesting in the coming years. The DBA's & DA's of this world have been living a relatively comfortable existence compared to those who are wandering across the stormy badlands of the middle tier. But Stonebraker postulates at least 5 radically variant database architectures to address specific use cases of data management. This would seem to lead to a much more complex world for data architects. But maybe a windfall for the consultants who find their niche?
If they are correct (and in spite of the recent MapReduce blunder Stonebraker has made a lot of great calls in the DB world), the world of data management is going to get awfully interesting in the coming years. The DBA's & DA's of this world have been living a relatively comfortable existence compared to those who are wandering across the stormy badlands of the middle tier. But Stonebraker postulates at least 5 radically variant database architectures to address specific use cases of data management. This would seem to lead to a much more complex world for data architects. But maybe a windfall for the consultants who find their niche?
Labels:
computation,
database
Thursday, September 13, 2007
Geodetic data in PostGIS - Spherical indexing schemes
Here at Refractions we're starting to think about how we can provide better support in PostGIS for geodetic data. Geodetic data is data which is defined in a true spherical coordinate system (in particular, of course, the surface of the Earth).
Currently PostGIS provides some geodetic-aware functions (such as distance between two geodetic points). But it's current spatial data model is fundamentally planar, so there are definite limitations to modelling geodetic data (e.g. such as the notorious "line crossing the Date Line" problem). As PostGIS gets used for larger datasets and more ambitious projects, the utility of having a full-function geodetic data model is becoming increasingly obvious.
Handling geodetic data in a correct and efficient way presents quite a few challenges. A major one is: how can geodetic geometry be spatially indexed? Conventional spatial indexes (such as 2D R-trees) all rely on geometry being embedded in a planar space. They don't handle data which can "wrap around", as can occur in a spherical space.
There have been various clever proposals for spherical indexing strategies. Some prominent ones are listed below:
Currently PostGIS provides some geodetic-aware functions (such as distance between two geodetic points). But it's current spatial data model is fundamentally planar, so there are definite limitations to modelling geodetic data (e.g. such as the notorious "line crossing the Date Line" problem). As PostGIS gets used for larger datasets and more ambitious projects, the utility of having a full-function geodetic data model is becoming increasingly obvious.
Handling geodetic data in a correct and efficient way presents quite a few challenges. A major one is: how can geodetic geometry be spatially indexed? Conventional spatial indexes (such as 2D R-trees) all rely on geometry being embedded in a planar space. They don't handle data which can "wrap around", as can occur in a spherical space.
There have been various clever proposals for spherical indexing strategies. Some prominent ones are listed below:
- Hierarchical Triangular Mesh - this is essentially a "quad-tree for the sphere". It has a lot of appeal for use with point data, since it provides a hierarchical key which can be indexed using a conventional B-tree index. (It was co-developed by the late, great Jim Gray in order to index astronomical data). The mathematics to determine the index key for a non-point object would seem to be somewhat complicated. It also seems like HTM would suffer from the usual disadvantage of quadtrees of not being very self-tuning. Another disadvantage from the PostGIS point of view is that this would likely be a brand new index type (i.e. lots of difficult code to write)

- Hipparchus Voronoi-based index. This index can be thought of as a fixed-grid index using a custom Voronoi cell coverage for the globe. IBM's DB/2 Geodetic extension uses this scheme. I must say that this concept, while ingenious, seems a bit baroque to me. This index has the usual disadvantage of fixed-grid indexes of not handling widely-varying data sizes well. And it also requires extremely complex cell coverage structures, which have to be selected specifically for the expected data composition. DB/2 supplies 13 different ones based on various data densities (G7 industrial output, anyone?). I'm not sure what you are supposed to do if your data has some other density distribution - it doesn't seem very feasible to make your own Voronoi grid. And what if you don't know your data distribution, or it changes over time?

- 3D Bounding Box - this is the approach used by the pgSphere project. It's pretty straightforward. The key concept is to embed the sphere in 3-space, so that it is possible to compute 3D bounding boxes for geometries on the embedded sphere. The bounding boxes can then be indexed using a 3D R-tree (exactly analogous to a 2D R-tree spatial index). The GIST index supplied with PostgreSQL can be customized to provide the required 3D R-tree. One possible issue is that apparently R-trees become "less effective" in higher-dimensional spaces. It remains to be seen whether this is truly a serious problem.
Labels:
computational geometry,
database,
geospatial,
postgis
Tuesday, September 4, 2007
Grokking hierarchical queries in Oracle
Oracle provides a very powerful SQL extension to evaluate hierarchical queries (transitive closures) on tables. This recently saved my bacon in a system which processed a table modelling a tree containing several hundred thousand rows. The alternative would have been to do some very ugly and inefficient iterative querying.
There's a few tutorials about Oracle hierarchical queries on the Web, but I didn't find any of them gave me a good mental model of how these queries are evaluated. In particular, they didn't really help me in figuring out how to traverse a tree structure either upwards or downwards. So here's my attempt at explaining some patterns for using hierarchical queries. (This isn't a complete tutorial - for that check the Oracle 10g documentation)
For modelling tree-structured data, the usual pattern is to have a table with id and parent_id columns. In a hierarchical query you may wish to traverse the tree either upwards (towards the root(s)) or downwards (towards the leaves). But what query syntax should be used to produce the desire direction?
The general syntax for CONNECT BY is:
CONNECT BY [PRIOR] col1 = [PRIOR] col2
A further constraint is that the PRIOR keyword can appear once only, but it can appear on either side of the equality condition. Since the equality condition is symmetric, there are really only two possibilities:
1. The result rowset is initialized with the rows determined by the START WITH clause
2. All rows for which the CONNECT_BY condition is true are added to the result rowset. The PRIOR keyword determines which one of the condition expressions is evaluated in the context of the result rowset rows.
3. Step 2 is repeated until no further rows match.
Formally, this procedure computes the transitive closure of the relation defined by the initial START WITH set and the CONNECT BY condition.
Given this evaluation rule, we obtain the following rule-of-thumb for writing a query to traverse in a desired direction:
There's a few tutorials about Oracle hierarchical queries on the Web, but I didn't find any of them gave me a good mental model of how these queries are evaluated. In particular, they didn't really help me in figuring out how to traverse a tree structure either upwards or downwards. So here's my attempt at explaining some patterns for using hierarchical queries. (This isn't a complete tutorial - for that check the Oracle 10g documentation)
For modelling tree-structured data, the usual pattern is to have a table with id and parent_id columns. In a hierarchical query you may wish to traverse the tree either upwards (towards the root(s)) or downwards (towards the leaves). But what query syntax should be used to produce the desire direction?
The general syntax for CONNECT BY is:
CONNECT BY [PRIOR] col1 = [PRIOR] col2
A further constraint is that the PRIOR keyword can appear once only, but it can appear on either side of the equality condition. Since the equality condition is symmetric, there are really only two possibilities:
- CONNECT BY PRIOR parent_id = id (which is equivalent to CONNECT BY id = PRIOR parent_id)
- CONNECT BY PRIOR id = parent_id (which is equivalent to CONNECT BY parent_id = PRIOR id)
1. The result rowset is initialized with the rows determined by the START WITH clause
2. All rows for which the CONNECT_BY condition is true are added to the result rowset. The PRIOR keyword determines which one of the condition expressions is evaluated in the context of the result rowset rows.
3. Step 2 is repeated until no further rows match.
Formally, this procedure computes the transitive closure of the relation defined by the initial START WITH set and the CONNECT BY condition.
Given this evaluation rule, we obtain the following rule-of-thumb for writing a query to traverse in a desired direction:
- To traverse upwards, use form #1
- To traverse downwards, use form #2
Labels:
database
Monday, July 9, 2007
Relations - the machine code for data?
In the rushing torrent of computer science, a few concepts stand like rocks which resist being swept away. One of the most familiar and useful is relational database technology. It took a relatively short time for this concept to be identified (in Codd's paper of 1970) and implemented commercially (around 1975, in IBM's System R, among others). Perhaps only 10 or 15 years elapsed between the earliest database systems and the emergence of commercially viable RDBMS . For over 30 years since then RDBMS's have stood the test of time and repelled all challengers to their central role in data management.
A recent and serious challenger was Object database technology. In the 1990's this area looked poised to ride the gathering wave of object-oriented programming and sweep aside the fusty old RDBMS technology. Since then, while the OO wave has swept far up the beach, the promise of ODBMS seems to have been left behind in the receding foam, never having been fully realized.
So why is this?
I think one reason is that the relational paradigm could be a fundamental data structure for information representation. In a sense, relations are the machine code of data modelling - they can be used to represent any required data model. Granted, relational models aren't alway elegant or efficient - but that simply confirms the metaphor. Moreover, their simplicity allows relational theory to be firmly grounded in mathematics and logic. (Of course, the relational hard-cores have been saying this for years - in some cases a bit too vociferously, IMHO. I'm just reluctantly coming round to think that they might be right.)
A further piece of evidence for this observation is that RDBMS vendors have a fine track record of being able to adapt the core paradigm to accomodate new technical ideas. Queuing, security models, data analytics, and of course even (to an extent) object-orientation are some of the areas which have been implemented quite reasonably in terms of an underlying relational model.
ODBMS's, in contrast, have never quite seemed to develop a truly compelling and general paradigm. The various OO query languages don't seem to have the expressive power of SQL, and Object data models seem to run up against thorny semantic issues sooner than relational ones do (such as schema evolution and inheritance modelling). Granted, implementing an object-based data model using relational technology doesn't actually solve any of these issues, but somehow the maturity of relational techniques and tools make them seem less painful.
Some people (myself included) have tended to see the rise of Object-Relational mapping technology as at best a temporary expedient, and at worst a string-and-baling-wire solution to cope with the OO-RDB impedance mismatch. Just give ODB's a few more years to mature, we hope, and we can do away with this inelegant hack! But if relational really is a fundamental concept, perhaps the quest for pure OO databases is misguided. Perhaps O-R mapping tools are the final goal, not just a stop-gap. They can be thought of as compilers for the data access layer (with all the opportunity for performance improvement and platform independence that the compilation paradigm provides).
This isn't to say that RDB technology has evolved as far as it needs to. I hope to see much more powerful querying capability, using concepts from the areas of logic and functional programming, machine reasoning and knowledge representation. All of these seem to be continuing to mature - I look forward to seeing them make the leap to practicality like OO did years ago.
A recent and serious challenger was Object database technology. In the 1990's this area looked poised to ride the gathering wave of object-oriented programming and sweep aside the fusty old RDBMS technology. Since then, while the OO wave has swept far up the beach, the promise of ODBMS seems to have been left behind in the receding foam, never having been fully realized.
So why is this?
I think one reason is that the relational paradigm could be a fundamental data structure for information representation. In a sense, relations are the machine code of data modelling - they can be used to represent any required data model. Granted, relational models aren't alway elegant or efficient - but that simply confirms the metaphor. Moreover, their simplicity allows relational theory to be firmly grounded in mathematics and logic. (Of course, the relational hard-cores have been saying this for years - in some cases a bit too vociferously, IMHO. I'm just reluctantly coming round to think that they might be right.)
A further piece of evidence for this observation is that RDBMS vendors have a fine track record of being able to adapt the core paradigm to accomodate new technical ideas. Queuing, security models, data analytics, and of course even (to an extent) object-orientation are some of the areas which have been implemented quite reasonably in terms of an underlying relational model.
ODBMS's, in contrast, have never quite seemed to develop a truly compelling and general paradigm. The various OO query languages don't seem to have the expressive power of SQL, and Object data models seem to run up against thorny semantic issues sooner than relational ones do (such as schema evolution and inheritance modelling). Granted, implementing an object-based data model using relational technology doesn't actually solve any of these issues, but somehow the maturity of relational techniques and tools make them seem less painful.
Some people (myself included) have tended to see the rise of Object-Relational mapping technology as at best a temporary expedient, and at worst a string-and-baling-wire solution to cope with the OO-RDB impedance mismatch. Just give ODB's a few more years to mature, we hope, and we can do away with this inelegant hack! But if relational really is a fundamental concept, perhaps the quest for pure OO databases is misguided. Perhaps O-R mapping tools are the final goal, not just a stop-gap. They can be thought of as compilers for the data access layer (with all the opportunity for performance improvement and platform independence that the compilation paradigm provides).
This isn't to say that RDB technology has evolved as far as it needs to. I hope to see much more powerful querying capability, using concepts from the areas of logic and functional programming, machine reasoning and knowledge representation. All of these seem to be continuing to mature - I look forward to seeing them make the leap to practicality like OO did years ago.
Labels:
computation,
database
Subscribe to:
Posts (Atom)