This page will record thoughts on databases. It’s not for tracking particular databases, but for thinking about the general problems they have.
SQL is a mishmash. It’s inconsistent. It’s badly designed as a language. And it’s not complete. Any good database should completely replace it with something that’s better. The db can still offer an SQL compatibility layer mapping onto the underlying language.
Tutorial D is a possible way out of the mess
Tutorial D syntax
Tutorial BNF
You need more than one special value. Null is not good enough. There’s missing. There’s unknown. There’s inapplicable. There’s 0/0 - i.e NAN. And more. You need flexible rules on what to do with them.
See:
http://www.dbdebunk.com/page/page/2296478.htm
http://use.perl.org/~Ovid/journal/27927
http://www.dbdebunk.com/page/page/1396241.htm - note to self: was the earlier version public access on his website?
http://www.dbdebunk.com/page/page/2760386.htm
http://web.onetel.com/~hughdarwen/TheThirdManifesto/Missing-info-without-nulls.pdf
http://thethirdmanifesto.com/
You need triggers. Tables could be thought of as objects with triggers as part of the object. In most database the idea of message flow has no place, but if you have action rules for tables then ordering and message flows play a part.
There’s an issue that isn’t really well-resolved; where do you draw the line between integrity relationships and business logic, and where do you place logic.
Database vendors want the all the business logic in the database. But this means that you end up tied to the vendor.
On the other hand, putting the logic in software means that the database is incomplete without the software.
Perhaps what is needed are mechanisms for external binding on events, plus a formal mechanism for packaging external software.
Whichever way it’s handled, for large systems, you definitely need ways of partitioning logic into modules. I’d imagine that the modules would be attached to tables. But you might need a formal way to define connections and traverse them. So if a person had an address and they were stored separately, the business logic for addresses might check for the coherence of the zip/postal code and the address. Or the logic might generate the address from the zip code. A query on the address should automatically trigger the address logic.
What happens if you get errors during the triggering? For instance I select all addresses and one of 10,000 has a zip which doesn’t match the address. Does the whole query fail? Does the query succeed with provisos? This can lead to trouble: what if the query was an update? Do we update except for the failures? Atomic transactions gone when you update salary to salary + 100, and one of the salaries is undefined because it’s below the legal wage threshold?
Most databases have no way of dealing with time. Maybe versioning is part of this.
Things move.
Tables should be able to hold rows of 50,000 values.
Or rather, the width of a table should not be contrained.
A few years ago, I worked on a problem where the internal structure of the data was unknown at the time of aquisition. So on a particular run of data aquisition you could get 10 records of 500 items, or 5,000 records of 8,000 items. Or anything in between. The data did have an internal structure, but that would only become apparent later. So, for instance, the record of 500 items might be a block of 100 describing test measurement situation, and then 40 records of 10 values of measurements. You can see that to store in a relational database, you have to know about internal structure of the data. So an ideal RDB would let you restructure the data. Ideally in place.
Databases should have a notion of type, and should allow extensibility.
Some notes:
If you are allowed to define new types, you should be able to define new operators. To allow the database to interoperate with the new types you will need to be able to define a minimum number of operations. For instance being able to affirm that a value is considered to be unique would be useful it would allow the not equal operation to be implemented for that type. Separately, an identity operator might provide a useful implentation shortcut. The ability to indicate the equivalance test operator, (a == b) or (a === b) or similar, would be useful since equivalence and two items holding identical values are not necessarily the same.
They should understand objects which are composed of parts of tables. They should include a formal model to describe the relationships between tables that the system knows about and the objects that they represent.
Referential integrity is A Good Thing. It also makes databases bad at adapting to the world. Different strategies are needed to deal with this. Sometimes you know someones address before you know their name - you need placeholder values that say ‘This data isn’t known right now, but it’s valid to work as if it is’.
Databases are very often (almost always?) used together with software. It’s the software’s job to take data from the user and store it, and then later to retrieve it and display it or perform some other function. The object-relational impedance is always an issue when software works with a database. We have worked out ways of dealing with this problem. However there’s another important related problem: flat-tabularised data... When writing software you often see data with summary data interspersed. For instance if we were tracking consumer spending habits we might have records for the household and records for the individuals in the household. If we displayed the data we’d want to group the individuals in the household together, so on the screen we might show:
Household 1 - Smith\\ Joe Smith - Adult - spending in 2005: £20,000\\ Samantha Smith - Adult - spending in 2005: £41,000\\
Household 2 - Li\\ Jill Carpenter - Adult - spending in 2005: £12,000\\ Milly Li - Adult - spending in 2005: £17,000\\
An interesting thing about this data is that each household item has subitems. This does not fit nicely into a flat two dimensional table. Instead you need to have a table of people in each household table data item. Standard SQL doesn’t do this nicely; you’d have to have redundant data.
A second difficulty with this example is that you have info displayed which crosses the boundaries between household and individual: the main name associated with the household. In the example the main name might be stored separately from the surnames the individuals in the household, in which case there will be issues of database consistency when members of the household change (perhaps the household is a commune and the person whose surname you have used as the household name moves out). Alternatively, you have to be able to treat calculated summary values as first-class database citizens. In the example, the household name might be calculated with reference to the highest 2005 spending.
There are two common ways of software talking to a database in order to retrieve the data: 1. Query the database once and produce a flat table with redundant data. 2. Query the database multiple times, but each time produce the data you need
There is no common way to query the database once without reduncy.
For 1, you would produce a table like:
Household 1, Smith, Joe Smith, Adult, £20,000 Household 1, Smith, Samantha Smith, Adult, £40,000 Household 2, Li, Jill Carpenter, Adult, £12,000 Household 2, Li, Milly Li, Adult, £17,000
then you would programatically extract only the items you want to display the data in the form above.
For 2, you would do something like this
Produce a table of all households
For each household
{
Print the overall household details
Produce a list of household members and print them
}
We need a third way:
Produce a table of households with subtables of members
For each household
{
Print the overall household details
Print the household members subtable
}
An interesting side effect would be that we should be able to order by items in the subtables.