[Novalug] Subject: Re: NoSQL databases (and Cassandra)

Peter Larsen plarsen at famlarsen.homelinux.com
Mon Aug 30 11:51:58 EDT 2010


On Mon, 2010-08-30 at 10:53 -0400, James Ewing Cottrell 3rd wrote:
> On 8/29/2010 11:32 PM, Jim Ide wrote:
> > What makes me nervous about NoSQL databases is that they lack
> > a schema to declaratively force/enforce data validation.
> > An RDBMS like Oracle or MySQL does this for you - in a NoSQL
> > database, data is stored as a string (usually in JSON format),
> > and the programmer has to write code to enforce data validity.
> > Is there a JSON schema language and validator like XML has?
> >
> > Am I missing something?
> Well, using Oracle (or even MySQL)  just to get data validation would 
> certainly be Overkill.
> 
> And you may need to write validators anyway. SSNs, Usernames, and 
> Telephone Numbers may all be stored as Strings,
> but they have difference regex constraints.

ARGH! 
Them are fighting words!!

If all you want is validation there are probably smaller databases than
Oracle/DB2, agreed. I wouldn't put MySQL in that class though. But when
it comes to WHERE to do validation - database or the applications, I
must beg to differ here.

First off, you must remember a database is usually a central unit used
by not only multiple functions within the same application but a lot of
times used by multiple different applications too. So you end up with
the eternal question: where to place the business logic - in the
applications or in the database?  Well, my answer will ALWAYS be: the
database. Let it do what it's good at. Let the application do what it's
good at. And Lets NOT do our work over and over again, and risk
inconsistent and corrupt data by implementing the same rules differently
in different places.  I've had enough of my share of migrating away from
legacy systems that didn't have central enforcement to know what happens
to your database when you don't centrally manage data validity.

We have different types of constraints (validation rules if you so
wish): check, unique, primary key and foreign key. A check constraint is
usually used to validate a single column or the relation between two
columns. Ie. that a from-date is earlier than a to-date; that a given
value is present or that a given format is followed. We also have
data-types to help enforce part of this question - with Oracle you can
even create your own data types (objects) and use those to represent
complex data-types. A unique key is only allowing unique column(s)
values while ignoring null values, and a primary key does the same but
does not allow null values (when creating a primary key a not-null
constraint is added). And finally a foreign key enforces that a value in
a column(s) is present as a primary key instance in another table. 

4 very basic rules that should NEVER EVER be broken. So why would you
keep redefining them in the application? In particular when it comes to
referential integrity is the database by far better at the job than any
application would be.

Domain validation such as SSN and telephone numbers comes back to a
misunderstanding (In my opinion) that data representation inside the DB
should look like the output. For instance a phone number is stored with
( ) and -, and SSN with - too. When in reality they are just numbers.
Get rid of the dashes and 1) you save space, 2) validation is easier.
Now there are cases where it's not that simple, and regex is definitely
a good way to create easy to enforce validity patterns. And they can
just as easily be used as part of a check constraint as they can inside
an application. Ask yourself if a SSN in a DB is only validated/used in
one place? Hardly. Definitely phone numbers follow that pattern too. So
we know there are SOME domains that are shared. Put them in the DB. And
why split our validation patterns around and put some in one place, and
others in another? No good - so let's define all primary business rules
at the DB level - it's good at number/data crunching - and let's keep
the application good at what it is good at, presentation and dataflow
control.

I've seen developers get confused when the database is in charge of
validation. And of course with modern applications we want to alert the
user about validity problems BEFORE we enter the data into the DB
(usually comes down to developers not knowing how to catch/interpret
database events). In "modern" (last 15 years) DB development
technologies we have better help. Because the DB uses a data-dictionary
to define the validation rules, it's available as metadata for the
application. This means, that the data model on the application layer
can replicate the validation rules without any coding at all. So as you
create your applications they inherit the rules and as a developer you
"simply" have to define the error messages that makes sense to a user so
they don't get the DB error code referring to the constraint being
violated. That's a great improvement over having to recode and share
validation rules across multiple applications and modules.

The worst that can happen is that bad data is entered into the DB. So
even if you don't follow a MVC design pattern you would want to get the
data validated before entered into the DB by the DB. It's a safety
catch. You can test/validate the DB independently of the
applications/functions so you know that your rules work and cannot be
violated. Once your DB becomes inconsistent you loose data and your
application stop working. With accounting systems I've had to tell the
board about loosing hundreds of thousands of dollars due to data
failures in their application. And it's usually traced down to a single
application that didn't implement the rule(s) in the same way as the
rest.

So to be clear: I'm not saying that data-validation only belongs in the
DB. It also needs to be (in part) replicated/used in the applications.
But it originates from the DB. Change the DB validation, and you should
be changing the applications too (happens automatically today).

And all of this doesn't even discuss stored procedures and their use.
Automatic triggers or independent APIs that are all located on the DB
instead of the application. I'm a firm believer in any business rule
that deals with getting/moving/checking data belongs in the DB. Business
rules that deals with presentation, formatting etc. belong at the
application layer. The application programmer should only need to do a
simple DB call to move money from one account to another, or to do an
advanced query into the DB. 

-- 
Best Regards
  Peter Larsen

Wise words of the day:
I could dance with you till the cows come home.  On second thought, I'd rather
dance with the cows till you come home.
		-- Groucho Marx
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
Url : http://calypso.tux.org/pipermail/novalug/attachments/20100830/22082211/attachment.bin 


More information about the Novalug mailing list