HTML Form Data Validation, the Pragmatic Way

Table of Contents

  1. Purpose
  2. Introduction
  3. On validation of the DATA
  4. In search of the Best Technical Solution
    1. Of FORMs layout, CSS and TABLEs (updated 12/08/2009)
    2. Of HTML Validation
    3. Of Portability
    4. Possible Solutions
  5. On Performance
  6. On Cost Effectiveness
  7. Conclusion

1. Purpose

Easy validation of user-input data for any FORM whose fields are constrained as "DATE", "REQUIRED", "EMAIL", "NUMERIC", "NOFOCUS", etc.

2. Introduction

This is not one more old chestnut on the subject ;-)

Okay : let's first state that such a subject has been debated over the years, and still is.
There are several sub-considerations in it, of which some are debatable, and some are not.
That topic touches many areas:

  1. FORM DATA validation in itself
  2. FORM layout
  3. HTML validation of the solution
  4. PORTABILITY(²) of the solution
  5. PERFORMANCE of the solution
  6. ROI(³) and Cost Effectiveness of the solution

(²) Portability : ability to be ported ; initially, to a different platform ; by extension ability to behave the same in different environments ; in our case, it means "cross-browser support", "W3C HTML standards compliance", "cross-platform support", "accessibility level" etc

(³) Return On Investment : in short, the possibility of future savings compared to the initial cost. For us, a good ROI means "the solution demanded as little effort as possible to put in place, and we can expect a lot of development time saved afterwards".

3. On validation of the DATA

  1. data validation overview
  2. validating data on the client
  3. validating data on the server

I guess everybody will agree that validating user-input data is absolutely necessary. Everybody will also probably agree that validating on both sides can't do any harm and will balance the validation load (CPU-time cost) fairly between the user (client browser) and site (server software).
Validating user-input data not only protects your server (as a whole) against malicious attacks (XSS, Active Scripting or ActiveX etc ad nauseam - see SANS, NIPC and FBI's "SANS Top-20 Internet Security Attack Targets (2006 Annual Update)" - ) but also protects your precious data, especially when they're stored in a database (SQL code injection, etc).
Validating data on the server before any action (writing, obeying commands, performing actions on filesystem etc) is a must. Validating data on the client before sending them to the server is a good practice but is not mandatory.
Validating data on the server ensures they are valid and come in the format expected, so that they're compatible with the forthcoming data handling process.
We will not cover that validation here, but again, it is mandatory. Don't even dare to write a FORM on your site if you don't strictly validate the data coming from it.
Validating data on the client usually serves as a first layer of protection, more against mistakes than against real abuse, and saves time in that you don't have to round-trip to the server to just display "you forgot to tell me your name before submitting the FORM" ;-)
That validation can be tedious to put in place. One FORM field is a date, another one has an allowed range, some are required and some are not... and you have to do all this in JavaScript. Solving this problem in an elegant way, so that you don't have to recode every FORM validation every time, is the goal of this article.
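
To give an idea of the tedium, here is a minimal sketch of the hand-written checks a single FORM can require. The field names (name, birthdate, quantity), the DD/MM/YYYY date format and the 1-10 range are assumptions made up for the illustration ; they come from no real form.

```javascript
// Hypothetical hand-written check for ONE specific form.
// `values` is assumed to be a plain object holding the raw field strings.
function validateOrderForm(values) {
  var errors = [];

  // required text field: reject empty or whitespace-only input
  if (!values.name || values.name.replace(/\s+/g, "") === "") {
    errors.push("Name is required");
  }

  // optional date field, expected as DD/MM/YYYY (format check only)
  if (values.birthdate && !/^\d{2}\/\d{2}\/\d{4}$/.test(values.birthdate)) {
    errors.push("Birth date must look like DD/MM/YYYY");
  }

  // required numeric field with an allowed range
  var qty = Number(values.quantity);
  if (values.quantity === undefined || values.quantity === "" || isNaN(qty)) {
    errors.push("Quantity is required and must be a number");
  } else if (qty < 1 || qty > 10) {
    errors.push("Quantity must be between 1 and 10");
  }

  return errors; // an empty array means the form may be submitted
}
```

Every new FORM would need its own variant of such a function, which is the repetition this article tries to get rid of.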

4. In search of the Best Technical Solution

4.1. Of FORMs layout, CSS and TABLEs...

FORMs philosophy

FORMs are used to present in an arranged way "fields" for the user to input data.
In this sense, it is no wonder that the (textual, displayed) name of the field is so closely linked to the actual INPUT field (named via the name attribute) that the two should appear neatly placed side by side.
Thus, no wonder people have historically used TABLEs to present such data.

In my humble opinion, this choice has nothing to do with "mixing content and presentation", in other words what "CSS zealots" see as the problem of "separating data and layout".
FORMs have an inherent layout constraint or else they would lose all their meaning. HTML 4 allows a perfect match between FORM and TABLE layouts.
As a consequence, the so-called "TABLEs versus CSS layout" issue has a deep impact on how FORMs are built. More on this later.
A bit of history now :
The HTML model for TABLEs was introduced in HTML 3.2 (1996) as an extension to HTML 2.0 (1995) ; in the same bunch of additions, you'll find also HTTP upload using FORMs (input type="file") and Java applets via APPLET.
The HTML 3.2 specification says

Tables
HTML 3.2 includes a widely deployed subset of the specification given in RFC 1942 and can be used to markup tabular material or for layout purposes. Note that the latter role typically causes problems when rending to speech or to text only user agents.

So it's crystal clear TABLEs are allowed for layout except for text-mode browsers (but some do support TABLEs in a way) and disabled users relying on speech to "read" a website (not that common, huh ?).
They don't even use LABEL, and for good reason : it isn't in the spec.

HTML 4.01 (Dec. 1997 to Dec. 1999) introduced "more multimedia options, scripting languages, style sheets, better printing facilities, and documents that are more accessible to users with disabilities".
HTML 4 introduced the THEAD, TBODY and TFOOT elements, and says

The HTML table model allows authors to arrange data -- text, preformatted text, images, links, forms, form fields, other tables, etc. -- into rows and columns of cells.

For impaired people, HTML 4 offers the TH element which allows that "Non-visual user agents such as speech synthesizers and Braille-based devices may use the following TD and TH element attributes to render table cells more intuitively". LABEL also appears to help further visually-impaired people to interact with FORMs.
This is to be used as "Some form controls automatically have labels associated with them (press buttons) while most do not (text fields, checkboxes and radio buttons, and menus)." (HTML 4 spec)
This is the kind of standard code as offered by the HTML 4.01 spec (I remind the reader that this is the latest HTML standard available) :


<FORM action="..." method="post">
<TABLE>
  <TR>
    <TD><LABEL for="fname">First Name</LABEL>
    <TD><INPUT type="text" name="firstname" id="fname">
  <TR>
    <TD><LABEL for="lname">Last Name</LABEL>
    <TD><INPUT type="text" name="lastname" id="lname">
</TABLE>
</FORM>

This example extends a previous example form to include LABEL elements.

 <FORM action="http://somesite.com/prog/adduser" method="post">
    <P>
    <LABEL for="firstname">First name: </LABEL>
              <INPUT type="text" id="firstname"><BR>
    <LABEL for="lastname">Last name: </LABEL>
              <INPUT type="text" id="lastname"><BR>
    <LABEL for="email">email: </LABEL>
              <INPUT type="text" id="email"><BR>
    <INPUT type="radio" name="sex" value="Male"> Male<BR>
    <INPUT type="radio" name="sex" value="Female"> Female<BR>
    <INPUT type="submit" value="Send"> <INPUT type="reset">
    </P>
 </FORM>
Note also that HTML 4.01 introduces FIELDSET and LEGEND, as in :

<FIELDSET>
  <LEGEND>Personal Information</LEGEND>
  Last Name: <INPUT name="personal_lastname" type="text" tabindex="1">
  First Name: <INPUT name="personal_firstname" type="text" tabindex="2">
  Address: <INPUT name="personal_address" type="text" tabindex="3">
  ...more personal information...
 </FIELDSET>


FIELDSET was clearly meant as a replacement for TABLE in FORM layout, but sadly is not correctly supported by all browsers. For instance, in Safari :
[screenshot: FIELDSET rendering in Safari]

CSS hype

Above, we saw that standard HTML 4 solutions without CSS layout do offer a widely accepted way of arranging FORM fields with TABLEs, even taking into account disabled people.

Today, CSS standards zealots criticize and ostracize pragmatic solutions that aim to be universally supported rather than "perfectly bigoted" :

(1) ...and the marmot wraps the chocolate in tinfoil (a French catchphrase for "yeah, right").

On the other hand, some Resistance is still active :

They probably understood better than the others the W3C's Techniques for Web Content Accessibility Guidelines 1.0 which state :

How to create tables that transform gracefully
Create a logical tab order through links, form controls, and objects.
Divide large blocks of information into more manageable groups where natural and appropriate.
but don't say "avoid TABLEs" ;-))


That said, CSS is a good thing. Style sheets enable you to apply a different layout to website contents, for example
- when using different media (graphics mode, text mode, print, mobile...)
- for respecting user preferences (for example accessibility for visually-impaired persons)
CSS also allows you to have "visual effects" that do not rely on JavaScript (or, worse, VBScript).

BUT CSS is not the Panacea of web design and has its own flaws and drawbacks.

CSS problems

CSS version 1 is widely supported in graphics-mode browsers ; text-mode browsers are a different story (Lynx, etc); CSS 1 didn't allow for proper TABLE layout.
CSS 2 is not supported exactly in the same way by all browsers (even only modern, state-of-the-art ones), but is nevertheless what you need for TABLE [and FORM] layout.
CSS 3 is generally not supported at all, and should be avoided for the time being.
The problems with CSS come from different causes :
- buggy implementation in browsers (see "bugs")
- non-standard implementation in browsers (see "bugs")
- non-standard "features" replacing standard ones
- rendering engine bugs (see "bugs")

The main area where such "bugs" occur is CSS positioning, which is the basis for CSS zealots who think all the layout should be done using CSS only. As a result, you may end up with completely broken layouts.
One example : [screenshot: a broken Skype layout]

More in this little broken CSS layouts collection

Some basic problems :
Doing something as simple as trying to replace a TABLE-based 3-column layout by a pure CSS layout leads to the following :
[screenshot: markup of a three-column CSS layout]
I count six nested DIVs before the contents start, while the original layout probably had a single TABLE structure with three TDs in a single TR...


<table width="100%"><tbody>
<tr><td width="100">...</td><td>...</td><td width="100">...</td></tr>
</tbody>
</table>

Is it worth the effort ? For the same result ? Are TABLEs such a mistake ?
What happens when you want a set of DIVs to take the remaining width on a line ?
You have to struggle with width, min-width, left, absolute positioning... and you can't say width="*" as you can with a TABLE... and you end up with a mix of fixed and variable widths that isn't particularly elegant, was very painful to design, and will not render the same in all browsers.
Had you used a TABLE, you would only have had to fix the width of the TD that needed it and leave the width of the remaining TD unset, while the table itself would have received width="100%"... very easy.

What happens when the user shrinks the viewport (reduces the size of the window) ? The TABLE tries to accommodate the remaining space, while the CSS layout gives up and is cropped...
Proof?

Some Q&As now :

it is clear in the HTML 4.x through XHTML standards that tables are not to be used for layout.

False. We saw above that neither standard says that, and moreover that there is a WAI-compatible way of using TABLES for FORM layout.

I think it can be just as easy to make the form using CSS rather than tables.

False. We saw above that CSS is a pain in the neck for things as simple as "take the rest of the page width" and "please shrink as much as you have to but don't line break" : DIVs can't do it easily, TABLEs are perfect for the task...

4.2. Of (X)HTML Validation

Having (X)HTML code that validates is always a good idea ; it may even be the goal to aim for. The problem is that the long-known idea of using custom attributes, allowed by the HTML standards, leads to non-validating pages.
To validate those pages, you have to provide a custom DTD. Your pages will then validate, except that the online W3C validator may fail to validate them as your DTD is not a "public" one.
As explained in this excellent A List Apart article, you simply can NOT have a solution that validates all the time, everywhere : it's either the browser - ie Tidy in Firefox - which fails (to comply with the "internal subset" declaration) or the W3C remote validator which fails on your ***local*** custom DTD, even correctly declared as "SYSTEM", because it simply won't follow the <!DOCTYPE directive correctly and sticks to its own standard DTDs. What a pity ;-)
Fortunately, I can demonstrate that with the very strict SGML Parser (the same as the W3C Validator), the Tidy extension validates the page when the DTD is written in a given way.
One important thing to know is the fact that no XHTML flavour allows the use of custom attributes. You'll have to use cheats or custom DTDs...
Using custom attributes is an excellent way to start separating behaviour from content.
We've successfully separated presentation and content by the use of CSS. Now it's time to take the next logical step.
The custom attributes technique is a good example. It's powerful, it's cross browser, it's simple, and it does not comply with the XHTML standard. Therefore I cannot follow the XHTML standard. We will have to expand the standard.
One more thing : it is possible to add custom attributes at the CSS "class" level, but they don't serve the same purpose and are not as easy to use and implement as HTML custom attributes. Moreover, we saw above that the fewer exotic things you do in CSS, the better your page will render on all browsers ;-) So let's not speak about that.
In short : standard HTML does support custom attributes, standard XHTML does NOT support custom attributes.
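
On the scripting side, custom attributes are read like any other attribute. The following is only a sketch, assuming an element object that exposes the standard DOM getAttribute method ; the function name readValidationRules and the shape of the returned object are hypothetical and are not taken from the library discussed later.

```javascript
// Sketch: collect the custom validation attributes of one form control.
// `element` is assumed to expose the DOM method getAttribute(name).
// In HTML documents attribute names are matched case-insensitively,
// so REQUIRED and required designate the same attribute.
function readValidationRules(element) {
  return {
    // per the DOM standard, getAttribute returns null when the attribute is absent;
    // a bare attribute such as REQUIRED is present, so the result is not null
    required: element.getAttribute("required") !== null,
    nofocus: element.getAttribute("nofocus") !== null,
    // fall back on the control's name when no FIELDNAME is given
    fieldname: element.getAttribute("fieldname") || element.getAttribute("name") || "",
    fieldtype: (element.getAttribute("fieldtype") || "").toUpperCase()
  };
}
```

Keeping the rules in the markup means the script itself never needs to know which FORM it is validating.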

Let's explore the various custom DTD writing possibilities :

4.2.1. The internal subset solution

Add this to your DOCTYPE declaration when starting (X)HTML output :


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"
[<!ATTLIST input required CDATA #IMPLIED>
<!ATTLIST input fieldname CDATA #IMPLIED>
<!ATTLIST input fieldtype CDATA #IMPLIED>
<!ATTLIST input nofocus CDATA #IMPLIED>
]>

[screenshot: internal subset results]
Alas, I tried many solutions (putting everything on a single line, for instance) but it doesn't change the results one iota : the output is ugly. You simply CAN NOT get rid of the "]>" displayed on screen !
The "SGML Parser" reports errors...
[screenshot: SGML parser on internal subset]
... and the "HTML Tidy" parser reports warnings :
[screenshot: Tidy HTML parser on internal subset]
It doesn't choke on the DOCTYPE declaration, but it doesn't apply it either :/

...BUT the W3C online validator is happy with it ! (that's normal, but it has to be noted).

So we have to apply the "custom DTD file" solution instead. Using an internal subset was elegant but doesn't validate in browsers (for the moment?)
Remember : we work on validating the page ; the page's FORM actually works nicely already !

4.2.2. The custom DTD file solution

In this case, you've to replace your DOCTYPE declaration with one like this :


<!DOCTYPE html SYSTEM "w:/www/extra/formvalidate/customLoose4.dtd">

The last part can be a relative path, an absolute path [a DOCUMENT_ROOT-based path], an absolute file reference [a local file system path], or a URI. With Tidy, I found out that a relative path (a) doesn't seem to be understood by the Tidy HTML parser, and (b) leads to this error when using the stricter SGML Parser :


Error: cannot open "%Application Data Path%\%app profile%\tidy/customLoose4.dtd" (No such file or directory)

This is not ***exactly*** what I was expecting...
And using an absolute web-based path does no good :


Error: cannot open "/extra/formvalidate/customLoose4.dtd" (No such file or directory)


So I hardcoded an absolute file system path. It shouldn't be that way, but I'm browser-constrained. Again, it should be a relative path, a URI or an absolute path based on the DOCUMENT_ROOT.

Let's explore now the two possibilities for building a file "customLoose4.dtd"

4.2.2.1. When adding attributes to the end of the file (legibility, maintainability)

First you have to get the standard DTD you want to adapt and save it on disk. In our case, the "loose" HTML4 version seems ideal.
This is the start of the file :


<!--
    VGR Apr 2008 : this is a modified custom version of :

    This is the HTML 4.0 Transitional DTD, which includes
    presentation attributes and elements that W3C expects to phase out
    as support for style sheets matures. Authors should use the Strict
    DTD when possible, but may use the Transitional DTD when support
    for presentation attribute and elements is required.

    HTML 4.0 includes mechanisms for style sheets, scripting,
    embedding objects, improved support for right to left and mixed
    direction text, and enhancements to forms for improved
    accessibility for people with disabilities.

          Draft: $Date: 1998/04/02 00:17:00 $

          Authors:
              Dave Raggett <dsr@w3.org>
              Arnaud Le Hors <lehors@w3.org>
              Ian Jacobs <ij@w3.org>

    Further information about HTML 4.0 is available at:

        http://www.w3.org/TR/REC-html40
-->
<!ENTITY % HTML.Version "-//W3C//DTD HTML 4.0 Transitional//EN"

and the file ends as :


<!ELEMENT HTML O O (%html.content;)    -- document root element -->
<!ATTLIST HTML
  %i18n;                               -- lang, dir --
  %version;
  >

The first idea is to append a section like that one :


<!--============= VGR Custom Attributes for TabBar Handling ============-->

<!ATTLIST table codigoTabBar CDATA #IMPLIED>
<!ATTLIST td cantidadTitulos CDATA #IMPLIED>
<!ATTLIST td tituloTab_0 CDATA #IMPLIED>
<!ATTLIST td tituloTabLink_0 CDATA #IMPLIED>
<!ATTLIST td imagenTituloTab CDATA #IMPLIED>
<!ATTLIST td miTarget CDATA #IMPLIED>
<!ATTLIST td enabledTab CDATA #IMPLIED>

It works (or should) but I found out that the HTML validators have a better time working when you add your custom attributes directly in the standard sections (elements). So let it be...

4.2.2.2. When adding attributes to the existing section of the DTD (it seems to work better ;-)

In the standard DTD file, search for "ATTLIST INPUT" (because we want to add custom attributes to INPUT tags, amongst other things).
The new section will look like this :


<!-- attribute name required for all but submit & reset -->
<!ELEMENT INPUT - O EMPTY              -- form control -->
<!ATTLIST INPUT
  %attrs;                              -- %coreattrs, %i18n, %events --
  type        %InputType;    TEXT      -- what kind of widget is needed --
  name        CDATA          #IMPLIED  -- submit as part of form --
  value       CDATA          #IMPLIED  -- required for radio and checkboxes --
  checked     (checked)      #IMPLIED  -- for radio buttons and check boxes --
  disabled    (disabled)     #IMPLIED  -- unavailable in this context --
  readonly    (readonly)     #IMPLIED  -- for text and passwd --
  size        CDATA          #IMPLIED  -- specific to each type of field --
  maxlength   NUMBER         #IMPLIED  -- max chars for text fields --
  src         %URI;          #IMPLIED  -- for fields with images --
  alt         CDATA          #IMPLIED  -- short description --
  tabindex    NUMBER         #IMPLIED  -- position in tabbing order --
  accesskey   %Character;    #IMPLIED  -- accessibility key character --
  onfocus     %Script;       #IMPLIED  -- the element got the focus --
  onblur      %Script;       #IMPLIED  -- the element lost the focus --
  onselect    %Script;       #IMPLIED  -- some text was selected --
  onchange    %Script;       #IMPLIED  -- the element value was changed --
  accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload --
  align       %IAlign;       #IMPLIED  -- vertical or horizontal alignment --

  required    (required) #IMPLIED  -- VGR16042008 ADDed --
  fieldname   CDATA #IMPLIED  -- VGR16042008 ADDed --
  fieldtype   CDATA #IMPLIED  -- VGR16042008 ADDed --
  nofocus     (nofocus) #IMPLIED  -- VGR16042008 ADDed --

  %reserved;                           -- reserved for possible future use --
  >

The four lines commented "VGR16042008 ADDed" are the portion I added. I also added two custom attributes in SELECT. And it works AND validates now ;-)

And then ***at last*** we get the expected results with the strict SGML Parser :
[screenshot: validated HTML]
Nota Bene : The Tidy "HTML Parser" still doesn't seem to understand our custom attributes, but as the stricter SGML Parser seems satisfied, I'm pretty confident.

Caveat : there seems to be a limit (60) in the SGML Parser on the maximum number of attributes an element can have. So if you add too many new attributes, you have to remove some unused ones... Or recompile the SGML Parser ;-)
In my case, I got rid of this :


usemap      %URI;          #IMPLIED  -- use client-side image map --

It means I will no longer be able to use "usemap" on an INPUT. I'm not too disturbed by that ;-))


4.3. Of Portability

Given the various discrepancies between browsers when it comes to supporting CSS, and the fact that ALL browsers correctly understand the basic TABLE tag, it should be crystal clear what the correct "FORM layout" solution is ;-))
In fact, given the faulty IE box model, we shouldn't even use DIVs with IE ;-)
For instance, HOW to safely transform a TABLE into a set of DIVs when the min-width CSS property is not correctly supported ?
A complete comparison chart for layout properties :
[chart: CSS layout properties support]
(from http://www.westciv.com/style_master/academy/browser_support/page_layout.html)


4.4. Possible Solutions

Now that our solution does validate, are there other solutions ?

Our solution :

Here is the description of the validation library that I propose (5920 bytes) :
Summary : you can add attributes on your FORM fields, like REQUIRED, NOFOCUS (will not receive focus after a validation error), FIELDNAME= (the name to display on the alert) and FIELDTYPE= with a lot of field types.


Code:

Example : <select name="ZipC" REQUIRED fieldtype="SELECT" fieldname="Zip Code">
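
To make the role of NOFOCUS and of the alert concrete, here is a sketch of what a submit-time loop could look like. It is NOT the code of the library : the function name validateOnSubmit, the { element, rules } shape of each entry and the two callbacks checkOne and notify are assumptions chosen for the illustration.

```javascript
// Sketch of a submit-time loop, stopping at the first invalid field.
// Each entry of `fields` is assumed to look like
//   { element: <form control>, rules: { nofocus: true or false } }
// `checkOne(field)` is a hypothetical callback returning an error message or null.
// `notify(message)` is a hypothetical callback, for instance a wrapper around alert().
function validateOnSubmit(fields, checkOne, notify) {
  for (var i = 0; i < fields.length; i++) {
    var field = fields[i];
    var message = checkOne(field);
    if (message !== null) {
      notify(message);
      // NOFOCUS: report the error but leave the focus where it is
      if (!field.rules.nofocus && field.element &&
          typeof field.element.focus === "function") {
        field.element.focus();
      }
      return false; // a false result from an onsubmit handler cancels the submission
    }
  }
  return true; // every field passed, the FORM can be sent
}
```

Passing the checker and the notifier as parameters keeps the loop independent of any particular FORM, and makes it possible to try it out without a browser.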

Library Synopsis :
//                    Possible attributes to add in tags (children of form tag):
//                      - REQUIRED: field is required and checks will be performed
//                      - NOFOCUS: if error found on field, no focus will be done
//                      - FIELDNAME="": name of field to be displayed in JavaScript alert if any error found
//                      - FIELDTYPE="SELECT": ensures a selection is made
//                      - FIELDTYPE="TEL": checks if entry is telephone (or null)
//                      - FIELDTYPE="EMAIL": checks if entry is email (or null)
//                      - FIELDTYPE="ALPHA": checks if entry is alpha (or null)
//                      - FIELDTYPE="ALPHANUMERIC": checks if entry is alphanumeric (or null)
//                      - FIELDTYPE="NUMERIC": checks if entry is numeric (or null)
//                      - FIELDTYPE="INTEGER": checks if entry is integer (or null)
//                      - FIELDTYPE="UINTEGER": checks if entry is unsigned integer (or null)
//                      - FIELDTYPE="FLOAT": checks if entry is float (or null)
//                      - FIELDTYPE="UFLOAT": checks if entry is unsigned float (or null)
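
As an illustration of the "(or null)" behaviour listed in the synopsis, here is a sketch of per-type checks. The regular expressions are illustrative guesses and not the patterns used by the actual library ; the names FIELD_CHECKS and checkField are hypothetical, and TEL, NUMERIC and SELECT are left out because the synopsis does not say which formats they accept.

```javascript
// Hypothetical patterns keyed by FIELDTYPE value (illustrative only).
var FIELD_CHECKS = {
  ALPHA: /^[A-Za-z]+$/,
  ALPHANUMERIC: /^[A-Za-z0-9]+$/,
  INTEGER: /^[+-]?\d+$/,
  UINTEGER: /^\d+$/,
  FLOAT: /^[+-]?\d+(\.\d+)?$/,
  UFLOAT: /^\d+(\.\d+)?$/,
  EMAIL: /^[^\s@]+@[^\s@]+\.[^\s@]+$/ // deliberately loose
};

// `rules` is assumed to look like { required, fieldname, fieldtype }.
// Returns an error message, or null when the value is acceptable.
function checkField(rules, value) {
  var label = rules.fieldname || "This field";
  // an empty value is only an error when the field is REQUIRED
  if (value === "") {
    return rules.required ? label + " is required" : null;
  }
  // unknown or missing FIELDTYPE: nothing more to check
  if (!Object.prototype.hasOwnProperty.call(FIELD_CHECKS, rules.fieldtype)) {
    return null;
  }
  if (!FIELD_CHECKS[rules.fieldtype].test(value)) {
    return label + " is not a valid " + rules.fieldtype;
  }
  return null;
}
```

A table of patterns keeps the addition of a new FIELDTYPE down to one line, which is the kind of saving the attribute-based approach is after.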

You can find my form and test it at http://www.fecj.org/extra/formvalidate/tableform.php

I also dropped a more complete online testing script at http://www.fecj.org/extra/formvalidate/formvalidate_sample.php


There is another solution, explained in a companion ERT article written by Rdivilbiss on the same matter, but answering a different initial problem and offering a different solution due to diverging opinions on some of the sub-considerations above :D
Rod's solution :

Rod's test form is accessible at http://www.cafesong.com/ert/cssform1ul.htm

Supposing that solution also validates, we could now compare them for performance and ease of implementation.

5. On Performance

Dynamic performance was not tested, but a static comparison is possible :
tableform.php, with its FORM fields and all their attributes, its submit-handling code and its TABLE-based layout, occupies less than 3882 bytes(³).
cssform1ul.htm, with its FORM fields, no submit-handling part and its CSS-based layout, occupies 5475 bytes.
The validation library of tableform.php occupies 5920 bytes.
The validation javascript+regexp part of cssform1ul.htm was missing.

(³) Personally, I would have designed the FORM differently, to benefit from the TABLE container, and the layout would have taken 2800 bytes or so.


Both FORMs degrade exactly the same way when the window becomes too narrow (²). Both behave the same in Firefox. They do NOT behave the same in an old Internet Explorer ;-)))
(the market share of that IE version was 13% when this library was written, 3% when this article was started, and 1.1% now ;-)

(²) because of the fixed overall FORM width of Rod's FORM, see here for a degradation comparison where TABLE shows its superiority.


But the bad news is that the two forms do not even look identical in Internet Explorer 6 SP2 ...
(and the market share of ***that one*** is still around 30% )
Proof ?

6. On Cost Effectiveness

You'll see for yourself ;-)

7. Conclusion

The best solution is ours, clearly.
I tried to show as accurately as possible the differences in the two solutions, explaining the "pros and cons" each time we had a decision to make.

Here is the set of files I used and designed :

  1. the validation library
  2. the custom DTD
  3. Rod's FORM done with TABLEs
  4. the complete test script




Best regards,
VGR for European Experts Exchange and Experts Round Table
Last update 2009-08-12 12:10:58
