HTML5 and The Semantic Web


Since HTML5’s uptake in mainstream browsers, there’s been a lot of talk about the next version of the web, web 3.0 (even though Tim Berners-Lee dislikes that term).

The “next version” of the web, web 3.0, is also called the “Semantic Web” by many of the leading engineers working on HTML5 and other web standards. The driving idea behind the semantic web is that computers, rather than humans, will increasingly be the ones reading, interpreting, and digesting information from websites and web pages. Unfortunately, much of the data on the web is in forms that are hard for software to parse or recognize reliably, so new HTML tags and standards were developed and integrated with HTML5 to provide this functionality. Let’s look at a few examples of why this problem required such a solution:

Let’s take a simple date, like the 21st of March, 2011. A human can read this instantly and understand that I am talking about a date. A computer, however, has to read the line, check it against the date patterns it knows, and only then convert the match into a date value it can use.

An even harder sample to interpret would be 2/3/2011. In most countries other than the United States, this means the 2nd of March, 2011. Written by an American, it would mean the 3rd of February. The computer must do additional research, or ask the user to confirm the actual date. Either solution is undesirable. To fix this problem we have some new tags in HTML5. In this case, the <time> tag will help us out.

Instead of just writing that the concert is on the 4th of March, 2011 at 8pm, we write:
The concert is on the <time datetime="2011-03-04T20:00+05:00">4th of March, 2011 at 8pm</time>.

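The datetime attribute takes the date in YYYY-MM-DD form. To add a time, place a “T” after the date, followed by the 24-hour time and a +/- timezone offset (use Z or +00:00 for Zulu time, i.e. UTC). The pubdate attribute is a boolean flag placed on the same <time> element; it marks that time as the publication date of the enclosing article. It appeared in early HTML5 drafts and was later dropped from the specification.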
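To make the difference concrete, here is a minimal Python sketch of what a program on the reading end might do. It first shows that the bare string 2/3/2011 parses to two different dates depending on the assumed convention, then reads the unambiguous value from a <time> element. The class name TimeTagParser, the helper extract_times, and the sample sentence are hypothetical names made up for this illustration, and the sketch assumes the datetime value is in a form that Python's datetime.fromisoformat accepts.

```python
from datetime import datetime
from html.parser import HTMLParser

# The same characters give two different dates depending on the assumed convention.
ambiguous = "2/3/2011"
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")    # read as 2 March 2011
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")  # read as 3 February 2011


class TimeTagParser(HTMLParser):
    """Collects the parsed datetime attribute of every <time> element."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                # Assumes an ISO 8601 value such as 2011-03-02T20:00+00:00
                self.datetimes.append(datetime.fromisoformat(value))


def extract_times(html):
    """Return every machine-readable time found in an HTML fragment."""
    parser = TimeTagParser()
    parser.feed(html)
    parser.close()
    return parser.datetimes


# Hypothetical markup: the visible text stays ambiguous, the attribute does not.
snippet = 'The show is on <time datetime="2011-03-02T20:00+00:00">2/3/2011</time>.'
times = extract_times(snippet)
print(times[0].isoformat())
```

The visible text can stay in whatever style suits the reader, because the program never has to guess at it.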

While time is probably the best example of the semantic web changes in HTML5, it is far from the only one. The following is a list of some of the semantic web tags included in HTML5 (let me know if you have any good ones you’d like to see added!):

  • <address> — Specifies contact information for the author of an article.
  • <article> — Denotes an article, a block of information that stands on its own.
  • <hgroup> — Groups <h#> elements together to create a cleaner DOM flow for interpreters.
  • <details> — Gives additional information/controls to be shown or hidden.
  • <summary> — Provides the visible heading or summary for a <details> element; it is used inside the <details> tag.
  • <figure> — Denotes a figure (like a chart, picture, or other self-contained object).
  • <figcaption> — Gives the figure a caption.
  • <abbr> — Denotes an abbreviation and its expansion.
  • <del> — Marks text that has been deleted from the document; browsers usually render it struck through.
  • <ins> — Denotes newly inserted text, often paired with a preceding <del>.
  • rel="" — An attribute for links (<a> and <link>). It tells the browser (and more importantly, search engines) how the linked document relates to the current document. Examples include rel="author", rel="license", and rel="nofollow".
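As a rough illustration of how a crawler might use the rel attribute from the list above, the following Python sketch collects every (relationship, target) pair from a page. RelLinkParser, extract_rel_links, and the sample markup are hypothetical names and data invented for this example; the only library used is Python's standard html.parser.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collects (rel keyword, href) pairs from <a> and <link> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            attributes = dict(attrs)
            rel = attributes.get("rel")
            href = attributes.get("href")
            if rel and href:
                # A rel attribute may hold several space-separated keywords.
                for keyword in rel.split():
                    self.links.append((keyword, href))


def extract_rel_links(html):
    """Return the list of (rel keyword, href) pairs found in the markup."""
    parser = RelLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment used only for this demonstration.
page = (
    '<a rel="author" href="/about">Bryan</a> '
    '<a rel="license nofollow" href="/license">License</a> '
    '<a href="/plain">Plain link</a>'
)
links = extract_rel_links(page)
print(links)
```

Links without a rel attribute are skipped, since they say nothing about how the two documents relate.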

The reason these tags are so important boils down to the heart of semantics: the ability of machines to understand the data that we are feeding them. With these tags in place, we can run much more targeted searches. For example, imagine a search engine in which you can search for all the news articles published between 2010 and 2011 by author “X”, but only those that link to videos. This is only one example of everyday consumer use. Enterprise use could have even more of an impact on internal search engines and document management, especially for law and security firms that need to keep hundreds of thousands or even millions of documents. Instead of drowning in search results or attaching document metadata by hand, an organization that stores its documents with simple HTML markup could eliminate many document management problems. Imagine a world where you can cross-reference all articles containing the date “1-1-2011” with mentions of “New Years Party” in the document summary. That’s what the semantic web is all about: easy, built-in data mining.
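A toy version of that kind of search can be sketched in a few lines of Python. This is only an illustration under loud assumptions: ArticleScanner, matches, and the three sample documents are hypothetical; the first <time> element in each document is treated as its publication date; "links to videos" is simplified to "contains a <video> element"; and the author filter is left out to keep the sketch short.

```python
from datetime import date, datetime
from html.parser import HTMLParser


class ArticleScanner(HTMLParser):
    """Records the first <time datetime> value and whether a <video> appears."""

    def __init__(self):
        super().__init__()
        self.published = None
        self.has_video = False

    def handle_starttag(self, tag, attrs):
        if tag == "time" and self.published is None:
            value = dict(attrs).get("datetime")
            if value:
                # Keep only the calendar date so ranges compare cleanly.
                self.published = datetime.fromisoformat(value).date()
        elif tag == "video":
            self.has_video = True


def matches(html, start, end):
    """True if the document was published within [start, end] and has a video."""
    scanner = ArticleScanner()
    scanner.feed(html)
    scanner.close()
    return (
        scanner.published is not None
        and start <= scanner.published <= end
        and scanner.has_video
    )


# Hypothetical documents standing in for a crawled collection.
documents = {
    "a.html": '<article><time datetime="2010-06-15">15 June 2010</time>'
              '<video src="clip.webm"></video></article>',
    "b.html": '<article><time datetime="2010-09-01">1 September 2010</time>'
              '<p>No video here.</p></article>',
    "c.html": '<article><time datetime="2009-03-10">10 March 2009</time>'
              '<video src="old.webm"></video></article>',
}
hits = [
    name
    for name, html in documents.items()
    if matches(html, date(2010, 1, 1), date(2011, 12, 31))
]
print(hits)
```

The point is that none of this needs language processing or guesswork: the markup already states what each piece of data is.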

That’s pretty cool, but the descriptors that have been given to us by HTML standards are not even close to what we need to describe vague things like hot or cold, or people who aren’t authors of the page.  Of course, you could probably add things into the HTML tags like classes inside of <span> tags (many microformats do this!) but this turns out to be a pretty inelegant solution because it quickly balloons your filesize if you have many things that you wish to tag or mark inside of your documents.

A Brief Intro To RDF

Enter the bold new world of RDF, the Resource Description Framework, where everything can be linked to anything and even queried for information, much like a SQL relational database. RDF stores its data as subject-predicate-object statements rather than tables, and it has a query language with a similar-sounding name and SQL-like syntax called SPARQL. The idea behind this initiative is to give software crawling the web an easier way to work out what a document is and what it contains. Anything can be described by RDF, as illustrated in this example from the Wikipedia article on RDF:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
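To show what a machine gets out of that snippet, here is a minimal Python sketch that reads the example above and lists its statements as (subject, predicate, object) triples. The function name extract_triples is hypothetical, and the sketch uses only the standard xml.etree.ElementTree module, so it handles just this simple rdf:Description pattern. Real-world RDF/XML has many more forms and would call for a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples from simple RDF/XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores qualified names as "{namespace}localname";
            # joining the two parts gives the full predicate URI.
            namespace, local = prop.tag[1:].split("}", 1)
            triples.append((subject, namespace + local, prop.text))
    return triples


triples = extract_triples(rdf_xml)
for triple in triples:
    print(triple)
```

Each triple reads as a small sentence: the resource at the Tony Benn URL has the title "Tony Benn" and the publisher "Wikipedia".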

In the future, look for more RDF and a few other formats that make use of the same principles, as well as another article from CTOvision.com about RDF and how to leverage it. RDF and its brethren are far too complex and comprehensive to fit in as a side note to HTML5, as the example well illustrates.

In my next article about HTML5, I will go over the security features present within HTML5 and any security challenges that HTML5 presents to organizations.

About Bryan Halfpap

Bryan Halfpap is a software programmer, technology analyst and writer, and a driving force behind the security reporting at CTOvision.com. He is a frequent speaker at events and conferences including Defcon. You can find him on Twitter: @crypt0s