Dark Moon Velvet

Nothing special, just a Ruby script.

We start with something such as this:

# is script executed?
if __FILE__ == $0

end

This is our main program (sort of). We check whether the script was executed directly or included from another file by comparing the current file (__FILE__) with the initially executed file ($0).

Now what I want to do is read some files passed as arguments, process them in some way, then write back a processed file, so:

# is script executed?
if __FILE__ == $0
  ARGV.each do |arg|
  end
end

Each argument is now available as arg.

# is script executed?
if __FILE__ == $0
  ARGV.each do |arg|
    begin

    rescue StandardError => e  # StandardError, so Ctrl-C can still stop the script
      puts "Error: #{e.message}"
    end
  end
end

We’re going to read from some files, so we might as well add some exception handling (nothing fancy).

# is script executed?
if __FILE__ == $0
  ARGV.each do |arg|
    begin
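      path = Dir.getwd
      source = File.join(path, arg)
      puts "Processing file: #{source}"
      html = HtmScarab.new(source)
      # write the result next to the original, with a "se-" prefix
      File.open(File.join(path, "se-" + arg), "w") do |file|
        file.puts html.eval
      end
    rescue StandardError => e  # StandardError, so Ctrl-C can still stop the script
      puts "Error: #{e.message}"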
    end
  end
end

We read the file, process it, and then write it back with a “se-” prefix. The file is closed for us at the end of the File.open block (good ol’ Ruby).
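If you want to convince yourself that the block form really closes the file, here is a small sketch. The helper name write_and_check_closed and the temporary file name are my own, made up for illustration; they are not part of the script.

```ruby
require 'tmpdir'

# Writes text to path using the block form of File.open and reports
# whether the handle was closed once the block finished.
def write_and_check_closed(path, text)
  handle = nil
  File.open(path, "w") do |file|
    handle = file      # keep a reference so we can inspect it afterwards
    file.puts text
  end
  handle.closed?
end

Dir.mktmpdir do |dir|
  target = File.join(dir, "se-example.html")
  puts write_and_check_closed(target, "<p>hello</p>")
  puts File.read(target)
end
```

No explicit close call is needed anywhere, which is the whole point of the block form.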

Above this code we need to create the business end of the script. So we add the following class:

class HtmScarab
  # class for converting from html to "semantic sugar",
  # essentially the eval method of this class will remove
  # non semantic html elements

  def initialize html_file
    @html = ""
    File.open html_file, 'r' do |file|
      while line = file.gets
        @html += line
      end
    end
  end

end

We have a constructor that accepts a file path and reads the file into a field called @html. This is how the main part of the script expects to use the class, so this is what we need to write next:

class HtmScarab
  # class for converting from html to "semantic sugar",
  # essentially the eval method of this class will remove
  # non semantic html elements

  def initialize html_file
    @html = ""
    File.open html_file, 'r' do |file|
      while line = file.gets
        @html += line
      end
    end
  end

  def clean
    regex = /\n|\r/mi
    @html.gsub! regex, ' '
    regex = /\s\s*/mi
    @html.gsub! regex, ' '
  end

end

We add a cleaning method that collapses newlines and runs of whitespace into single spaces, as they might get in the way.
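To see what that cleaning step does to a string, here is a standalone sketch. It is a non-mutating variant under a name of my own (collapse_whitespace), and it uses a single \s+ pattern, which as far as I can tell gives the same result as the two substitutions in clean.

```ruby
# Collapses every run of whitespace (spaces, tabs, newlines, carriage
# returns) into one space and returns a new string.
def collapse_whitespace(text)
  text.gsub(/\s+/, ' ')
end

sample = "<p>Lorem\n   ipsum</p>\r\n<p>dolor</p>"
puts collapse_whitespace(sample)
```

Note that whitespace is collapsed, not removed: the words stay separated by one space.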

class HtmScarab
  # class for converting from html to "semantic sugar",
  # essentially the eval method of this class will remove
  # non semantic html elements

  def initialize html_file
    @html = ""
    File.open html_file, 'r' do |file|
      while line = file.gets
        @html += line
      end
    end
  end

  def clean
    regex = /\n|\r/mi
    @html.gsub! regex, ' '
    regex = /\s\s*/mi
    @html.gsub! regex, ' '
  end

  # the heart of the operation!
  def eval
    if (@html)
      list = []
      # the following are not semantic or are unnecessary
      # (non-greedy .*? so each match stops at the first closing delimiter):
      list << Pair.new(/<head\b.*?<\/head>/mi, "")
      list << Pair.new(/\s*class=\".*?\"/mi, "")
      list << Pair.new(/<\/?(div|span)\b.*?>/mi, "")
      list << Pair.new(/<script.*?<\/script>/mi, "")
      list << Pair.new(/<style.*?<\/style>/mi, "")
      list << Pair.new(/<\?xml-stylesheet.*?\?>/mi, "")
      list << Pair.new(/<!--.*?-->/mi, "")

      # what was I doing?
      list.each do |pair|
        @html.gsub! pair.regex, pair.value
      end

      clean
      # return
      @html
    end
  end

end

And of course we do some quick regex seek and destroy. It may not be great but it gets the job done… well, not quite: I invented the class Pair as I went along because it was convenient, so it is time to create it with everything we need:

class Pair
  attr_accessor :regex, :value

  def initialize regex, value
    @regex = regex
    @value = value
  end

end
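For trying the substitutions without touching any files, here is a rough sketch that runs a few of the same patterns over a string. Rule, RULES and strip_presentation are hypothetical names of mine; Rule is a Struct standing in for Pair, since a Struct gives the same two accessors for free.

```ruby
# Hypothetical stand-in for Pair: a Struct with the same two fields.
Rule = Struct.new(:regex, :value)

# A subset of the patterns used in eval.
RULES = [
  Rule.new(/\s*class=".*?"/mi, ""),
  Rule.new(/<\/?(div|span).*?>/mi, ""),
  Rule.new(/<!--.*?-->/mi, "")
].freeze

# Applies each rule in order and returns a new string.
def strip_presentation(html)
  RULES.reduce(html) { |text, rule| text.gsub(rule.regex, rule.value) }
end

puts strip_presentation('<div class="content"><p class="intro">Hi</p><!-- note --></div>')
```

Whether you prefer the hand-written Pair or a Struct is a matter of taste; the script does not need more than the two accessors either way.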

The point

You might be tempted to add more methods and so on to either the Pair or HtmScarab class. Don’t! It’s a waste of time and effort, even if the classes look incomplete as they are; overengineering (anything) will eventually make it unnecessarily complicated and harder to understand. A lot of programmers will occasionally use their “god given foresight” to create all sorts of extra functions for the future. The consequence is classes with all sorts of useless dangling bits nobody ever needs.

The incremental way I created the script in the example above is not possible for every program, but do try to at least sketch up a prototype and build the application from the functionality inward rather than conceiving it up front and presuming usability and usefulness.

In Ruby, adding methods before they are needed is even more pointless than in other languages. Suppose we want to reuse an object of our HtmScarab class; we would need to add an extra method. It goes something like this:

class HtmScarab
  def set value
    @html = value
  end
end

So, I reopened the class by writing class HtmScarab / end anywhere in my code, then added the new method I need. It’s simple, clean and in a way efficient.

I’m sure by now everyone knows that a tag is a word starting with a letter enclosed within “<” (less than) and “>” (greater than), and how it is highly recommended we close them so as to avoid confusion, blah blah. But semantics are not going to write themselves just by knowing that, and I find many people do not actually know what the heck it is they are writing.
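Here is the same trick in isolation, as a sketch you can run on its own. Greeter and its methods are made up for the illustration and have nothing to do with the script.

```ruby
# Hypothetical class, used only to illustrate reopening.
class Greeter
  def initialize(name)
    @name = name
  end

  def greet
    "Hello, #{@name}"
  end
end

# Later, anywhere in the program: reopen the class and add one method.
# The existing methods stay exactly as they were.
class Greeter
  def rename(name)
    @name = name
    self
  end
end

greeter = Greeter.new("velvet")
puts greeter.greet
puts greeter.rename("scarab").greet
```

The second class block does not replace the first; it only adds to it, which is why there is no need to write the method ahead of time.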

Normally we start small, but that’s so boring, so here’s a full page:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/css" href="index.css" ?>

<!DOCTYPE 
   html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
>

<html xmlns="http://www.w3.org/1999/xhtml">

   <head> 
      <meta http-equiv="Content-Type" 
               content="text/html;charset=ISO-8859-1" />
      <link rel="stylesheet" type="text/css" 
               href="index.css" />
      
      <title>Untitled</title>
      
      <style type="text/css">
      /** page specific style **/        
      </style>
      
      <meta name="description" content="Lorem ipsum." />
      <meta name="keywords" content="lorem, ipsum" />
      <meta name="author" content="velvet" />
      
      <meta name="distribution" content="global" />
      
      <link rel="copyright" href="#" />
      <link rel="help" href="#" />
   </head>
   
   <body>
      <h1>My Blog</h1>
      <h2>Lorem ipsum 2009</h2>
      <p>Lorem ipsum dolor sit amet, [...] </p>
      <p>Nulla facilisi. Vivamus erat neque, [...] </p>
      <p>Vivamus semper convallis enim. [...]</p>
      <h3>Comments</h3>
      <p>Vestibulum dignissim placerat magna.</p>
      <p>Cras hendrerit, dolor at semper rhoncus, 
      est odio sodales ligula, ut ante.</p>

      <h2>Lorem Ipsum 2008</h2>
      <p>Lorem ipsum dolor sit amet, [...] </p>

...
      
      <script type="text/javascript" src="index.js">
      </script>   
   </body>
   
</html>

I’ll explain each line starting from the top.

XML and DTD

<?xml version="1.0" encoding="ISO-8859-1"?>

Because I am writing XHTML (i.e. “eXtensible HTML”) my page is (to some extent) an XML document, so it is only natural I treat it as such.

The line is a standard (I say this because it is easily overridden) declaration of the document as XML. In our case I’m saying:

This is an XML document using the 1.0 specification and the character encoding ISO-8859-1.

Now, I did say “it is easily overridden”, and you may be interested to know that the major browsers will not care much whether you write it. Instead, they determine what your document is (this applies to all types of files) from the MIME type the server specifies when the document is sent. However, should your document be saved to disk, the browser no longer has this convenience and will look at the above line.

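Why do you need it: Strictly speaking the declaration is optional in XML 1.0, but it is recommended, and because this document uses ISO-8859-1 rather than UTF-8 or UTF-16, a parser that gets no encoding information from the server needs it to read the file correctly.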
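To make the “saved to disk” case concrete, here is a rough Ruby sketch of how a tool might pull the encoding out of that first line. The function name declared_xml_encoding is hypothetical, and the regex only handles the simple version-then-encoding form used in this page, not everything the XML grammar allows.

```ruby
# Returns the encoding named in a leading XML declaration, or nil when
# there is no declaration or it does not name an encoding.
def declared_xml_encoding(text)
  match = text.match(/\A<\?xml\s+version=(["'])[^"']+\1\s+encoding=(["'])([^"']+)\2/)
  match && match[3]
end

puts declared_xml_encoding('<?xml version="1.0" encoding="ISO-8859-1"?>')
p declared_xml_encoding('<html></html>')
```

A real parser does more than this (byte order marks, transport headers), so treat it as an illustration of where the information lives, nothing more.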

<?xml-stylesheet type="text/css" href="index.css" ?>

This line specifies the CSS stylesheet using XML syntax (I specify it below in HTML too, but no harm here). Translation:

Style this content with the stylesheet written in “index.css” (located in the current folder). The stylesheet has the MIME type text/css.

Why do you need it: Devices that understand very purist XHTML syntax may like it.

<!DOCTYPE 
   html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
>

This is a doctype (Document Type Declaration). It tells the browser which tags go inside which tags, what attributes are valid for each tag, and so on and so forth. And it is very important, as I shall explain below.

First the basics: a doctype declaration starts with <!DOCTYPE and ends with >. I won’t go into detail about how to write one, but I will explain what the code snippet we have does.

In the above doctype declaration we have linked the public (as in known by default by browsers) definition of, in our case, an XHTML Strict document to the html tag (the root of our document). By linking it in, we have also declared that all elements enclosed by it abide by said doctype’s specification.

The extra URI within quotes points to a raw copy of the DTD (you can go there to see all the code). Note that XML itself requires this URI after a public identifier, so keep it if the page will ever be parsed as XML. Browsers reading the page as HTML accept the public identifier alone, so there you can write the entire declaration as:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">

Why do you need it: Modern browsers have to work with both old and new. So what happens when they see a page? Should they run it through the gauntlet of code fixing when processing, or trust that you were competent enough to write it correctly? Obviously that’s a hard decision, so they use what’s referred to as a doctype switch. Depending on what doctype you choose they will run more or less code fixing. This will affect both consistency while designing (some CSS may not work well, if at all, should you have an incorrect doctype) and end-user performance. You can see a very simple behaviour chart created by Opera.

Just to be clear, DTDs come from SGML, the standard HTML was built on, although XML supports them too. XML’s own alternative is a Schema, but since both serve the same function, and since browsers understand DTDs better than anything else (and we just need to specify one, not write one), it’s better to use DTDs for HTML pages.
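If you are curious which doctype a page actually declares, a small Ruby sketch like the following can read it back. The name doctype_identifiers is my own, and the pattern only covers the html PUBLIC form used on this page; it says nothing about which rendering mode a given browser then picks.

```ruby
# Extracts the public identifier and the (possibly missing) system
# identifier from an "html PUBLIC" doctype. Returns nil if none is found.
def doctype_identifiers(html)
  match = html.match(/<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?\s*>/mi)
  return nil unless match

  { public_id: match[1], system_id: match[2] }
end

page = <<~HTML
  <!DOCTYPE
     html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
  >
  <html xmlns="http://www.w3.org/1999/xhtml"></html>
HTML

p doctype_identifiers(page)
```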

Moving on to actual HTML…

<html xmlns="http://www.w3.org/1999/xhtml">

Start HTML markup, using XML namespace (xmlns): “http://www.w3.org/1999/xhtml”.

Do not be confused by XML namespaces tending to look like URLs; that is just because it’s easy to be unique that way. Had the standards (set by the W3C) chosen a format for the URI (Uniform Resource Identifier) such as the one used for Java packages, org.w3.1999.xhtml, we would be writing that.

Why do you need it: It identifies all elements, including the html element on which it resides, as belonging to XHTML. This does not immediately become useful, and with today’s “standards in writing pages” and rendering by browsers (particularly desktop oriented ones) most developers will not get much out of it, but should you move to inserting other XML vocabularies inside the page, it becomes useful. Consider something like this:

...
<html 
   xmlns="http://www.w3.org/1999/xhtml"
   xmlns:blog="org.example.blog.something.standard"
>
   <blog:pagetitle>My blog</blog:pagetitle>
   <blog:title>Post 1</blog:title>
   <p> ... </p>
   <img src=" ... " alt=" ... " />
   <p> ... </p>
   <p> ... </p>
...
   <blog:title>Post 2</blog:title>
   <p> ... </p>
...
</html>
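To see how a namespace-aware parser tells those elements apart, here is a sketch using Ruby’s REXML library. It assumes the rexml gem is available (it ships with standard Ruby installs); child_namespaces is a hypothetical helper name, and the sample is a cut-down version of the markup above.

```ruby
require 'rexml/document'

# Returns [prefix, local name, namespace URI] for each child of the root.
def child_namespaces(xml)
  root = REXML::Document.new(xml).root
  root.elements.to_a.map { |el| [el.prefix, el.name, el.namespace] }
end

sample = <<~XML
  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:blog="org.example.blog.something.standard">
    <blog:pagetitle>My blog</blog:pagetitle>
    <p>Lorem ipsum</p>
  </html>
XML

child_namespaces(sample).each { |row| p row }
```

The unprefixed p element picks up the default XHTML namespace, while the blog: elements resolve to the second one, so a program can treat the two vocabularies differently.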

OK, moving on to the HTML <head> section, or the “do not print this stuff on the page” / “meta-data” section.

<meta http-equiv="Content-Type"  
            content="text/html;charset=ISO-8859-1" />

I declare the content of this document as being text/html written with the character set defined by ISO-8859-1.

Why do you need it: This is the standard HTML declaration for content type. This declaration should appear in an HTML document; however, since the move to XML it has become somewhat redundant and there probably will not be any issue removing it. Remember that once it is placed in the document, if the browser detects a conflict between the setting here and what it has been told, it will go back to the top and restart processing with the charset mentioned in this meta tag (some browsers will), so place it at the very top of the head element to avoid useless processing.

<link rel="stylesheet" type="text/css"  
               href="index.css" />  

An HTML declaration for the stylesheet; everything here reads the same as the XML one. If anyone is wondering why it’s called a generic “index.css”, it’s because it’s highly recommended to merge all your stylesheets into one to avoid delaying page load with too many HTTP requests to the server. I suggest you avoid separating stylesheets for different media and instead use the @media CSS rule, as the gain from separating is little to nonexistent.

<title>Untitled</title>  
  
<style type="text/css">  
/** page specific style **/  
</style>  
  
<meta name="description" content="Lorem ipsum." />  
<meta name="keywords" content="lorem, ipsum" />  
<meta name="author" content="velvet" />  
  
<meta name="distribution" content="global" /> 

These are all very simple metadata that do pretty much what they say. I suggest reading more into SEO to find out what they do, as well as what you should and should not be doing with them.

<link rel="copyright" href="#" />  
<link rel="help" href="#" />

With these I am linking to documents which have a relationship with the current document. I’ve inserted those as an example; links to such documents are not required here, and you might already be linking to them inside the <body> block. Note that the relationship is not random: the rel values come from a set of link types defined by the HTML specification.

Why would you use such things: some programs make use of such metadata to improve the user interface.

Moving on to the body, I’ll start with the end…

<script type="text/javascript" src="index.js">  
</script>  

Why do I have all my JavaScript at the bottom of the page? The answer is simple: to avoid it loading before the content. Let’s say I have a huge script and some content it is applied to, and the content in question is perfectly viewable/usable without the script; why waste time waiting for the script? It doesn’t make sense, so we place the script as the last node in the body, where it is loaded last. This also avoids possible errors where JavaScript DOM alterations are not applied to nodes which were not yet loaded at the time of the script’s execution (in certain incompetent browsers).

Semantics in HTML

Moving to content,

Good example

<h1>My Blog</h1>  
<h2>Lorem ipsum 2009</h2>  
<p>Lorem ipsum dolor sit amet, [...] </p>  
<p>Nulla facilisi. Vivamus erat neque, [...] </p>  
<p>Vivamus semper convallis enim. [...]</p>  
<h3>Comments</h3>  
<p>Vestibulum dignissim placerat magna.</p>  
<p>Cras hendrerit, dolor at semper rhoncus,  
est odio sodales ligula, ut ante.</p>  
  
<h2>Lorem Ipsum 2008</h2>  
<p>Lorem ipsum dolor sit amet, [...] </p>

It may not look like it to some, but that is how every proper semantic XHTML webpage should look once stripped to the bone of any spans, classes, divs and other presentation markup. To show how the above code works, let’s consider the following bad example, ever so common in forum software:

Bad example

<div id="header">
   <img src="header.jpg" alt="My blog" />
</div>  

<h4>Lorem ipsum 2009</h4>  
<div class="content">
Lorem ipsum dolor sit amet, [...] <br /> <br />
Nulla facilisi. Vivamus erat neque, [...] <br /> <br />
Vivamus semper convallis enim. [...] <br /> <br />
</div>
<em><strong>Comments</strong></em>
<div>Vestibulum dignissim placerat magna.</div>  
<div>Cras hendrerit, dolor at semper rhoncus,  
est odio sodales ligula, ut ante.</div>  
  
<h4>Lorem Ipsum 2008</h4>  
<div class="content">
Lorem ipsum dolor sit amet, [...] 
</div>

Just comparing the two, it becomes evident something is horribly wrong. But let’s drill through it to show exactly what is wrong and how.

First things first, the site’s name/branding. In the good example, the title is placed in the once-per-page <h1> tag, giving it maximum importance and naming the entire document; placing more than one <h1> tag would semantically mean more than one document. In the bad example the title of the page is placed as merely the alt of an image; semantically and from an SEO perspective it might as well not have been placed at all. Remember that the <title> in the <head> is (and, SEO wise, should be) the title of the current page, not of the site, but it should not be a stand-in for the page’s title in the content, since it is metadata, not page content.

Moving on to the next error. If you look at the titles of the posts, you’ll notice how the bad example uses an <h4>. Ever since HTML first came to be, every hobbyist tutorial site out there has labeled h1, h2, h3 etc. as headers with different degrees of importance, and subsequently the general populace (and more hobbyists) continued the tradition of ranking content based on their bias and giving it an h label from 1 to 6. This is complete semantic nonsense, and just to get things straight:

You are not helping crawlers and the web in any way by “ranking headers”!

Take the following example:

<h4> ... </h4>
<p> ... </p>
<p> ... </p>
<h1> ... </h1>
<p> ... </p>
<h3> ... </h3>
<p> ... </p>

Can you tell in which order that data should be semantically ordered? No, and neither can the web.

Headers are like nested lists: you always start with an <h1> (the “importance” is where you decide to start with it), you always use an <h2> for sub-content and another <h1> if it’s adjacent content. Once you have used an <h2> you would use another <h2> for content of similar importance or an <h3> for sub-content to that. And so on and so forth:

<h1> ... </h1>
   <h2> ... </h2>
   <h2> ... </h2>
      <h3> ... </h3>
   <h2> ... </h2>
   <h2> ... </h2>
      <h3> ... </h3>
      <h3> ... </h3>
      <h3> ... </h3>
      <h3> ... </h3>
   <h2> ... </h2>

Now your entire document makes sense: every section defined by a header can be compared logically with any other, and so can the data in that section. Compared to the complete randomness in the earlier example it is a huge improvement.
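The nesting rule is easy to check mechanically. Here is a rough Ruby sketch that does so; heading_levels and outline_problems are names I made up, and the two checks (start at h1, never jump more than one level deeper) are my reading of the rule above, not an official validator.

```ruby
# Returns the heading levels in document order, e.g. [1, 2, 2, 3].
def heading_levels(html)
  html.scan(/<h([1-6])\b/i).flatten.map(&:to_i)
end

# Lists problems: the first heading is not <h1>, or a heading skips a
# level going deeper (for example <h1> followed directly by <h3>).
def outline_problems(levels)
  problems = []
  if levels.first && levels.first != 1
    problems << "first heading is h#{levels.first}, expected h1"
  end
  levels.each_cons(2) do |previous, current|
    problems << "h#{previous} is followed by h#{current}" if current > previous + 1
  end
  problems
end

random_order = "<h4>a</h4><p>b</p><h1>c</h1><p>d</p><h3>e</h3>"
puts outline_problems(heading_levels(random_order))
```

Going back up any number of levels is fine (an <h3> followed by an <h2> just closes the subsection), so only the downward jumps are reported.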

Moving on to the difference in writing content, let’s look at the good and bad side by side:

Good example

<p>Lorem ipsum dolor sit amet, [...] </p>
<p>Nulla facilisi. Vivamus erat neque, [...] </p>
<p>Vivamus semper convallis enim. [...]</p>

Bad example

<div class="content">
Lorem ipsum dolor sit amet, [...] <br /> <br />
Nulla facilisi. Vivamus erat neque, [...] <br /> <br />
Vivamus semper convallis enim. [...] <br /> <br />
</div>

In case you’re wondering, the “[…]” means nothing special. It is the typographic way of saying “inserted content”, with the inserted content enclosed in square brackets (in our case an ellipsis for “more”).

What is a <p>? I’ll tell you what it is not: a <p> is not a block of text with an empty line at the end; it is an “idea”, a block delimiter for a message. You do not write <p>s just because they look like paragraphs; they have semantic value!

What is a break? A break delimits lines of data in HTML elements where it makes sense, such as the <address> element; think of phone, street, city etc., uncountable data. The <address> element is used for the author’s (of the page) information inside the page content; it was made in “simpler times”, hence address. Don’t misuse it by placing countless addresses of other people in it; it makes no sense if they are not the authors of the page where <address> is placed.

So now, knowing that, how much sense does it make to insert two consecutive breaks (there is no real semantic use where you would use two!) instead of paragraphs? To put it simply, what is happening here is that three ideas are turned into one marvelous blob of text, through a hack to the semantic markup, with god knows what meaning. As much as this could mean a paragraph, it could also mean a quote or anything else (preformatted sample computer code, anyone?), since the enclosure is not a clear semantic delimiter but a div, which is used to group semantic markup but has no semantic meaning itself.
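If you are stuck with content written that way, the repair can be sketched in a few lines of Ruby. This assumes that every run of two or more breaks was meant as a paragraph boundary, which is a guess about the author’s intent; breaks_to_paragraphs is a hypothetical name.

```ruby
# Splits text on runs of two or more <br> tags and wraps each piece in <p>.
def breaks_to_paragraphs(text)
  text.split(/(?:\s*<br\s*\/?>\s*){2,}/i)
      .map(&:strip)
      .reject(&:empty?)
      .map { |piece| "<p>#{piece}</p>" }
      .join("\n")
end

blob = "Lorem ipsum dolor sit amet. <br /> <br />\n" \
       "Nulla facilisi. <br /> <br />\n" \
       "Vivamus semper convallis enim. <br /> <br />\n"
puts breaks_to_paragraphs(blob)
```

A single break is left alone, since on its own it may be a legitimate line delimiter.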

On to the last piece of semantic disaster. Consider the following, again good and bad example side by side:

Good example

<h3>Comments</h3>
<p>Vestibulum dignissim placerat magna.</p>
<p>Cras hendrerit, dolor at semper rhoncus,
est odio sodales ligula, ut ante.</p>

Bad example

<em><strong>Comments</strong></em>
<div>Vestibulum dignissim placerat magna.</div>
<div>Cras hendrerit, dolor at semper rhoncus,
est odio sodales ligula, ut ante.</div>

I already talked about headers and paragraphs and their importance above, but let’s look at what is happening here with the alternative “emphasized” comment title. First of all, even though it may seem correct (since we’re going to presume here that there is an enclosing block) to place those inline nodes there, do not do it! Blocks should follow blocks, and inline elements should only be inside blocks, never adjacent to them. Secondly, placing that double emphasis is quite simply useless; there is no such thing as “more emphasized”, even though you want it to be so. Avoid double emphasis unless it’s a special case where you’re emphasizing part of something which is already emphasized.

The rest of the problem is obvious, but to write it down: the comments and post content are being merged, since the emphasized text between them is nothing but a mere paragraph; in many situations this merger is not desired.

Semantically speaking, in certain situations it is fair use to, let’s say, “over emphasize” a sentence as a visual cue to the reader and to avoid placing a title. This can subsequently be made to look like a title while semantically acting as an “anchor”.

Tip

Try to start from the semantics outwards. Not from:

<div class="garbage code navigation"> ... </div>
<div class="garbage code header"> ... </div>
<div class="garbage code footer"> ... </div>

That is all (for now)

Do not worry, you shall forget it soon enough.