Class: Arrow::HTMLTokenizer
- Inherits:
 - Object
- Object
 - Object
 - Arrow::HTMLTokenizer
 
 - Includes:
 - Enumerable
 - Defined in:
 - lib/arrow/htmltokenizer.rb
 
Overview
Arrow::HTMLTokenizer is a simple HTML parser that can be used to break HTML down into tokens.
Some of the code and design were stolen from the excellent HTMLTokenizer
library by Ben Giddings.
VCS Id
 $Id$
Authors
Michael Granger
Please see the file LICENSE in the top-level directory for licensing details.
Constant Summary
- SVNRev = %q$Rev$
  SVN Revision
- SVNId = %q$Id$
  SVN Id
Instance Attribute Summary
- (Object) scanner (readonly)
  The StringScanner doing the tokenizing.
- (Object) source (readonly)
  The HTML source being tokenized.
 
Instance Method Summary
- (Object) each
  Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
- (HTMLTokenizer) initialize(source) (constructor)
  Create a new Arrow::HTMLTokenizer object.
 
Methods inherited from Object
deprecate_class_method, deprecate_method, inherited
Methods included from Loggable
Constructor Details
- (HTMLTokenizer) initialize(source)
Create a new Arrow::HTMLTokenizer object.

      # File 'lib/arrow/htmltokenizer.rb', line 41
      def initialize( source )
        @source = source
        @scanner = StringScanner.new( source )
      end
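
The constructor wraps the source in a StringScanner from Ruby's strscan standard library. As a minimal sketch of the scanner calls the tokenizer relies on (the sample HTML string and the variable names are made up for illustration and are not taken from Arrow):

```ruby
require 'strscan'

# Hypothetical sample input, not taken from the Arrow sources.
html    = '<p>Hello</p>'
scanner = StringScanner.new( html )

first_char = scanner.peek( 1 )           # look ahead one character without consuming it
open_tag   = scanner.scan_until( />/ )   # consume everything through the next '>'
text       = scanner.scan( /[^<]+/ )     # consume text up to (not including) the next '<'
rest       = scanner.rest                # the part of the input not yet consumed
```

Here `first_char` is `"<"`, `open_tag` is `"<p>"`, `text` is `"Hello"`, and `rest` is `"</p>"`. Both `scan` and `scan_until` return nil when the pattern does not match, which matters for input that is cut off in the middle of a tag.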
  
Instance Attribute Details
- (Object) scanner (readonly)
The StringScanner doing the tokenizing.

      # File 'lib/arrow/htmltokenizer.rb', line 55
      def scanner
        @scanner
      end
  
- (Object) source (readonly)
The HTML source being tokenized.

      # File 'lib/arrow/htmltokenizer.rb', line 52
      def source
        @source
      end
  
Instance Method Details
- (Object) each
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.

      # File 'lib/arrow/htmltokenizer.rb', line 60
      def each
        @scanner.reset

        until @scanner.empty?
          if @scanner.peek(1) == '<'
            tag = @scanner.scan_until( />/ )

            case tag
            when /^<!--/
              token = HTMLComment.new( tag )
            when /^<!/
              token = DocType.new( tag )
            when /^<\?/
              token = ProcessingInstruction.new( tag )
            else
              token = HTMLTag.new( tag )
            end
          else
            text = @scanner.scan( /[^<]+/ )
            token = HTMLText.new( text )
          end

          yield( token )
        end
      end
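
To try the dispatch without loading Arrow, the following is a self-contained sketch of the same loop. `SketchToken` and `SketchTokenizer` are hypothetical stand-ins, not Arrow classes: tokens are tagged with a symbol instead of being instances of HTMLComment, DocType, and so on. The sketch also differs from the listing in two labeled ways: it uses `eos?` in place of `empty?`, which the strscan documentation marks as obsolete, and it adds a fallback for a `<` that is never closed.

```ruby
require 'strscan'

# Stand-in for Arrow's token classes; hypothetical, not part of Arrow.
SketchToken = Struct.new( :type, :text )

class SketchTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  # Same dispatch order as the listing: the comment pattern has to be
  # tested before the more general '<!' pattern.
  def each
    @scanner.reset  # rewind, so the tokenizer can be iterated repeatedly

    until @scanner.eos?
      if @scanner.peek( 1 ) == '<'
        # Assumption (not in the listing): when no '>' follows, take the
        # rest of the input so an unterminated tag cannot stall the loop.
        tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )
        type = case tag
               when /^<!--/ then :comment
               when /^<!/   then :doctype
               when /^<\?/  then :processing_instruction
               else              :tag
               end
        yield SketchToken.new( type, tag )
      else
        yield SketchToken.new( :text, @scanner.scan(/[^<]+/) )
      end
    end
  end
end

html   = '<!DOCTYPE html><!-- note --><p>Hi</p>'
tokens = SketchTokenizer.new( html )
types  = tokens.map( &:type )
texts  = tokens.select { |t| t.type == :text }.map( &:text )
```

For this input, `types` is `[:doctype, :comment, :tag, :text, :tag]` and `texts` is `["Hi"]`. Because `each` is the only method defined, `map` and `select` come from Enumerable. Note that reading a tag with `scan_until( />/ )` stops at the first `>`, so a comment or attribute value that contains `>` ends the token early.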