HTML::Defang(3)       User Contributed Perl Documentation       HTML::Defang(3)

NAME
  HTML::Defang - Cleans HTML as well as CSS of scripting and other
    executable contents, and neutralises XSS attacks.

SYNOPSIS
  my $InputHtml = "<html><body></body></html>";
  my $Defang = HTML::Defang->new(
    context => $Self,
    fix_mismatched_tags => 1,
    tags_to_callback => [ qw(br embed img) ],
    tags_callback => \&DefangTagsCallback,
    url_callback => \&DefangUrlCallback,
    css_callback => \&DefangCssCallback,
    attribs_to_callback => [ qw(border src) ],
    attribs_callback => \&DefangAttribsCallback,
    content_callback => \&DefangContentCallback,
  );
  my $SanitizedHtml = $Defang->defang($InputHtml);
  # Callback for custom handling of specific HTML tags
  sub DefangTagsCallback {
    my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;
    # Explicitly defang this tag, even though it is safe
    return DEFANG_ALWAYS if $lcTag eq 'br';
    # Explicitly whitelist this tag, even though it is unsafe
    return DEFANG_NONE if $lcTag eq 'embed';
    # I am not sure what to do with this tag, so process as HTML::Defang normally would
    return DEFANG_DEFAULT if $lcTag eq 'img';
  }
  # Callback for custom handling URLs in HTML attributes as well as style tag/attribute declarations
  sub DefangUrlCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;
    # Explicitly allow this URL in tag attributes or stylesheets
    return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;
    # Explicitly defang this URL in tag attributes or stylesheets
    return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;
  }
  # Callback for custom handling style tags/attributes
  sub DefangCssCallback {
    my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
    my $i = 0;
    foreach (@$Selectors) {
      my $SelectorRule = $$SelectorRules[$i];
      foreach my $KeyValueRules (@$SelectorRule) {
        foreach my $KeyValueRule (@$KeyValueRules) {
          my ($Key, $Value) = @$KeyValueRule;
          # Comment out any '!important' directive
          $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ '!important';
          # Comment out any 'position: fixed;' declaration
          $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ 'position' && $Value =~ 'fixed';
        }
      }
      $i++;
    }
  }
  # Callback for custom handling HTML tag attributes
  sub DefangAttribsCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;
    # Change all 'border' attribute values to zero.
    $$AttrValR = '0' if $lcAttrKey eq 'border';
    # Defang all 'src' attributes
    return DEFANG_ALWAYS if $lcAttrKey eq 'src';
    return DEFANG_NONE;
  }
  # Callback for all content between tags (except <style>, <script>, etc)
  sub DefangContentCallback {
    my ($Self, $Defang, $ContentR) = @_;
    $$ContentR =~ s/remove this content//;
  }

DESCRIPTION
  This module accepts an input HTML and/or CSS string and removes any
  executable code including scripting, embedded objects, applets, etc.,
  and neutralises any XSS attacks. A whitelist based approach is used,
  which means only HTML known to be safe is allowed through.

  HTML::Defang uses a custom HTML tag parser. The parser has been designed
  and tested to work with nasty real world HTML and to emulate as closely
  as possible what browsers actually do with strange looking constructs.
  The test suite has been built from examples from a range of sources such
  as http://ha.ckers.org/xss.html and
  http://imfo.ru/csstest/css_hacks/import.php to ensure that as many XSS
  attack scenarios as possible have been dealt with.

  HTML::Defang can make callbacks to client code when it encounters the
  following:

    * When a specified tag is parsed
    * When a specified attribute is parsed
    * When a URL is parsed as part of an HTML attribute, or CSS property
      value
    * When style data is parsed, as part of an HTML style attribute, or as
      part of an HTML <style> tag

  The callbacks include details about the current tag/attribute that is
  being parsed, and also give a scalar reference to the input HTML.
  Querying pos() on the input HTML indicates how far the module has got
  with parsing. This gives the client code flexibility in working with
  HTML::Defang.

  HTML::Defang can defang whole tags, any attribute in a tag, any URL that
  appears as an attribute or style property, or any CSS declaration in a
  declaration block in a style rule. This helps to block precisely the most
  specific unwanted elements in the contents (for example, block just an
  offending attribute instead of the whole tag), while retaining any safe
  HTML/CSS.

  HTML::Defang->new(%Options)
      Constructs a new HTML::Defang object. The following options are
      supported:

      Options

      tags_to_callback
          Array reference of tags for which a callback should be made. If
          a tag in this array is parsed, the subroutine tags_callback() is
          invoked.

      attribs_to_callback
          Array reference of tag attributes for which a callback should be
          made. If an attribute in this array is parsed, the subroutine
          attribs_callback() is invoked.

      tags_callback
          Subroutine reference to be invoked when a tag listed in
          @$tags_to_callback is parsed.

      attribs_callback
          Subroutine reference to be invoked when an attribute listed in
          @$attribs_to_callback is parsed.

      url_callback
          Subroutine reference to be invoked when a URL is detected in an
          HTML tag attribute or a CSS property.

      css_callback
          Subroutine reference to be invoked when CSS data is found either
          as the contents of a 'style' attribute in an HTML tag, or as the
          contents of a <style> HTML tag.

      content_callback
          Subroutine reference to be invoked when standard content between
          HTML tags is found.

      fix_mismatched_tags
          This property, if set, fixes mismatched tags in the HTML input.
          By default, tags present in the default %mismatched_tags_to_fix
          hash are fixed. This set of tags can be overridden by passing an
          array reference as the mismatched_tags_to_fix option to the
          constructor. Any opened tags in the set are automatically closed
          if no corresponding closing tag is found. If an unbalanced
          closing tag is found, it is commented out.

      mismatched_tags_to_fix
          Array reference of tags for which the code checks for matching
          opening and closing tags. See the property fix_mismatched_tags.

      context
          You can pass an arbitrary scalar as a 'context' value that's then
          passed as the first parameter to all callback functions. Most
          commonly this is something like '$Self'.

      allow_double_defang
          If this is true, then tag names and attribute names which already
          begin with the defang string ("defang_" by default) will have an
          additional copy of the defang string prepended if they are
          flagged to be defanged by the return value of a callback, or if
          the tag or attribute name is unknown.

          The default is to assume that tag names and attribute names
          beginning with the defang string are already made safe, and need
          no further modification, even if they are flagged to be defanged
          by the return value of a callback. Any tag or attribute
          modifications made directly by a callback are still performed.

      delete_defang_content
          Normally defanged tags are turned into comments and prefixed by
          defang_, and defanged styles are surrounded by /* ... */. If this
          is set to true, then defanged content is deleted instead.

      Debug
          If set, prints debugging output.

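      As an illustration of the content_callback option, here is a minimal
      sketch of a callback that edits the text between tags in place. It
      assumes the parameter order used in the SYNOPSIS (context, the
      HTML::Defang instance, then a scalar reference to the content). The
      sub name CollapseWhitespace and the sample text are hypothetical,
      and the sub is called directly so that the sketch runs without
      HTML::Defang installed.

```perl
use strict;
use warnings;

# Hypothetical content callback: collapses runs of whitespace in the text
# found between tags. Parameter order follows the SYNOPSIS:
# ($context, $Defang, $ContentR), where $ContentR is a scalar reference.
sub CollapseWhitespace {
    my ($Context, $Defang, $ContentR) = @_;
    $$ContentR =~ s/\s+/ /g;    # edit the content in place via the reference
    return;
}

# Direct call for demonstration only; in real use HTML::Defang makes this
# call when the sub is passed as content_callback => \&CollapseWhitespace.
my $Content = "Hello,\n\n   world";
CollapseWhitespace(undef, undef, \$Content);
print "$Content\n";    # prints "Hello, world"
```

      Because the content arrives by reference, the callback changes the
      output by assigning through $$ContentR rather than by returning a
      new string.
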
  HTML::Defang->new_bodyonly(%Options)
      Constructs a new HTML::Defang object with a set of implicit options
      preset.

      Basically this is an easy way to remove all HTML boilerplate content
      and return only the HTML body content.

COMMON PARAMETERS
  A number of the callbacks share the same parameters. These common
  parameters are documented here. Certain variables may have specific
  meanings in certain callbacks, so be sure to check the documentation for
  that method first before referring to this section.

      $context
          You can pass an arbitrary scalar as a 'context' value that's then
          passed as the first parameter to all callback functions. Most
          commonly this is something like '$Self'.

      $Defang
          Current HTML::Defang instance.

      $OpenAngle
          Opening angle (<) sign of the current tag.

      $lcTag
          Lower case version of the HTML tag that is currently being
          parsed.

      $IsEndTag
          Has the value '/' if the current tag is a closing tag.

      $AttributeHash
          A reference to a hash containing the attributes of the current
          tag and their values. Each value is a scalar reference to the
          value, rather than just a scalar value. You can add attributes
          (remember to make it a scalar ref, eg
          $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
          modify attribute values in this hash, and any changes you make
          will be incorporated into the output HTML stream.

          The attribute values will have any entity references decoded
          before being passed to you, and any unsafe values will be
          re-encoded back into the HTML stream. So for instance, the tag:

            <div title="&lt;&quot;Hi there &lt;">

          Will have the attribute hash:

            { title => \q[<"Hi there <] }

          And will be turned back into the HTML on output:

            <div title="&lt;&quot;Hi there &lt;">

      $CloseAngle
          Anything after the end of the last attribute, including the
          closing HTML angle (>).

      $HtmlR
          A scalar reference to the input HTML. The input HTML is parsed
          using m/\G$SomeRegex/gc constructs, so to continue from where
          HTML::Defang left off, clients can use m/\G$SomeRegex/gc for
          further processing on the input. One can also use the pos()
          function to determine where HTML::Defang left off. This,
          combined with the add_to_output() method, should give reasonable
          flexibility for the client to process the input.

      $OutR
          A scalar reference to the processed output HTML so far.

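  The \G and pos() mechanics described for $HtmlR can be tried in plain
  Perl. The sketch below does not call HTML::Defang; it simulates a parser
  that has consumed part of a string and then lets "client code" continue
  from the same position. The sample input and variable names are
  illustrative only.

```perl
use strict;
use warnings;

# Stand-in for the $HtmlR parameter: a scalar reference to some input.
my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;

# Simulate a parser that has just consumed the opening tag with \G and /gc.
$$HtmlR =~ m/\G<b>/gc or die "expected <b>";
my $After = pos($$HtmlR);
print "parser stopped at offset $After\n";        # offset 3

# Client code can continue from that point with its own \G match.
if ($$HtmlR =~ m/\G([^<]*)/gc) {
    print "text after the tag: '$1'\n";           # 'bold'
}

# With /c, a failed match leaves pos() where it was instead of resetting it.
$$HtmlR =~ m/\Gno-such-text/gc;
print "offset is now ", pos($$HtmlR), "\n";       # offset 7
```

  The position is stored on the string itself, which is why matching
  through the reference in a callback moves the point from which parsing
  resumes.
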
  tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
  $AttributeHash, $CloseAngle, $HtmlR, $OutR)
      If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
      tag present in $Defang->{tags_to_callback}, the above callback is
      made to the client code. The return value of this method determines
      whether the tag is defanged or not. More details below.

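      The kind of $AttributeHash editing a tags callback might perform can
      be sketched in isolation. The helper name TidyLinkAttributes and the
      attribute choices below are hypothetical; the only thing taken from
      this document is that the hash maps attribute names to scalar
      references. A real callback would call such a helper and then return
      one of the DEFANG_* constants.

```perl
use strict;
use warnings;

# Hypothetical helper for use inside a tags callback. $AttributeHash maps
# attribute names to SCALAR REFERENCES, as described under COMMON PARAMETERS.
sub TidyLinkAttributes {
    my ($AttributeHash) = @_;
    my $Target = $AttributeHash->{target};
    if ($Target && lc($$Target) eq '_blank') {
        $AttributeHash->{rel} = \"noopener";    # add: value must be a scalar ref
    }
    if (my $Href = $AttributeHash->{href}) {
        $$Href =~ s/^\s+//;                     # modify an existing value in place
    }
    delete $AttributeHash->{onclick};           # delete an attribute
    return;
}

# Hand-built attribute hash standing in for what the callback would receive.
my %Attributes = (
    href    => \(my $h = "  http://example.com/"),
    target  => \(my $t = "_blank"),
    onclick => \(my $o = "evil()"),
);
TidyLinkAttributes(\%Attributes);
print join(", ", map { $_ . "=" . ${ $Attributes{$_} } } sort keys %Attributes), "\n";
# prints: href=http://example.com/, rel=noopener, target=_blank
```
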
  attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $HtmlR,
  $OutR)
      If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
      an attribute present in $Defang->{attribs_to_callback}, the above
      callback is made to the client code. The return value of this method
      determines whether the attribute is defanged or not. More details
      below.

      Method parameters

      $lcAttrKey
          Lower case version of the HTML attribute that is currently being
          parsed.

      $AttrVal
          Reference to the HTML attribute value that is currently being
          parsed. See $AttributeHash for details of decoding.

      Return values

      DEFANG_NONE
          The current attribute will not be defanged.

      DEFANG_ALWAYS
          The current attribute will be defanged.

      DEFANG_DEFAULT
          The current attribute will be processed normally by HTML::Defang
          as if there was no callback method specified.

  url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
  $AttributeHash, $HtmlR, $OutR)
      If $Defang->{url_callback} exists, and HTML::Defang has parsed a URL,
      the above callback is made to the client code. The return value of
      this method determines whether the attribute containing the URL is
      defanged or not. URL callbacks can be made from <style> tags as well
      as style attributes, in which case the particular style declaration
      will be commented out. More details below.

      Method parameters

      $lcAttrKey
          Lower case version of the HTML attribute that is currently being
          parsed. However, if this callback is made as a result of parsing
          a URL in a style attribute, $lcAttrKey will be set to the string
          style, or will be set to undef if this callback is made as a
          result of parsing a URL inside a style tag.

      $AttrVal
          Reference to the URL value that is currently being parsed.

      $AttributeHash
          A reference to a hash containing the attributes of the current
          tag and their values. Each value is a scalar reference to the
          value, rather than just a scalar value. You can add attributes
          (remember to make it a scalar ref, eg
          $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
          modify attribute values in this hash, and any changes you make
          will be incorporated into the output HTML stream. Will be set to
          undef if the callback is made due to a URL in a <style> tag or
          attribute.

      Return values

      DEFANG_NONE
          The current URL will not be defanged.

      DEFANG_ALWAYS
          The current URL will be defanged.

      DEFANG_DEFAULT
          The current URL will be processed normally by HTML::Defang as if
          there was no callback method specified.

  css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
  $IsAttr, $OutR)
      If $Defang->{css_callback} exists, and HTML::Defang has parsed a
      <style> tag or style attribute, the above callback is made to the
      client code. The return value of this method determines whether a
      particular declaration in the style rules is defanged or not. More
      details below.

      Method parameters

      $Selectors
          Reference to an array containing the selectors in a style tag or
          attribute.

      $SelectorRules
          Reference to an array containing the style declaration blocks of
          all selectors in a style tag or attribute. Consider the below
          CSS:

            a { b:c; d:e}
            j { k:l; m:n}

          The declaration blocks will get parsed into the following data
          structure:

            [
              [
                [ "b", "c", DEFANG_DEFAULT ],
                [ "d", "e", DEFANG_DEFAULT ]
              ],
              [
                [ "k", "l", DEFANG_DEFAULT ],
                [ "m", "n", DEFANG_DEFAULT ]
              ]
            ]

          So, generally each property:value pair in a declaration is parsed
          into an array of the form

            ["property", "value", X]

          where X can be DEFANG_NONE, DEFANG_ALWAYS or DEFANG_DEFAULT, with
          DEFANG_DEFAULT being the default value. A client can manipulate
          this value to instruct HTML::Defang to defang this property:value
          pair.

            DEFANG_NONE    - Do not defang
            DEFANG_ALWAYS  - Defang the property:value pair
            DEFANG_DEFAULT - Process this as if there is no callback
                             specified

      $IsAttr
          True if the currently processed item is a style attribute. False
          if the currently processed item is a style tag.

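      Walking the $SelectorRules structure and overwriting the third slot
      of a declaration can be sketched as follows. This is only a sketch,
      assuming the structure is exactly the nested array shown in the
      example above. The helper name FlagDeclarations is hypothetical, and
      the flag is passed in as a plain string so that the code runs without
      HTML::Defang; a real css_callback would pass DEFANG_ALWAYS instead.

```perl
use strict;
use warnings;

# Hypothetical helper for use inside a css_callback. $Flag is supplied by the
# caller; $Unwanted is a code ref that decides which declarations to flag.
sub FlagDeclarations {
    my ($SelectorRules, $Flag, $Unwanted) = @_;
    my $Count = 0;
    for my $Block (@$SelectorRules) {         # one declaration block per selector
        for my $Declaration (@$Block) {       # [ "property", "value", X ]
            my ($Property, $Value) = @$Declaration;
            next unless $Unwanted->($Property, $Value);
            $Declaration->[2] = $Flag;        # overwrite X in place
            $Count++;
        }
    }
    return $Count;
}

# The structure from the example above, with the string "default" as a
# placeholder for DEFANG_DEFAULT and "always" for DEFANG_ALWAYS.
my $Rules = [
    [ [ "b", "c", "default" ], [ "d", "e", "default" ] ],
    [ [ "k", "l", "default" ], [ "m", "n", "default" ] ],
];
my $Flagged = FlagDeclarations($Rules, "always", sub { $_[0] eq "k" });
print "$Flagged declaration(s) flagged; k is now '$Rules->[1][0][2]'\n";
# prints: 1 declaration(s) flagged; k is now 'always'
```
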
PUBLIC METHODS
  defang($InputHtml, \%Opts)
      Cleans up $InputHtml of any executable code including scripting,
      embedded objects, applets, etc., and defangs any XSS attacks.

      Returns the cleaned HTML. If fix_mismatched_tags is set, any
      unbalanced tags that appear in @$mismatched_tags_to_fix are
      automatically commented out or closed.

  add_to_output($String)
      Appends $String to the output after the current parsed tag ends. Can
      be used by client code in callback methods to add HTML text to the
      processed output. If the HTML text needs to be defanged, client code
      can safely call HTML::Defang->defang() recursively from within the
      callback.

      Method parameters

      $String
          The string that is added after the current parsed tag ends.

INTERNAL METHODS
  Generally these methods never need to be called by users of the class,
  because they'll be called internally as the appropriate tags are
  encountered, but they may be useful for some users in some cases.

  defang_script_tag($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
  $TagTrail, $Attributes, $CloseAngle)
      This method is invoked when a <script> tag is parsed. Defangs the
      <script> opening tag, and any closing tag. Any scripting content is
      also commented out, so browsers don't display it.

      Returns 1 to indicate that the <script> tag must be defanged.

      Method parameters

      $OutR
          A reference to the processed output HTML before the tag that is
          currently being parsed.

      $HtmlR
          A scalar reference to the input HTML.

      $TagOps
          Indicates what operation should be done on a tag. Can be
          undefined, an integer or a code reference. Undefined indicates a
          tag unknown to HTML::Defang, 1 indicates a known safe tag, 0
          indicates a known unsafe tag, and a code reference indicates a
          subroutine that should be called to parse the current tag. For
          example, <style> and <script> tags are parsed by dedicated
          subroutines.

      $OpenAngle
          Opening angle (<) sign of the current tag.

      $IsEndTag
          Has the value '/' if the current tag is a closing tag.

      $Tag
          The HTML tag that is currently being parsed.

      $TagTrail
          Any space after the tag, but before attributes.

      $Attributes
          A reference to an array of the attributes and their values,
          including any surrounding spaces. Each element of the array is
          added by 'push' calls like below.

            push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];

      $CloseAngle
          Anything after the end of the last attribute, including the
          closing HTML angle (>).

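      The seven-field element shape of $Attributes can be explored with
      hand-built data. The sketch below assumes the fields are meant to be
      concatenated in the order listed in the 'push' call to recover the
      original attribute text; that reading is an inference from the field
      names, not something this document states. The sample attributes are
      invented for illustration.

```perl
use strict;
use warnings;

# Two hand-built elements in the documented seven-field shape:
# name, space before '=', '=' plus following space, quote, value, quote,
# space after the value.
my @Attributes = (
    [ 'href', ' ', '= ', '"', 'a.html', '"', ' ' ],
    [ 'id',   '',  '=',  "'", 'x',      "'", ''  ],
);

# Concatenating the fields in order gives back the attribute text,
# including quotes and surrounding whitespace.
my $Text = join '', map { join '', @$_ } @Attributes;
print "[$Text]\n";    # prints [href = "a.html" id='x']

# Field 0 is the attribute name and field 4 is its value.
my %Values = map { ( lc($_->[0]) => $_->[4] ) } @Attributes;
print "href value: $Values{href}\n";    # prints href value: a.html
```
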
  defang_style_text($Content, $lcTag, $IsAttr, $AttributeHash, $HtmlR,
  $OutR)
      Defangs some raw CSS data and returns the defanged content.

      Method parameters

      $Content
          The input style string that is defanged.

      $IsAttr
          True if $Content is from an attribute, otherwise from a <style>
          block.

  cleanup_style($StyleString)
      Helper function to clean up CSS data. This function directly
      operates on the input string without taking a copy.

  defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
  $AttributeHash, $HtmlR, $OutR)
      Defangs style data.

      Method parameters

      $SelectorsIn
          An array reference to the selectors in the style tag/attribute
          contents.

      $StyleRules
          An array reference to the declaration blocks in the style
          tag/attribute contents.

      $lcTag
          Lower case version of the HTML tag that is currently being
          parsed.

      $IsAttr
          Whether we are currently parsing a style attribute or style tag.
          $IsAttr will be true if we are currently parsing a style
          attribute.

      $HtmlR
          A scalar reference to the input HTML.

      $OutR
          A scalar reference to the processed output so far.

  defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
  $TagTrail, $Attributes, $CloseAngle)
      Defangs attributes, defangs tags, and makes the tag, attrib, css and
      url callbacks.

      Method parameters

      For a description of the method parameters, see the documentation of
      the defang_script_tag() method.

  cleanup_attribute($AttributeString)
      Helper function to clean up attributes.

SEE ALSO
  <http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
  HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR
  Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
  <cpan@robm.fastmail.fm> for initial code, guidance and support and bug
  fixes.

COPYRIGHT AND LICENSE
  Copyright (C) 2003-2013 by FastMail Pty Ltd

  This library is free software; you can redistribute it and/or modify it
  under the same terms as Perl itself.