Processing XML
A rewriting system approach

           Alberto Simões
 alberto.simoes@eu.ipp.pt


  Portuguese Perl Workshop – 2010




       Alberto Simões   Processing XML: a rewriting system approach
Motivation and Goals


     XML is usually generated from structured information:
         databases, spreadsheets, forms, etc.

     but it can also be generated from unstructured
     (or poorly structured) data:
         textual documents, domain-specific languages;

 The question arises:
 How to produce XML documents from textual documents?
     write a parser (natural language, domain-specific, etc.);
     produce XML by rewriting the textual document!




How does textual rewriting work?


    write rewriting rules:

                rule ≅ pattern × restriction × action


         pattern: a regular (or irregular) expression that should
                  be textually matched;
     restriction: conditional code that checks whether the rule
                  should be applied;
          action: a piece of code (or simply a string) that
                  produces the text that replaces the
                  originally matched text;
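
As a rough illustration of the triple above, a rule can be modelled as a small record with a pattern, an optional restriction and an action. This is a Python sketch and not how Text::RewriteRules is implemented; the names `Rule` and `apply_once` are made up for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Rule:
    pattern: str                                    # regular expression to match
    action: Union[str, Callable[[re.Match], str]]   # replacement string or code
    restriction: Optional[Callable[[re.Match], bool]] = None  # optional guard


def apply_once(rule: Rule, text: str) -> Optional[str]:
    """Rewrite the first match that passes the restriction.

    Returns the rewritten text, or None when the rule cannot be applied.
    """
    for m in re.finditer(rule.pattern, text):
        if rule.restriction is None or rule.restriction(m):
            new = rule.action(m) if callable(rule.action) else rule.action
            return text[:m.start()] + new + text[m.end():]
    return None


# Hypothetical rule: double a number, but only when it is below 10.
double_small = Rule(
    pattern=r"\d+",
    restriction=lambda m: int(m.group()) < 10,
    action=lambda m: str(int(m.group()) * 2),
)

print(apply_once(double_small, "12 apples and 3 pears"))  # 12 apples and 6 pears
```

The restriction rejects "12", so the rule moves on to "3" and rewrites only that match.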



Are there text rewriting tools?


 For this work we used Text::RewriteRules:

     written in Perl:
          the power of the Perl regular expression engine;
          reflective language (code can be generated on the fly);
     supports different rewriting approaches:
          fixed-point rewriting approach;
          sliding-cursor rewriting approach;
          lexical analyzer approach;

     developed in-house;




Fixed-point rewriting approach

 Algorithm
     easy to understand;
     a sequence of rules that are applied in order;
     the first rule is tried first, and a later rule is only applied if
     no earlier rule can be applied;
     a rule may change the document in such a way
     that an earlier rule becomes applicable again;
     the process ends when no rule can be
     applied (or when a specific rule forces the system to stop);

 Code example: anonymization of emails
   RULES anonymize
   \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
   ENDRULES
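
The algorithm above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (rules are plain `(pattern, replacement)` pairs, and `max_steps` is a safety limit added for the sketch); it is not the code that Text::RewriteRules generates.

```python
import re


def fixed_point_rewrite(text, rules, max_steps=10_000):
    """Apply ordered (pattern, replacement) rules until none of them matches.

    On each step the first rule whose pattern matches rewrites its leftmost
    match, and the scan restarts from the first rule.
    """
    for _ in range(max_steps):
        for pattern, replacement in rules:
            new_text, count = re.subn(pattern, replacement, text, count=1)
            if count:
                text = new_text
                break            # something changed: restart from the first rule
        else:
            return text          # no rule applied: fixed point reached
    raise RuntimeError("no fixed point reached; the rules may loop forever")


EMAIL = r"\w+(\.\w+)*@\w+\.\w+(\.\w+)*"
anonymize = [(EMAIL, "[[hidden email]]")]

print(fixed_point_rewrite("mail ana.silva@example.org or bob@example.com", anonymize))
# mail [[hidden email]] or [[hidden email]]
```

The loop terminates here because the replacement text contains no `@` and so can never match the pattern again. A rule whose output matches its own pattern would run until the step limit.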


Sliding-cursor rewriting approach
 Algorithm
     the cursor is placed at the beginning of the string;
     patterns are matched only if they occur right after the cursor;
     if a rule is applied, the cursor is placed after the rewritten region;
     if no rule matches, the cursor moves ahead one character;
     the process ends when the cursor reaches the end of the string;
     it never rewrites text that was already rewritten.
 Code example: brute force translation
 RULES/m translate
 (\w+)=e=> $translation{$1} !! exists($translation{$1})
 ENDRULES

 Example
   _ latest train
   último _ train
   último combóio _
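
The cursor movement traced above can be sketched in Python as follows. This is an illustration only: the triple layout `(pattern, action, restriction)` and the function name are assumptions of the sketch, not the Text::RewriteRules internals.

```python
import re


def sliding_cursor_rewrite(text, rules):
    """rules: ordered (compiled_pattern, action, restriction) triples.

    action and restriction both receive the match object.
    """
    out = []
    pos = 0
    while pos < len(text):
        for pattern, action, restriction in rules:
            m = pattern.match(text, pos)         # must match right at the cursor
            if m and m.end() > pos and restriction(m):
                out.append(action(m))            # emitted text is never rescanned
                pos = m.end()                    # cursor jumps past the match
                break
        else:
            out.append(text[pos])                # no rule applied: copy one char
            pos += 1
    return "".join(out)


translation = {"latest": "último", "train": "combóio"}
translate = [(re.compile(r"\w+"),
              lambda m: translation[m.group()],
              lambda m: m.group() in translation)]

print(sliding_cursor_rewrite("latest train", translate))  # último combóio
```

Because rewritten text is appended to the output and never scanned again, even a rule such as `a` to `aa` terminates. Note that advancing one character at a time lets a later attempt start in the middle of a word; a pattern with a word-boundary anchor would avoid that.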
Valid Rewriting Rules

 Different approaches allow different kinds of rules...
 but the most relevant ones are:
         ==> simple pattern substitution: the left-hand side is
             a Perl regular expression and the right-hand side
             is the string that will replace the match;
        =e=> similar to the previous one, but the right-hand side
             is Perl code to be evaluated; its result
             replaces the match;
  =begin=> has no left-hand side; the right-hand side code is
           executed before rewriting starts;
     =end=> has no right-hand side; when the left-hand side
            pattern matches, the rewriting system stops;
 they can include a restriction block (!!) to the right of the action.
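
To make the four rule kinds concrete, here is a Python sketch of a fixed-point loop that models each of them with a plain value. The mapping is an assumption made for illustration: a string action stands for `==>`, a function for `=e=>`, the `begin` argument for `=begin=>`, the hypothetical `STOP` marker for `=end=>`, and the third element of each triple for the `!!` restriction.

```python
import re

STOP = object()   # hypothetical marker standing in for an =end=> rule


def rewrite(text, rules, begin=None, max_steps=10_000):
    """rules: ordered (pattern, action, restriction) triples."""
    if begin is not None:
        begin()                                  # like =begin=>: runs once, first
    for _ in range(max_steps):
        for pattern, action, restriction in rules:
            m = next((m for m in re.finditer(pattern, text)
                      if restriction is None or restriction(m)), None)
            if m is None:
                continue                         # rule not applicable: try the next
            if action is STOP:
                return text                      # like =end=>: quit immediately
            new = action(m) if callable(action) else action
            text = text[:m.start()] + new + text[m.end():]
            break                                # restart from the first rule
        else:
            return text                          # nothing applied: fixed point
    raise RuntimeError("rewriting did not terminate")


rules = [
    (r"<done/>", STOP, None),                        # =end=>
    (r"\bTODO\b", "FIXME", None),                    # ==>
    (r"\d+", lambda m: str(int(m.group()) + 1),      # =e=> with a !! restriction
     lambda m: int(m.group()) < 3),
]

print(rewrite("TODO 1 and 5", rules))  # FIXME 3 and 5
```

The number 1 is incremented until the restriction fails at 3, while 5 is never touched. An input containing `<done/>` is returned unchanged, because the stopping rule comes first.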


Rewriting Text into XML



 How to produce XML from weakly structured data?
     write a parser;
     or rewrite the data step-by-step into XML!



 Two case studies:
     Rewriting a dictionary in textual format into TEI;
     Rewriting an XML DSL authoring tool into XML;




Rewriting Text into TEI
Rewrite this...                         ...into this!
*Cachimbo*,                             <entry id="cachimbo">
_m._                                    <form><orth>Cachimbo</orth></form>
Apparelho de fumador, composto d..      <sense>
Peça de ferro, em que entra o es..      <gramGrp>m.</gramGrp>
Buraco, em que se encaixa a vela..      <def>
* _Bras. de Pernambuco._                Apparelho de fumador, composto d..
Bebida, preparada com aguardente..      Peça de ferro, em que entra o es..
* _Pl. Gír._                            Buraco, em que se encaixa a vela..
Pés.                                    </def>
(Do químb. _quixima_)                   </sense>
                                        <sense ast="1">
                                        <usg type="geo">Bras. de Pernamb..
                                        <def>
                                        Bebida, preparada com aguardente..
                                        </def>
                                        </sense>
                                        <sense ast="1"><gramGrp>Pl.</gra..
                                        <usg type="style">Gír.</usg>
                                        <def>
                                        Pés.
                                        </def>
                                        </sense>
                                        <etym ori="químb">(Do químb. _qu..
Rewriting Text into TEI

 This rewrite was all based on:
     a few tables (grammatical and usage strings);
          entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
          entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit

     rewriting the little existing mark-up into a better XML structure;
 ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_


     rewriting the new XML structure to detect and annotate a
     more complex structure;
 <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g


     detecting and correcting wrong XML elements.
 </form></sense>==></form>
 </form></def>\n</sense>==></form>


Rewriting Text into TEI

 Case study conclusions:
     flexible tool;
     works on big files:
         Text file is 13 MB;
         Output XML is 30 MB;
         Process takes about nine minutes!
     we even rewrote XML into XML.



                Hey!! XML is text!!
              How can we rewrite it!?

Rewriting XML



    different from the usual DOM- or SAX-oriented approaches;
    looks at XML as text, i.e. unstructured data;
    rewriting can be done:
        as in any other text rewriting system;
        taking advantage of irregular expressions.



     Irregular expressions? Are you kidding?




Not so regular expressions

 Perl has a powerful regular expression engine:
     regular expressions can define capture groups:
     small pieces of the match that can be used later;
     regular expressions can define look-ahead or look-behind assertions:
     they check the context of the matching zone;
     since Perl 5.10, regular expressions can be recursive:
     a regular expression can refer to itself.
     my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

 For XML, we defined two classes:
 [[:XML:]] matches any well-formed XML fragment;
 [[:XML(tag):]] matches an XML fragment with a specific
          root element;
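
What a class like `[[:XML(tag):]]` has to recognise can be sketched without recursive regular expressions. Python's standard `re` module has no recursion, so the sketch below swaps in an explicit stack of open tags instead. It is a simplified illustration, not the definition used by Text::RewriteRules: the function name `match_xml` is made up, and comments, CDATA sections, processing instructions and `>` inside attribute values are not handled.

```python
import re

# Simplified tag syntax: opening, closing and self-closing tags only.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def match_xml(text, pos=0, tag=None):
    """Return the end offset of a balanced element starting exactly at pos.

    Returns None when there is no balanced element there. If tag is given,
    the root element must have that name.
    """
    first = TAG.match(text, pos)
    if not first or first.group(1):               # an opening tag is required
        return None
    if tag is not None and first.group(2) != tag:
        return None
    if first.group(3):                            # <tag/> is already complete
        return first.end()
    stack = [first.group(2)]
    cursor = first.end()
    while stack:
        m = TAG.search(text, cursor)
        if m is None:
            return None                           # input ended: unbalanced
        closing, name, self_closing = m.groups()
        if closing:
            if stack.pop() != name:
                return None                       # mismatched closing tag
        elif not self_closing:
            stack.append(name)
        cursor = m.end()
    return cursor


doc = '<tmx><tu id="1"><seg>a <b>b</b></seg></tu><tu id="2"/></tmx>'
start = doc.index("<tu")
end = match_xml(doc, start, tag="tu")
print(doc[start:end])  # <tu id="1"><seg>a <b>b</b></seg></tu>
```

The stack plays the role that recursion plays in the Perl pattern: each opening tag must be closed by a tag of the same name before the outer element can end.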


Rewriting XML

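 As a simple example, we can remove duplicate translation units
 from a translation memory file:
 Code example
 use Text::RewriteRules;
 use Digest::MD5 qw(md5);   # md5
 use XML::DT;               # dtstring

 my %visited;               # digests of the translation units seen so far

 RULES/m duplicates
 ([[:XML(tu):]])==>!!duplicate($1)
 ENDRULES

 # True when a unit with the same contents was already seen.
 sub duplicate {
   my $tu = shift;
   # -default => sub{$c} makes every element return just its contents
   my $tumd5 = md5(dtstring($tu,
                            -default => sub{$c}));
   return 1 if exists $visited{$tumd5};
   $visited{$tumd5}++;
   return 0;
 }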
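
The same idea can be sketched in Python for readers who want to try it without Perl. The sketch makes its own assumptions: `<tu>` elements never nest (so a non-greedy regular expression is enough to find them), the text content is obtained by simply stripping tags, and the function name `drop_duplicate_units` is made up for the example.

```python
import hashlib
import re

TU = re.compile(r"<tu\b[^>]*>.*?</tu>", re.DOTALL)   # assumes <tu> never nests


def drop_duplicate_units(tmx: str) -> str:
    """Delete every <tu> whose text content was already seen earlier."""
    seen = set()

    def keep_or_drop(m: re.Match) -> str:
        content = re.sub(r"<[^>]+>", "", m.group())   # text without mark-up
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen:
            return ""                                  # duplicate: delete it
        seen.add(digest)
        return m.group()                               # first sighting: keep

    return TU.sub(keep_or_drop, tmx)


tmx = ('<body>'
       '<tu><tuv><seg>hello</seg></tuv><tuv><seg>olá</seg></tuv></tu>'
       '<tu><tuv><seg>train</seg></tuv><tuv><seg>combóio</seg></tuv></tu>'
       '<tu><tuv><seg>hello</seg></tuv><tuv><seg>olá</seg></tuv></tu>'
       '</body>')

print(drop_duplicate_units(tmx).count("<tu>"))  # 2
```

Storing a digest instead of the unit itself keeps the `seen` set small even for large translation memories, which is presumably why the Perl version hashes the content as well.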


Conclusions


    The rewriting approach is:
        flexible;
        powerful;
        easy to learn;
        grows quickly;
        big systems can be difficult to maintain;
    The Perl regular expression engine:
        makes it easy to match anything;
        almost supports full grammars;
        makes it possible to define block structures;

    So, it can be applied to XML easily!




Thank you




               Thank You!



              Alberto Simões
        alberto.simoes@eu.ipp.pt




Dictionary Alignment by Rewrite-based Entry TranslationAlberto Simões
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesAlberto Simões
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAlberto Simões
 
Aula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAlberto Simões
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAlberto Simões
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAlberto Simões
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with PerlAlberto Simões
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaAlberto Simões
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaAlberto Simões
 

More from Alberto Simões (20)

Source Code Quality
Source Code QualitySource Code Quality
Source Code Quality
 
Language Identification: A neural network approach
Language Identification: A neural network approachLanguage Identification: A neural network approach
Language Identification: A neural network approach
 
Google Maps JS API
Google Maps JS APIGoogle Maps JS API
Google Maps JS API
 
Making the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionary
 
Dictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry TranslationDictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment by Rewrite-based Entry Translation
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized Dictionaries
 
Modelação de Dados
Modelação de DadosModelação de Dados
Modelação de Dados
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de Sequência
 
Aula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de AtividadeAula 03 - Introdução aos Diagramas de Atividade
Aula 03 - Introdução aos Diagramas de Atividade
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de Requisitos
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de Informação
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with Perl
 
PLN em Perl
PLN em PerlPLN em Perl
PLN em Perl
 
Classification Systems
Classification SystemsClassification Systems
Classification Systems
 
Redes de Pert
Redes de PertRedes de Pert
Redes de Pert
 
Dancing Tutorial
Dancing TutorialDancing Tutorial
Dancing Tutorial
 
Sistemas de Numeração
Sistemas de NumeraçãoSistemas de Numeração
Sistemas de Numeração
 
Álgebra de Boole
Álgebra de BooleÁlgebra de Boole
Álgebra de Boole
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução Automática
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução Automática
 

Recently uploaded

Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 

Recently uploaded (20)

Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 

Processing XML: a rewriting system approach

  • 1. Processing XML A rewriting system approach Alberto Simões alberto.simoes@eu.ipp.pt Portuguese Perl Workshop – 2010 Alberto Simões Processing XML: a rewriting system approach
  • 2. Motivation and Goals
      XML is usually generated from structured information:
          databases, spreadsheets, forms, etc.
      But it can also be generated from unstructured (or poorly structured) data:
          textual documents, domain-specific languages.
      The question arises: how to produce XML documents from textual documents?
          write a parser (natural language, domain specific, etc.);
          produce XML by rewriting the textual document!
  • 7. How does textual rewriting work?
      Write rewriting rules: rule ≅ pattern × restriction × action
          pattern: a regular (or irregular) expression that should be textually matched;
          restriction: conditional code that checks whether the rule should be applied;
          action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
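The rule triple can be sketched in a few lines. The sketch below uses Python rather than Text::RewriteRules, purely to make the three components explicit; the `Rule` class, its field names and `apply_once` are hypothetical names chosen for illustration, and trying later matches when the restriction rejects an earlier one is an assumption, not documented behaviour of the tool.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Rule:
    """Illustrative rule = pattern x restriction x action (names are hypothetical)."""
    pattern: str
    action: Union[str, Callable[[re.Match], str]]
    restriction: Optional[Callable[[re.Match], bool]] = None

    def apply_once(self, text: str) -> Optional[str]:
        """Rewrite the first match accepted by the restriction; None if no match applies."""
        for m in re.finditer(self.pattern, text):
            if self.restriction is None or self.restriction(m):
                # A string action is used literally; a callable action computes the text.
                new = self.action(m) if callable(self.action) else self.action
                return text[:m.start()] + new + text[m.end():]
        return None


# Double every number smaller than 10; larger numbers are rejected by the restriction.
double_small = Rule(r"\d+",
                    action=lambda m: str(int(m.group()) * 2),
                    restriction=lambda m: int(m.group()) < 10)
```

Here `double_small.apply_once("a 12 b 3")` skips `12` because the restriction fails and rewrites `3` instead.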
  • 11. Are there text rewriting tools?
      For this work we used Text::RewriteRules:
          written in Perl:
              the power of the Perl regular expression engine;
              a reflexive language (code can be generated on the fly);
          supports different rewriting approaches:
              fixed-point rewriting approach;
              sliding-cursor rewriting approach;
              lexical analyzer approach;
          home-developed.
  • 15. Fixed-point rewriting approach
      Algorithm
          easy to understand;
          a sequence of rules that are applied in order;
          the first rule is applied, and each following rule is applied only if no earlier rule can be applied;
          a rule may change the document in such a way that an earlier rule becomes applicable again;
          the process ends when no rule can be applied (or when a specific rule forces the system to stop).
      Code example: anonymization of emails
          RULES anonymize
          \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
          ENDRULES
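The fixed-point loop described above can be sketched as follows. This is a Python illustration of the algorithm as the slide states it, not the Text::RewriteRules implementation; `fixed_point_rewrite`, the `(pattern, replacement)` rule shape and the `max_steps` safety limit are assumptions made for the example.

```python
import re


def fixed_point_rewrite(text, rules, max_steps=10_000):
    """Apply ordered (pattern, replacement) rules until none of them changes the text."""
    for _ in range(max_steps):
        for pattern, replacement in rules:
            new_text, count = re.subn(pattern, replacement, text, count=1)
            # A match that leaves the text unchanged is treated as "not applied",
            # otherwise the loop could never reach a fixed point.
            if count and new_text != text:
                text = new_text
                break  # restart from the first rule
        else:
            return text  # no rule applied: fixed point reached
    raise RuntimeError("no fixed point reached within max_steps")


# The e-mail anonymization rule from the slide, as a single-rule system.
EMAIL_RULES = [(r"\w+(\.\w+)*@\w+\.\w+(\.\w+)*", "[[hidden email]]")]
```

For instance, `fixed_point_rewrite("mail ana@example.org or rui.silva@mail.example.pt", EMAIL_RULES)` rewrites one address per pass and stops once no address is left, since the replacement text itself does not match the pattern.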
  • 17. Sliding-cursor rewriting approach
      Algorithm
          the cursor is placed at the beginning of the string;
          patterns are matched only if they occur right after the cursor;
          if a rule is applied, the cursor is placed after the rewritten region;
          if no rule matches, the cursor moves ahead one character;
          the process ends when the cursor reaches the end of the string;
          text that was already rewritten is never rewritten again.
      Code example: brute-force translation
          RULES/m translate
          (\w+)=e=> $translation{$1} !! exists($translation{$1})
          ENDRULES
      Example (_ marks the cursor)
          _ latest train
          último _ train
          último combóio _
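A minimal sketch of the sliding-cursor loop, again in Python rather than in Text::RewriteRules. The function name, the `(pattern, action, restriction)` triple and the two-word `TRANSLATION` table are hypothetical; the output is built separately from the input, which is one simple way to guarantee that rewritten text is never revisited.

```python
import re


def sliding_cursor_rewrite(text, rules):
    """rules: list of (compiled_pattern, action, restriction) triples, tried in order."""
    out = []   # rewritten output, never re-scanned
    pos = 0    # the cursor
    while pos < len(text):
        for pattern, action, restriction in rules:
            m = pattern.match(text, pos)  # must match right at the cursor
            if m and m.end() > pos and restriction(m):
                out.append(action(m))
                pos = m.end()             # jump past the rewritten region
                break
        else:
            out.append(text[pos])         # no rule matched: copy one character
            pos += 1
    return "".join(out)


# Hypothetical two-word dictionary mirroring the slide's example.
TRANSLATION = {"latest": "último", "train": "combóio"}
TRANSLATE = [(re.compile(r"\w+"),
              lambda m: TRANSLATION[m.group()],
              lambda m: m.group() in TRANSLATION)]
```

Calling `sliding_cursor_rewrite("latest train", TRANSLATE)` follows the trace on the slide: the cursor rewrites `latest`, copies the space, then rewrites `train`. Words missing from the table are copied character by character.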
  • 20. Valid Rewriting Rules
      Different approaches support different rules, but the most relevant ones are:
          ==>       simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
          =e=>      like the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
          =begin=>  has no left-hand side; the right-hand side code is executed before the rewrite starts;
          =end=>    has no right-hand side; when the left-hand side pattern matches, the rewrite system quits.
      Any rule can include a restriction block (!!) to the right of the action.
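To make the four rule kinds concrete, the sketch below models them as tagged tuples inside a small fixed-point loop. It is a Python illustration only: the kind names `"subst"`, `"eval"`, `"begin"` and `"end"`, the tuple layout and `run_rules` are invented here and are not the Text::RewriteRules syntax; restriction blocks are left out for brevity.

```python
import re


def run_rules(text, rules):
    """rules: list of (kind, pattern, rhs) with kind in subst / eval / begin / end."""
    # "begin" rules have no pattern: their code runs once before rewriting starts.
    for kind, _pattern, rhs in rules:
        if kind == "begin":
            rhs()
    changed = True
    while changed:
        changed = False
        for kind, pattern, rhs in rules:
            if kind == "begin":
                continue
            m = re.search(pattern, text)
            if not m:
                continue
            if kind == "end":
                return text  # an "end" rule quits the rewrite system when it matches
            # "eval" computes the replacement from the match; "subst" uses a fixed string.
            replacement = rhs(m) if kind == "eval" else rhs
            new_text = text[:m.start()] + replacement + text[m.end():]
            if new_text != text:
                text = new_text
                changed = True
                break  # restart from the first rule
    return text


log = []
DEMO_RULES = [
    ("begin", None, lambda: log.append("start")),
    ("end", r"STOP", None),
    ("eval", r"(\d+)\+(\d+)", lambda m: str(int(m.group(1)) + int(m.group(2)))),
    ("subst", r"  +", " "),
]
```

With these rules, `run_rules("1+2+3  =  ?", DEMO_RULES)` first folds the sums and then squeezes the double spaces, while any input containing `STOP` is returned untouched.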
  • 26. Rewriting Text into XML
      How to produce XML from weakly structured data?
          write a parser;
          or rewrite the data step by step into XML!
      Two case studies:
          rewriting a dictionary in textual format into TEI;
          rewriting an XML DSL authoring tool into XML.
  • 28. Rewriting Text into TEI
      Rewrite this...
          *Cachimbo*,
          _m._
          Apparelho de fumador, composto d..
          Peça de ferro, em que entra o es..
          Buraco, em que se encaixa a vela..
          * _Bras. de Pernambuco._
          Bebida, preparada com aguardente..
          * _Pl. Gír._
          Pés.
          (Do químb. _quixima_)
      ...into this!
          <entry id="cachimbo">
          <form><orth>Cachimbo</orth></form>
          <sense>
          <gramGrp>m.</gramGrp>
          <def>
          Apparelho de fumador, composto d..
          Peça de ferro, em que entra o es..
          Buraco, em que se encaixa a vela..
          </def>
          </sense>
          <sense ast="1">
          <usg type="geo">Bras. de Pernamb..
          <def>
          Bebida, preparada com aguardente..
          </def>
          </sense>
          <sense ast="1"><gramGrp>Pl.</gra..
          <usg type="style">Gír.</usg>
          <def>
          Pés.
          </def>
          </sense>
          <etym ori="químb">(Do químb. _qu..
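As an illustration of a single step in such a rewrite, the sketch below turns a headword line like `*Cachimbo*,` into the opening of a TEI entry. It is written in Python and is not one of the author's rules: the `HEADWORD` pattern, the function names and the choice to lower-case the headword for the `id` attribute are assumptions read off the example, and XML escaping of the headword is not handled.

```python
import re

# A headword line is assumed to look like "*Word*," alone on its line.
HEADWORD = re.compile(r"^\*([^*\n]+)\*,[ \t]*$", re.MULTILINE)


def open_entry(m):
    """Build the <entry> opening and the <form> block for one headword match."""
    word = m.group(1)
    return f'<entry id="{word.lower()}">\n<form><orth>{word}</orth></form>'


def mark_headwords(text):
    """Rewrite every headword line; all other lines are left untouched."""
    return HEADWORD.sub(open_entry, text)
```

Lines such as `* _Pl. Gír._` also start with an asterisk but have no closing `*,`, so they are left for later rules.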
  • 29. Rewriting Text into TEI
      This rewrite was all based on:
          a few tables (grammatical and usage strings):
              entry genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
              entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
          rewriting the sparse mark-up into a better XML structure:
              ((\* )?_([^_]|_[^_]{1,5}_)+_( \*)?)\n=e=>$a=$1;end_def.end_sense.start_
          rewriting the new XML structure to detect and annotate more complex structure:
              <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".g
          detecting and correcting wrong XML elements:
              </form></sense>==></form>
              </form></def>\n</sense>==></form>
  • 33. Rewriting Text into TEI Case study conclusions: flexible tool; works on big files: the text file is 13 MB, the output XML is 30 MB, and the process takes about nine minutes; we even rewrote XML into XML. Hey!! XML is text!! How can we rewrite it!?
  • 35. Rewriting XML Different from the usual DOM- or SAX-oriented approaches; looks at XML as text, that is, as non-structured data; rewriting can be done: as in any other text rewriting system; taking advantage of irregular expressions. Irregular expressions? Are you kidding?
  • 39. Not so regular expressions Perl has a powerful regular expression engine: regular expressions can define capture zones: small pieces of the match that can be used later; regular expressions can define look-ahead or look-behind: checks on the context of the matching zone; since Perl 5.10, regular expressions can be recursive: regular expressions that depend on themselves. my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; For XML, we defined two classes: [[:XML:]] matches any well-formed XML fragment; [[:XML(tag):]] matches an XML fragment with a specific root element.
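A short sketch of what recursion buys. The first pattern is the balanced-parentheses one from the slide. The second, `xml_element_re`, is a hypothetical helper written for this illustration, not the implementation behind `[[:XML(tag):]]`; it assumes the element has no self-closing form and that the text contains no comments, CDATA sections or `>` inside attribute values.

```perl
use strict;
use warnings;

# Balanced parentheses: (?-1) recurses into the enclosing capture group.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my $expr   = 'f(a, g(b, (c)), d) + h(e)';
my @groups = $expr =~ /$parens/g;
print "$_\n" for @groups;

# Hypothetical helper: build a recursive pattern for one element name.
sub xml_element_re {
    my ($tag) = @_;
    my $t = quotemeta $tag;
    return qr{
        (                            # group that (?-1) recurses into
            <$t\b[^<>]*+(?<!/)>      # opening tag, not self-closing
            (?: [^<]++               # text content
              | (?-1)                # nested element with the same name
              | <(?!/?$t\b)[^<>]*+>  # any tag with a different name
            )*+
            </$t\s*>                 # matching closing tag
        )
    }x;
}

my $tu    = xml_element_re('tu');
my $doc   = '<tmx><tu id="1"><seg>a <tu>nested</tu></seg></tu><tu>b</tu></tmx>';
my @units = $doc =~ /$tu/g;
print scalar(@units), " top-level tu elements\n";
```

The possessive quantifiers (`++`, `*+`) stop the engine from backtracking into text it has already accepted, which keeps matching predictable on large inputs.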
  • 45. Rewriting XML As a simple example, we can remove duplicate translation units in a translation memory file. Code example:
      RULES/m duplicates
      ([[:XML(tu):]])==>!!duplicate($1)
      ENDRULES

      sub duplicate {
          my $tu = shift;
          my $tumd5 = md5(dtstring($tu, -default => sub{$c}));
          return 1 if exists $visited{$tumd5};
          $visited{$tumd5}++;
          return 0;
      }
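The same idea can be tried without the rewriting DSL. The sketch below is a plain-Perl approximation under stated assumptions: `<tu>` elements do not nest, so a non-recursive pattern is enough, and a crude tag-stripping substitution stands in for the `dtstring` call on the slide. The sample data is made up for the illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %visited;

# Returns 1 when a unit with the same text content was already seen.
sub duplicate {
    my ($tu) = @_;
    (my $content = $tu) =~ s/<[^>]*>/ /g;  # assumption: stand-in for dtstring
    $content =~ s/\s+/ /g;                 # normalise whitespace
    return $visited{ md5_hex($content) }++ ? 1 : 0;
}

# Made-up translation memory with one repeated unit.
my $tmx = join '',
    '<body>',
    '<tu><tuv><seg>Good morning</seg></tuv><tuv><seg>Bom dia</seg></tuv></tu>',
    '<tu><tuv><seg>Thank you</seg></tuv><tuv><seg>Obrigado</seg></tuv></tu>',
    '<tu><tuv><seg>Good morning</seg></tuv><tuv><seg>Bom dia</seg></tuv></tu>',
    '</body>';

# Replace each duplicate unit with nothing; keep the first occurrence.
$tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{ duplicate($1) ? '' : $1 }gse;

my $remaining = () = $tmx =~ /<tu\b/g;
print "$remaining units remain\n";
```

Hashing the content rather than storing it keeps the `%visited` table small, which matters on translation memories with many long segments.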
  • 46. Conclusions The rewriting approach is: flexible; powerful; easy to learn; but it grows quickly, and big systems can be difficult to maintain. The Perl regular expression engine: makes it easy to match anything; almost supports full grammars; makes it possible to define block structures. So, it can be applied to XML easily!
  • 47. Thank You! Alberto Simões alberto.simoes@eu.ipp.pt