GSP
Quick Navigator

Search Site

Unix VPS
A - Starter
B - Basic
C - Preferred
D - Commercial
MPS - Dedicated
Previous VPSs
* Sign Up! *

Support
Contact Us
Online Help
Handbooks
Domain Status
Man Pages

FAQ
Virtual Servers
Pricing
Billing
Technical

Network
Facilities
Connectivity
Topology Map

Miscellaneous
Server Agreement
Year 2038
Credits
 

USA Flag

 

 

Man Pages


Manual Reference Pages  -  COMBIG.PL (1)

.ds Aq ’

NAME

combig.pl - Combine frequency counts to determine co-occurrence

CONTENTS

SYNOPSIS

Combines (sums) the frequency counts of bigrams made up of the same pair of words in either possible order. It will count the number of time two words occur together in a bigram regardless of which one comes first.

DESCRIPTION

    USAGE



 combig.pl [OPTIONS] BIGRAM



INPUT PARAMETERS
o BIGRAM

Specify a file of bigram counts created by NSP programs count.pl. The entries in BIGRAM will be formatted as follows:



        word1<>word2<>n11 n1p np1



Here, word1 is followed by word2 n11 times. word1 occurs as the 1st word in total n1p bigrams and word2 occurs as the 2nd word in np1 bigrams.

o OPTIONS



 --help



Displays this message.



 --version



Displays the version information.

OUTPUT

combig.pl produces a count of the number of times two words make up a bigram in either order, whereas count.pl produces counts for a single fixed ordering. In other words, combig.pl combines the counts of bigrams that are composed of the same words but in reverse order. While the BIGRAM shows pairs of words forming bigrams, output of combig will show the pairs of words that are co-occurrences or that co-occur irrespective of their order.

e.g. if bigrams



        word1<>word2<>n11 n1p np1
and
        word2<>word1<>m11 m1p mp1



are found in BIGRAM file, then combig.pl treats these as a single unordered bigram



        word1<>word2<>n11+m11 n1p+mp1 np1+m1p



where the new bigram will show a combined contingency table in which the order of words doesn’t matter.



                        word2           ~word2
        ___________________________________________________________
         word1  |       n11+m11         n12+m21     | n1p+mp1
                |                                   |
        ~word1  |       n21+m12         n22+m22-n   | n2p+mp2-n
                ___________________________________________________
                        np1+m1p         np2+m2p-n   |  n
here the entry



o (word1,word2)=n11+m11

shows the number of bigrams having both word1 and word2 in either order

i.e. word1<>word2 + word2<>word1

o (word1,~word2)=n12+m21

shows the number of bigrams having word1 but not word2 at either position

i.e. word1<>~word2 + ~word2<>word1

o (~word1,word2)=n21+m12

shows the number of bigrams having word2 but not word1 at either position

i.e. ~word1<>word2 + word2<>~word1

o (~word1,~word2)=n22+m22

shows the number of bigrams not having word1 nor word2 at either position

i.e. ~word1<>~word2 + ~word2<>~word1 - n

where n=total number of bigrams

The mathematical proof of how the cell counts in the above contingency table are counted is explained in section Proof.

When a bigram appears in only one order i.e.

word1<>word2<>n11 n1p np1

appears but

word2<>word1<>m11 m1p mp1

does not, then the combined bigram will be same as the original bigram that appears. Or in other words,

word1<>word2<>n11 n1p np1

is displayed as it is.

    PROOF OF CORRECTNESS

A bigram word1<>word2<>n11 n1p np1 represents a contingency table



                  word2         ~word2
                --------------------------------------
        word1   n11     |       n12     |       n1p    
                        |               |
        ~word1  n21     |       n22     |       n2p
                --------------------------------------
                np1     |       np2     |       n



while a bigram word2<>word1<>m11 m1p mp1 represents a contingency table



                  word1         ~word1
                -------------------------------------- 
        word2   m11     |       m12     |       m1p
                        |               |
        ~word2  m21     |       m22     |       m2p
                --------------------------------------
                mp1     |       mp2     |       n



Here,



 n11+n12+n21+n22 = n



Also,



 m11+m12+m21+m22 = n



combig.pl combines bigram counts into a single order independant word pair



 word1<>word2<>n11+m11 n12+m21 n21+m12



And the corresponding contingency table will be shown as



                        word2           ~word2 
                -----------------------------------------
        word1   n11+m11   |     n12+m21   |     n1p+mp1
                          |               |
        ~word1  n21+m12   |     n22+m22-n |     n2p+mp2
                -----------------------------------------
                np1+m1p   |     np2+m2p   |     n



The first cell (n11+m11) shows the #bigrams having word1 and word2 (irrespective of their positions) i.e. word1<>word2 or word2<>word1 which is n11+m11.

The second cell (n12+m21) shows the #bigrams having word1 but not word2 at any position i.e. word1<>~word2 or ~word2<>word1 which is n12+m21.

The third cell (n21+m12) shows the #bigrams having word2 but not word1 at any position i.e. ~word1<>word2 or word2<>~word1 which is n21+m12.

The fourth cell (m22+n22-n) shows the #bigrams not having word1 nor word2 at any position which



 = n - (n11+m11) - (n12+m21) - (n21+m12)

 = n - (n11+n12+n21) - (m11+m12+m21)

 = n - (n-n22) - (n-m22)

 = n22 + m22 - n



Alternative proof -



 n22 = m11 + m12 + m21 + X      (a)

 m22 = n11 + n12 + n21 + X      (b)



where X = #bigrams not having either word1 or word2.

as both n22 and m22 have some terms in common which show the bigrams not having either word1 or word2. But,



 m11+m12+m21 = n - m22



Substituting this in eqn (a)



 n22 = n - m22 + X



Or



 X = n22 + m22 - n



Or add (a) and (b) to get



 n22+m22 = (n11+m11) + (n12+m21) + (n21+m12) + 2X



rearranging terms,



 n22+m22 = (n11+n12+n21) + (m11+m12+m21) + 2X



but



 n11+n12+n21 = n - n22 and

 m11+m12+m21 = n - m22



Hence,



 n22+m22 = (n-n22) + (n-m22) + 2X

 2(n22+m22-n) = 2X



Or



 (n22+m22-n) = X



which is the fourth cell count.

    Viewing Bigrams as Graphs

In bigrams, the order of words is important. Bigram word1<>word2 shows that word2 follows word1. Bigrams can be viewed as a directed graph where a bigram word1<>word2 will represent a directed edge e from initial vertex word1 to terminal vertex word2(word1->word2).

In this case,

n11, which is the number of times bigram word1<>word2 occurs, becomes the weight of the directed edge word1->word2.

n1p, which is the number of bigrams having word1 at 1st position, becomes the out degree of vertex word1

and

np1, which is the number of bigrams having word2 at 2nd position, becomes the in degree of vertex word2

combig.pl creates a new list of word pairs from these bigrams such that the order of words can be ignored. Viewed another way, it converts the directed graph of given bigrams to an undirected graph showing new word pairs.

A pair say



        word1<>word2<>n11 n1p np1



shown in the output of combig can be viewed as an undirected edge joining word1 and word2 having weight n11. If we count the degree of vertex word1 it will be n1p and degree of vertex word2 will be np1.

AUTHORS



 Amruta Purandare, pura0010@d.umn.edu
 Ted Pedersen, tpederse@d.umn.edu

 Last update 03/22/04 by ADP



This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784).

BUGS

SEE ALSO

http://www.d.umn.edu/~tpederse/nsp.html

COPYRIGHT

Copyright (c) 2004, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Search for    or go to Top of page |  Section 1 |  Main Index


perl v5.20.3 COMBIG (1) 2008-03-24

Powered by GSP Visit the GSP FreeBSD Man Page Interface.
Output converted with manServer 1.07.