Multilingual automatic text summarizer using statistical approach and extraction

kariminf/allsummarizer

AllSummarizer

An implementation of a research project on automatic text summarization. AllSummarizer generates summaries with an extractive method: each sentence is scored against a set of criteria, the sentences are reordered by score, and the top-scoring ones are included in the summary.
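
As a rough illustration of the score, reorder and select steps described above, here is a minimal Python sketch. It is not AllSummarizer's scoring method (the papers below describe the actual one, which is based on clustering and a modified classification algorithm); the word-frequency score and the function names `split_sentences`, `score_sentences` and `summarize` are assumptions made for this example only.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(sentences):
    # Stand-in score (assumption): mean document-level frequency of the
    # words in each sentence. AllSummarizer uses its own features instead.
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(word for toks in tokens for word in toks)
    return [sum(freq[w] for w in toks) / len(toks) if toks else 0.0
            for toks in tokens]


def summarize(text, max_sentences):
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    # Reorder by score, highest first; ties keep their document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    # Keep only the sentences that rank among the first ones.
    return [sentences[i] for i in ranked[:max_sentences]]


if __name__ == "__main__":
    print(summarize("Cats sleep. Cats eat fish. Dogs bark loudly.", 2))
```

The size budget here is a sentence count; the command line options described further down also allow bytes, characters and words.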

For more documentation check this

You can find more about the method in the paper:

@inproceedings {13-aries-al,
	author = {Aries, Abdelkrime and Oufaida, Houda and Nouali, Omar},
	title = {Using clustering and a modified classification algorithm for automatic text summarization},
	series = {Proc. SPIE},
	volume = {8658},
	number = {},
	pages = {865811-865811-9},
	year = {2013},
	doi = {10.1117/12.2004001},
	URL = { http://dx.doi.org/10.1117/12.2004001}
}

The system's participation in the MultiLing 2015 workshop and a later extension of the method are described in:

@Inbook{15-aries-al,
  author = {Aries, Abdelkrime
            and Zegour, Eddine Djamel
            and Hidouci, Walid Khaled},
  chapter = {AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization},
  title = {Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
  year = {2015},
  publisher = {Association for Computational Linguistics},
  pages = {237--244},
  location = {Prague, Czech Republic},
  url = {http://aclweb.org/anthology/W15-4634}
}
@inproceedings{18-aries-al,
	author    = {Abdelkrime Aries and
	Djamel Eddine Zegour and
	Walid{-}Khaled Hidouci},
	title     = {Exploring Graph Bushy Paths to Improve Statistical Multilingual Automatic Text Summarization},
	booktitle = {Computational Intelligence and Its Applications - 6th {IFIP} {TC}
	5 International Conference, {CIIA} 2018, Oran, Algeria, May 8-10,
	2018, Proceedings},
	pages     = {78--89},
	year      = {2018},
	url       = {https://doi.org/10.1007/978-3-319-89743-1\_8},
	doi       = {10.1007/978-3-319-89743-1\_8},
	timestamp = {Sat, 05 May 2018 23:05:32 +0200},
	biburl    = {https://dblp.org/rec/bib/conf/ciia/AriesZH18},
	bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dependencies:

This project depends on other projects:

  • KToolJa: for file management and plugins
  • LangPi: for text preprocessing; it depends on other libraries in turn

Preprocessing plugins are in the folder "preProcess". For the Hebrew and Thai preprocessing tools, check the LangPi releases. Those two plugins are not licensed under Apache 2.0.

Command line usage

To run from the command line:

  • Jar file: java -jar <jar_name> options
  • Class: java kariminf.as.ui.MonoDoc options

input/output options:

  • -i <input_file>: the input; a file, or a folder for multi-document or variant inputs
  • -o <output_file>: the output; a file, or a folder for multi-document input or when there are multiple output lengths, feature combinations or thresholds
  • -v: variant inputs; the input is a folder that contains files or folders to be summarized.

summary options:

summary unit:

  • -b: the summary size is specified in bytes.
  • -c: the summary size is specified in characters.
  • -w: the summary size is specified in words.
  • -s: the summary size is specified in sentences.
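
The four units can give quite different sizes for the same text, especially in multilingual input where one character may take several bytes. The Python sketch below measures a string in each unit. The helper name `summary_size`, the UTF-8 encoding, whitespace word splitting and the punctuation-based sentence count are all assumptions for illustration; AllSummarizer relies on its language-specific preprocessing plugins for segmentation.

```python
import re


def summary_size(text, unit):
    """Measure text in one unit: 'b' bytes, 'c' chars, 'w' words, 's' sentences."""
    if unit == "b":
        # Assumption: bytes are counted on the UTF-8 encoding.
        return len(text.encode("utf-8"))
    if unit == "c":
        return len(text)
    if unit == "w":
        # Assumption: words are whitespace-separated tokens.
        return len(text.split())
    if unit == "s":
        # Assumption: sentences end with '.', '!' or '?'.
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    raise ValueError(f"unknown unit: {unit!r}")


if __name__ == "__main__":
    sample = "R\u00e9sum\u00e9 ready. Done!"
    for unit in "bcws":
        print(unit, summary_size(sample, unit))
```

For the sample string, the byte count is larger than the character count because each accented letter takes two bytes in UTF-8.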

summary length:

  • -n : defines the number of units to be extracted.
  • -r : a ratio from 1 to 100% that defines the percentage of units to be extracted.

You can specify more than one length by separating the lengths with semicolons.

summarizer options:

  • -f : the features used to score the sentences. Features are separated by commas, for example: tfu,pos. For multiple combinations, use semicolons, for example: tfu,pos;tfb,len
  • -t : a number from 0 to 100 that specifies the clustering threshold. For multiple thresholds, use semicolons, for example: 5;50
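
To make the separator convention concrete (commas inside one feature combination, semicolons between alternatives), here is a small Python sketch that splits such option values. The function names `parse_feature_sets` and `parse_thresholds` are hypothetical and this is not the tool's own parser; it only mirrors the syntax described above.

```python
def parse_feature_sets(value):
    # "tfu,pos;tfb,len" -> [["tfu", "pos"], ["tfb", "len"]]
    # Semicolons separate combinations, commas separate features.
    return [[name.strip() for name in combo.split(",") if name.strip()]
            for combo in value.split(";") if combo.strip()]


def parse_thresholds(value):
    # "5;50" -> [5.0, 50.0]; each value must lie between 0 and 100.
    thresholds = []
    for part in value.split(";"):
        number = float(part)
        if not 0 <= number <= 100:
            raise ValueError(f"threshold out of range: {part!r}")
        thresholds.append(number)
    return thresholds


if __name__ == "__main__":
    print(parse_feature_sets("tfu,pos;tfb,len"))
    print(parse_thresholds("5;50"))
```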

To get help, use -h

Examples of command line

Suppose we have a folder for inputs called "exp":

exp
├── multi
│   ├── M001
│   │   ├── M0010.english
│   │   ├── M0011.english
│   │   └── M0012.english
│   └── M002
│       ├── M0020.english
│       ├── M0021.english
│       └── M0022.english
└── single
    ├── doc1.txt
    └── doc2.txt

single document examples:

the command:

-i "exp/single" -o "exp/output" -l en -t "5-15:5" -n "100;200" -c -f "tfu,pos;tfb,rleng" -v

gives these files:

doc1.txt_0.05_Pos-TFU_100c.txt    doc1.txt_0.1_Pos-TFU_100c.txt     doc2.txt_0.15_Pos-TFU_100c.txt
doc1.txt_0.05_Pos-TFU_200c.txt    doc1.txt_0.1_Pos-TFU_200c.txt     doc2.txt_0.15_Pos-TFU_200c.txt
doc1.txt_0.05_RLeng-TFB_100c.txt  doc1.txt_0.1_RLeng-TFB_100c.txt   doc2.txt_0.15_RLeng-TFB_100c.txt
doc1.txt_0.05_RLeng-TFB_200c.txt  doc1.txt_0.1_RLeng-TFB_200c.txt   doc2.txt_0.15_RLeng-TFB_200c.txt
doc1.txt_0.15_Pos-TFU_100c.txt    doc2.txt_0.05_Pos-TFU_100c.txt    doc2.txt_0.1_Pos-TFU_100c.txt
doc1.txt_0.15_Pos-TFU_200c.txt    doc2.txt_0.05_Pos-TFU_200c.txt    doc2.txt_0.1_Pos-TFU_200c.txt
doc1.txt_0.15_RLeng-TFB_100c.txt  doc2.txt_0.05_RLeng-TFB_100c.txt  doc2.txt_0.1_RLeng-TFB_100c.txt
doc1.txt_0.15_RLeng-TFB_200c.txt  doc2.txt_0.05_RLeng-TFB_200c.txt  doc2.txt_0.1_RLeng-TFB_200c.txt
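
The 24 names above are every combination of 2 documents, 3 thresholds, 2 feature combinations and 2 lengths. The Python sketch below reproduces that grid. Everything in it is inferred from the listing rather than taken from the AllSummarizer source: the reading of "5-15:5" as start-end:step, the threshold being printed divided by 100, the feature display names in `DISPLAY`, and their alphabetical ordering are assumptions, and the function names are hypothetical.

```python
from itertools import product

# Display names as they appear in the listed file names (assumed mapping).
DISPLAY = {"tfu": "TFU", "tfb": "TFB", "pos": "Pos", "rleng": "RLeng"}


def expand_thresholds(spec):
    # Assumption: "5-15:5" means from 5 to 15 inclusive in steps of 5.
    values = []
    for part in spec.split(";"):
        if "-" in part:
            bounds, _, step = part.partition(":")
            start, end = (int(x) for x in bounds.split("-"))
            values.extend(range(start, end + 1, int(step) if step else 1))
        else:
            values.append(int(part))
    return values


def feature_label(combo):
    # Assumption: display names are sorted and joined with "-".
    return "-".join(sorted(DISPLAY[name] for name in combo.split(",")))


def output_names(docs, thresholds, features, lengths, unit):
    # One file per (document, threshold, feature combination, length).
    return [
        f"{doc}_{th / 100}_{feature_label(combo)}_{length}{unit}.txt"
        for doc, th, combo, length in product(
            docs, expand_thresholds(thresholds),
            features.split(";"), lengths.split(";"))
    ]


if __name__ == "__main__":
    for name in output_names(["doc1.txt", "doc2.txt"], "5-15:5",
                             "tfu,pos;tfb,rleng", "100;200", "c"):
        print(name)
```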

the command:

-i "exp/single/doc1.txt" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos"

gives these files:

doc1.txt_0.05_Pos-TFU_10%c.txt  doc1.txt_0.05_Pos-TFU_5%c.txt

multi-document examples:

the command:

-i "exp/multi" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos" -v -m

gives these files:

M001_0.05_Pos-TFU_10%c.txt  M001_0.05_Pos-TFU_5%c.txt  
M002_0.05_Pos-TFU_10%c.txt  M002_0.05_Pos-TFU_5%c.txt

License

Copyright (C) 2012-2017 Abdelkrime Aries

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.