Research

Category: Research

Research Review: An Approach for Automatic Topic Detection and Recognition Using Term Frequency-Inverse Document Frequency and K-Nearest Neighbour Algorithms
2015-01-30
Time 1000 until 1100
Meeting Room 7th Floor
Ammar Ismael Kadhim
Assoc. Prof. Dr. Cheah Yu-N
Dr. Nurul Hashimah Ahamed Hassain Malim
Topic detection and recognition are significant for social network users because of their vital role in user trend analysis. Moreover, every new wave of outbreak reveals rapid evolution in sophistication, detection, speed, and damage as the search process tries to detect the various topics. Unfortunately, current topic detection research has not advanced at the same pace. Most topic detection approaches cannot deal intelligently with different topics such as Politics, Education, Health, Marketing, Music, News & Media, Recreation & Sports, Computers & Technology, Pets, Food, Family, and others. This study establishes an understanding of topic detection and of topic contents. Its main objective is to detect and recognize topics and to classify them, based on their contents, into one or more predefined classes using five stages, with several algorithms applied at each stage. The first stage prepares the text documents by removing non-informative features. The second stage performs statistical language modeling, which involves word co-occurrence, a statistical bag of words, and mutual information. The third stage reduces high-dimensional data to a lower dimension through feature extraction using Boolean and TF-IDF (term frequency-inverse document frequency) weighting methods and feature selection using singular value decomposition (SVD) and the cosine similarity score. Finally, machine learning techniques are applied in two ways: unsupervised machine learning groups the data into one or more clusters using the k-means algorithm, and supervised machine learning classifies the documents into different topics using the k-nearest neighbors (KNN) algorithm. Four datasets with varying numbers of documents were used in this study. The first set is the Reuters-21578 text categorization test collection, the second set consists of BBC News and BBC Sport, and further data were collected from Twitter using an application programming interface (API). The experimental results show that the accuracy for BBC News and BBC Sport is 94.67 and 95.00, respectively, using k-means, while the accuracy is approximately 96.2 using KNN on the same dataset. These results indicate that supervised machine learning offers higher efficiency and accuracy in topic detection.
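The TF-IDF weighting, cosine similarity, and k-nearest neighbors steps named in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the six toy sentences, their labels, the smoothed IDF formula, and all function names are assumptions made for illustration, and the preprocessing, SVD, and k-means stages are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, no stop-word removal.
    return text.lower().split()


def fit_tfidf(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: log(N / df) + 1, so terms found in every document keep a small weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [weigh(toks, idf) for toks in tokenized]
    return vectors, idf


def weigh(tokens, idf):
    """Weight known tokens by relative term frequency times IDF."""
    known = [t for t in tokens if t in idf]
    counts = Counter(known)
    return {t: (c / len(known)) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(text, vectors, labels, idf, k=3):
    """Label a new text by majority vote among its k most similar documents."""
    query = weigh(tokenize(text), idf)
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, not taken from Reuters-21578, BBC, or Twitter.
DOCS = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the coach praised the team after the game",
    "parliament passed the new budget bill",
    "the minister defended the election result",
    "voters backed the government in the election",
]
LABELS = ["sport", "sport", "sport", "politics", "politics", "politics"]

VECTORS, IDF = fit_tfidf(DOCS)
print(knn_predict("the striker scored in the football match", VECTORS, LABELS, IDF))
print(knn_predict("the minister defended the budget", VECTORS, LABELS, IDF))
```

With this toy corpus, the first query is closest to the sport documents and the second to the politics documents. A real pipeline would add the stop-word removal and dimensionality reduction that the abstract describes before classification.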
N/A
none
2014/15
1
Mohd Redzuan Asmi
 
Research Review: Enhanced Approaches for Sindhi and Multi-Script Optical Character Recognition
2014-12-10
Time 1000 until 1130
Meeting Room 7th Floor
Dil Nawaz
Prof. Dr. Abdullah Zawawi Hj Talib
N/A
Optical Character Recognition (OCR), an integral part of machine vision and image processing, biomedical imaging, language processing, and speech recognition, poses many challenging problems. OCR systems for non-cursive scripts have achieved perfection, whereas OCR for cursive languages still needs attention. Work on Sindhi OCR is still in its infancy, and there is no complete OCR system for the language, which is based on the Arabic script and spoken by over 60 million people in Pakistan and other parts of the world. No text image database was available for testing and training on Sindhi characters, and most text image databases are created for a single script only. This research studies the challenges that the Sindhi script poses for OCR. A huge database containing 4 billion words and 15 billion characters is created with custom-built software for testing and training on the Sindhi script, together with a multi-script database comprising multi-billion words and characters for 84 languages, suitable for Sindhi and multi-script OCR on a single platform. Among the languages that adopt the Arabic script, Sindhi has the largest extension of the original script. Therefore, this research proposes an enhanced segmentation algorithm and enhanced feature extraction algorithms for Sindhi that can also be used for other scripts. The segmentation algorithm, which is based on energy level, produces good results and also segments characters of other scripts. Zoning-based feature extraction is applied both individually and in combination to extract features from characters of Sindhi and other scripts. An integrated Sindhi OCR and multi-script OCR system is developed in this research. The enhanced segmentation and feature extraction algorithms produced good results on Sindhi and multi-script characters. The integrated OCR obtained a recognition rate of 89% for Sindhi and recognition rates of 90.33% to 99.90% for some of the other scripts, tested on a selected subset of the database created with the custom-built software. The result is working software that recognizes some of the languages and scripts and can easily be extended to recognize more scripts. The database for Sindhi and other scripts can easily be enlarged by adding more data and creating images for testing and training on these scripts.
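Zoning-based feature extraction, in its common generic form, can be sketched as follows. The abstract does not say which zone statistic the thesis uses, so the foreground-pixel density chosen here, the function name `zoning_features`, and the tiny 4 x 4 glyph are assumptions for illustration only.

```python
def zoning_features(image, zone_rows, zone_cols):
    """Split a binary glyph image into a grid of zones and return, for each
    zone in row-major order, the fraction of foreground (1) pixels.

    `image` is a list of equal-length rows containing 0 or 1.
    """
    height = len(image)
    width = len(image[0])
    features = []
    for zr in range(zone_rows):
        top = zr * height // zone_rows
        bottom = (zr + 1) * height // zone_rows
        for zc in range(zone_cols):
            left = zc * width // zone_cols
            right = (zc + 1) * width // zone_cols
            pixels = [image[r][c]
                      for r in range(top, bottom)
                      for c in range(left, right)]
            # An empty zone (grid finer than the image) contributes 0.0.
            features.append(sum(pixels) / len(pixels) if pixels else 0.0)
    return features


# Hypothetical 4 x 4 binary glyph, not taken from the Sindhi database.
GLYPH = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]

print(zoning_features(GLYPH, 2, 2))  # [1.0, 0.0, 0.0, 0.75]
```

The resulting fixed-length vector can be passed to any classifier. A combined approach, as mentioned in the abstract, could concatenate vectors computed at several grid sizes, although the exact combination used in the thesis is not stated.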
N/A
none
2014/15
1
N/A
 
Proposal Review: Performance and Reliability Awareness Schema for Nand Flash Memory Based Solid State Disk
2015-01-26
Time 1000 until 1100
Meeting Room 7th Floor
Ahmed I. N. Salibi
Assoc. Prof. Dr. Putra Sumari
N/A
By 2020, the total volume of data is expected to double every two years, which means 5 terabytes of data for every person on Earth. The difficulty of storing and fetching the required data from data centers and servers will increase accordingly. As a result, significant attention has been paid to the flash memory-based Solid State Drive (SSD), which has made it feasible to replace the Hard Disk Drive (HDD) currently used as a storage unit across the world. Unlike traditional disks, an SSD uses semiconductor chips to store data. This structure has distinctive technical characteristics, including low power consumption, shock resistance, and high random-access performance, which can overcome the shortcomings of magnetic disks. However, flash memory, the basic unit of an SSD, has many distinctive characteristics that lead to various challenges. Flash memory does not support in-place updates: a write operation can only be performed on an empty or erased unit, which makes it more time-consuming. Moreover, each storage unit has a limited number of erase cycles. In this research, a new schema called Performance and Reliability Awareness (PRA) will be proposed to (i) increase the reliability and performance of SSDs and (ii) combine the efficient features of SSDs and HDDs in a hybrid storage system for efficient data processing. The suitability of the proposed schema will be demonstrated with two widely used SSD simulation tools, the DiskSim and FlashSim simulators, in terms of effectiveness and efficiency compared with other state-of-the-art techniques.
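The two flash constraints described above, no in-place update and a limited number of erase cycles, can be illustrated with a toy model. This is not the proposed PRA schema and it does not use DiskSim or FlashSim: the class `TinyFlash`, its page and block sizes, and its simplistic reclaim policy are all assumptions for illustration.

```python
class TinyFlash:
    """Toy flash model: a page can be written only while it is free, and
    pages become free again only when their whole block is erased."""

    def __init__(self, num_blocks=2, pages_per_block=2):
        self.pages_per_block = pages_per_block
        self.state = [["free"] * pages_per_block for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks  # wear indicator per block
        self.mapping = {}  # logical page -> (block, page)

    def _find_free(self):
        for b, block in enumerate(self.state):
            for p, status in enumerate(block):
                if status == "free":
                    return b, p
        return None

    def _reclaim(self):
        # Assumption: pick the block with the most invalid pages, erase it,
        # and rewrite its still-valid pages into the same block.
        victim = max(range(len(self.state)),
                     key=lambda b: self.state[b].count("invalid"))
        if "invalid" not in self.state[victim]:
            raise RuntimeError("device full: nothing to reclaim")
        survivors = [lp for lp, (b, _) in self.mapping.items() if b == victim]
        self.state[victim] = ["free"] * self.pages_per_block
        self.erase_counts[victim] += 1
        for p, lp in enumerate(survivors):
            self.state[victim][p] = "valid"
            self.mapping[lp] = (victim, p)

    def write(self, logical_page):
        """Out-of-place update: invalidate the old copy, then use a free page."""
        old = self.mapping.pop(logical_page, None)
        if old is not None:
            self.state[old[0]][old[1]] = "invalid"
        slot = self._find_free()
        if slot is None:
            self._reclaim()
            slot = self._find_free()
        block, page = slot
        self.state[block][page] = "valid"
        self.mapping[logical_page] = (block, page)


flash = TinyFlash(num_blocks=2, pages_per_block=2)
for _ in range(5):
    flash.write(0)  # rewrite the same logical page five times
print(flash.erase_counts)  # [1, 0]
```

Rewriting one logical page five times consumes all four physical pages and forces a block erase on the fifth write, which is the extra latency and wear that the abstract refers to. Real flash translation layers add wear leveling and more careful garbage collection than this sketch.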
N/A
none
2014/15
1
Mohd Redzuan Asmi
 
Research Seminar: Vulgarisation of Natural Language Processing
2014-12-04
Time 1500 until 1630
Meeting Room 7th Floor
Assoc. Prof. Dr. Bali Ranaivo
N/A
N/A
One of the senses of the word "vulgarisation" is "the act of making something attractive to the general public" (WordNet 2.1). This talk will attempt to present, in layman's terms, the main ideas behind the term "natural language processing" (NLP). After this presentation, which should not exceed 45 minutes, it is hoped that NLP will no longer seem a mysterious, difficult, or insignificant subject.
N/A
none
2014/15
1
N/A
 

Contact Us

Home
Address:
School of Computer Sciences
Universiti Sains Malaysia
11800 USM, Penang, Malaysia
Tel: (+604) 653 3888 ext. 3647/3610
Fax: (+604) 657 3335
