International Journal of Engineering Trends and Technology

Research Article | Open Access

Volume 45 | Number 1 | Year 2017 | Article Id. IJETT-V45P220 | DOI : https://doi.org/10.14445/22315381/IJETT-V45P220

Published Date Extraction System a Semi-Supervised Approach of Extraction


Nitin Kumar, Abhishek Pradhan

Citation :

Nitin Kumar, Abhishek Pradhan, "Published Date Extraction System a Semi-Supervised Approach of Extraction," International Journal of Engineering Trends and Technology (IJETT), vol. 45, no. 1, pp. 87-92, 2017. Crossref, https://doi.org/10.14445/22315381/IJETT-V45P220

Abstract

Extracting meaningful or relevant dates, such as the published date, from an unstructured document is a vital part of information extraction and data mining. Current approaches use DOM (Document Object Model) manipulation for HTML documents, or regular expressions and rules applied to metadata, which are not very accurate across different types of publication. Recent work in this area has focused mainly on web pages and HTML pages, with good accuracy. Our approach builds on those works for HTML and, in addition, extensively covers PDF documents, blog articles, and websites. It supports several types of documents, such as news articles, patents, scientific articles and journals in PDF format, blogs, websites, and more. It can also learn over time and feed what it learns back into the system as a trained model. Our algorithm comprises both supervised and unsupervised steps, and it uses natural language processing techniques.
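For illustration only, the baseline that the abstract attributes to existing approaches (rules over metadata plus regular expressions) might look like the minimal Python sketch below. It is not the authors' semi-supervised method. The list of metadata keys, the restriction to ISO-style dates, and the names `META_KEYS`, `MetaDateParser`, and `extract_published_date` are all assumptions made for this example.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Hypothetical set of metadata keys that often carry a publication date.
META_KEYS = {"article:published_time", "datepublished", "date", "dc.date", "pubdate"}

# Assumption: only ISO-style dates (YYYY-MM-DD) are recognised in this sketch.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags whose key looks date-related."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("property") or attr.get("name") or attr.get("itemprop") or "").lower()
        if key in META_KEYS and attr.get("content"):
            self.candidates.append(attr["content"])


def extract_published_date(html):
    """Return the first valid ISO-style date from metadata, else from the raw text, else None."""
    parser = MetaDateParser()
    parser.feed(html)
    # Metadata candidates are tried first; the raw document is the fallback.
    for text in parser.candidates + [html]:
        for match in ISO_DATE.finditer(text):
            try:
                return date(*map(int, match.groups()))
            except ValueError:
                continue  # e.g. 2017-13-45 matches the pattern but is not a real date
    return None
```

A fixed rule set like this only works when a page exposes one of the expected keys or formats, which is consistent with the abstract's remark that such approaches are not very accurate across different types of publication.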

Keywords

CRF modelling for Segment extraction, data mining, information extraction, published date

