BioVec-Ita: Biomedical Word Embeddings for the Italian Language

Bavaro, Marcello; Dolci, Tommaso; Piantella, Davide

In healthcare, digital systems that collect clinical care notes, health service reports and patient records generate terabytes of data, much of it in textual form. These datasets become a valuable asset only if the knowledge they carry is extracted with appropriate artificial intelligence techniques, specifically natural language processing (NLP). Unfortunately, most existing NLP tools target the English language, while local administrations and hospitals typically work in their native language, so tools that can also process biomedical text written in those languages are needed. Word embeddings are a popular and powerful NLP technique for extracting semantics from textual data and could help address this problem, but for the Italian language no such tools specialized in the biomedical field exist. In this paper we propose BioVec-Ita, a new word embedding model for Italian, specialized in the biomedical field and built with Word2vec, a flexible model for semantic representation that can be easily integrated with other pipelines. We also evaluate how well our word embedding model captures the semantic similarity of biomedical terms, using three widely used test datasets translated into Italian.
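As an illustration of what evaluating semantic similarity can look like, the sketch below scores term pairs by the cosine similarity of their vectors and compares the resulting ranking with human ratings through a Spearman rank correlation. This is a common setup for such benchmarks and is assumed here, not taken from the paper. The vectors, the Italian terms, the ratings and every function name are hypothetical toy values chosen only to make the example runnable.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(embeddings, rated_pairs):
    """Correlate model similarities with human ratings for the given pairs."""
    model_scores = [
        cosine_similarity(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs
    ]
    human_scores = [rating for _, _, rating in rated_pairs]
    return spearman(model_scores, human_scores)


# Hypothetical 3-dimensional toy vectors, NOT real BioVec-Ita embeddings.
toy_embeddings = {
    "cuore": [0.9, 0.1, 0.0],
    "miocardio": [0.8, 0.2, 0.1],
    "rene": [0.1, 0.9, 0.2],
    "frattura": [0.0, 0.1, 0.9],
}

# Hypothetical human similarity ratings, invented for illustration only.
toy_rated_pairs = [
    ("cuore", "miocardio", 3.8),
    ("cuore", "rene", 2.0),
    ("cuore", "frattura", 1.0),
]

correlation = evaluate(toy_embeddings, toy_rated_pairs)
print(f"Spearman correlation on toy data: {correlation:.3f}")
```

With real embeddings, the dictionary would be replaced by vectors looked up from the trained model, and pairs containing out-of-vocabulary terms would need to be skipped or handled explicitly.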
