14
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      WikiM: Metapaths based Wikification of Scientific Abstracts

      Preprint
      , , ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          In order to disseminate the exponential extent of knowledge being produced in the form of scientific publications, it would be best to design mechanisms that connect it with already existing rich repository of concepts -- the Wikipedia. Not only does it make scientific reading simple and easy (by connecting the involved concepts used in the scientific articles to their Wikipedia explanations) but also improves the overall quality of the article. In this paper, we present a novel metapath based method, WikiM, to efficiently wikify scientific abstracts -- a topic that has been rarely investigated in the literature. One of the prime motivations for this work comes from the observation that, wikified abstracts of scientific documents help a reader to decide better, in comparison to the plain abstracts, whether (s)he would be interested to read the full article. We perform mention extraction mostly through traditional tf-idf measures coupled with a set of smart filters. The entity linking heavily leverages on the rich citation and author publication networks. Our observation is that various metapaths defined over these networks can significantly enhance the overall performance of the system. For mention extraction and entity linking, we outperform most of the competing state-of-the-art techniques by a large margin arriving at precision values of 72.42% and 73.8% respectively over a dataset from the ACL Anthology Network. In order to establish the robustness of our scheme, we wikify three other datasets and get precision values of 63.41%-94.03% and 67.67%-73.29% respectively for the mention extraction and the entity linking phase.

          Related collections

          Most cited references10

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Overview of BioCreAtIvE task 1B: normalized gene lists

          Background Our goal in BioCreAtIve has been to assess the state of the art in text mining, with emphasis on applications that reflect real biological applications, e.g., the curation process for model organism databases. This paper summarizes the BioCreAtIvE task 1B, the "Normalized Gene List" task, which was inspired by the gene list supplied for each curated paper in a model organism database. The task was to produce the correct list of unique gene identifiers for the genes and gene products mentioned in sets of abstracts from three model organisms (Yeast, Fly, and Mouse). Results Eight groups fielded systems for three data sets (Yeast, Fly, and Mouse). For Yeast, the top scoring system (out of 15) achieved 0.92 F-measure (harmonic mean of precision and recall); for Mouse and Fly, the task was more difficult, due to larger numbers of genes, more ambiguity in the gene naming conventions (particularly for Fly), and complex gene names (for Mouse). For Fly, the top F-measure was 0.82 out of 11 systems and for Mouse, it was 0.79 out of 16 systems. Conclusion This assessment demonstrates that multiple groups were able to perform a real biological task across a range of organisms. The performance was dependent on the organism, and specifically on the naming conventions associated with each organism. These results hold out promise that the technology can provide partial automation of the curation process in the near future.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            LitInspector: literature and signal transduction pathway mining in PubMed abstracts

            LitInspector is a literature search tool providing gene and signal transduction pathway mining within NCBI's PubMed database. The automatic gene recognition and color coding increases the readability of abstracts and significantly speeds up literature research. A main challenge in gene recognition is the resolution of homonyms and rejection of identical abbreviations used in a ‘non-gene’ context. LitInspector uses automatically generated and manually refined filtering lists for this purpose. The quality of the LitInspector results was assessed with a published dataset of 181 PubMed sentences. LitInspector achieved a precision of 96.8%, a recall of 86.6% and an F-measure of 91.4%. To further demonstrate the homonym resolution qualities, LitInspector was compared to three other literature search tools using some challenging examples. The homonym MIZ-1 (gene IDs 7709 and 9063) was correctly resolved in 87% of the abstracts by LitInspector, whereas the other tools achieved recognition rates between 35% and 67%. The LitInspector signal transduction pathway mining is based on a manually curated database of pathway names (e.g. wingless type), pathway components (e.g. WNT1, FZD1), and general pathway keywords (e.g. signaling cascade). The performance was checked for 10 randomly selected genes. Eighty-two per cent of the 38 predicted pathway associations were correct. LitInspector is freely available at http://www.litinspector.org/.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              NetiNeti: discovery of scientific names from text using machine learning methods

              Background A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. Results We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central’s full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Conclusions We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
                Bookmark

                Author and article information

                Journal
                2017-05-09
                Article
                1705.03264
                feb06fbc-a913-440e-b4f3-6da18a97660b

                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata
                cs.IR

                Information & Library science
                Information & Library science

                Comments

                Comment on this article