Doctoral meeting: 'Efficient query over large datasets of analytical chemistry'

One of the main problems faced by the chemical and pharmaceutical industries when discovering a new compound is identifying and eliminating known replicas at an early stage of the research. To make this identification process effective, it is necessary to use chemoinformatic techniques capable of performing fast searches in large databases of molecular data, which, in addition to conventional data, include molecular structures, spectra and chromatograms. This thesis provides solutions for storing, indexing and searching both analytical information and molecular structures.

The main objective of the work is substructure searching: deciding whether a query structure is a substructure of another structure. Molecular structures are treated as graphs, and searches are solved following the filter-then-verify paradigm, where an indexing technique is first used to obtain candidate results, and a subgraph isomorphism algorithm is then applied to those candidates in a verification stage. Three new indexing techniques that leverage bitmaps were proposed to solve this problem, and they have been implemented in both centralized and distributed architectures.
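
As an illustration of the filter-then-verify paradigm, the following minimal Python sketch filters candidates with a bitmap fingerprint and then verifies them with a simple backtracking subgraph search. It is not one of the three indexing techniques proposed in the thesis. The graph representation (a dictionary of atom labels plus a list of bonds), the 64-bit fingerprint width, the feature set (atom labels and label pairs of bonded atoms) and all function names are assumptions made only for this example.

```python
import zlib

NUM_BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def fingerprint(atoms, bonds):
    """Bitmap with one bit set per feature (atom label or bonded label pair)."""
    feats = set(atoms.values())
    feats |= {"-".join(sorted((atoms[a], atoms[b]))) for a, b in bonds}
    bits = 0
    for feat in feats:
        bits |= 1 << (zlib.crc32(feat.encode()) % NUM_BITS)
    return bits


def adjacency(atoms, bonds):
    """Undirected adjacency sets for a molecular graph."""
    adj = {a: set() for a in atoms}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def is_subgraph(q_atoms, q_bonds, t_atoms, t_bonds):
    """Verification stage: backtracking search for a label-preserving,
    injective mapping of query atoms that keeps every query bond."""
    q_adj = adjacency(q_atoms, q_bonds)
    t_adj = adjacency(t_atoms, t_bonds)
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-mapped neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def build_index(molecules):
    """Store each molecule together with its precomputed bitmap."""
    return {name: (mol, fingerprint(*mol)) for name, mol in molecules.items()}


def substructure_search(query, index):
    """Filter by bitmap containment, then verify the surviving candidates."""
    q_fp = fingerprint(*query)
    candidates = [n for n, (_, fp) in index.items() if q_fp & fp == q_fp]
    return [n for n in candidates if is_subgraph(*query, *index[n][0])]


# Toy database (hydrogens omitted): ethanol C-C-O, dimethyl ether C-O-C, ethane C-C.
molecules = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "dimethyl_ether": ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]),
    "ethane": ({0: "C", 1: "C"}, [(0, 1)]),
}
index = build_index(molecules)
c_o_query = ({0: "C", 1: "O"}, [(0, 1)])
hits = substructure_search(c_o_query, index)
```

The filter is safe in the sense that it never discards a true match: every feature of the query also occurs in any structure that contains it, so the query bitmap is a subset of the target bitmap. Hash collisions can only let extra candidates through, and the verification stage removes them.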

Supervisors: José Ramón Ríos Viqueira and Tomás Fernández Pena