Semantic Big Data Analytics


Spinning up a Hadoop cluster by lunch and delivering a big "whoa" by dinner. Just another day @ AI.

Both "semantics" and "big data" are not just buzzwords for us at AI, but fundamental approaches to handling large datasets containing lexically heterogeneous but semantically equivalent entities. Our Health Vocabulary API (HaVoc) provides RESTful access to biomedical terminologies to perform specialized semantic operations, such as a) linking two records, one coded as "hypertension" and the other as "high blood pressure", or b) grouping records by semantic concept hierarchy (for example, retrieving all records for conditions under the group cardiovascular diseases).
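The "hypertension" vs. "high blood pressure" case above can be sketched as a concept-normalization step: map each surface term to a shared concept identifier and compare the identifiers. This is a minimal illustration of the idea, not HaVoc's actual API; the synonym table below is a tiny hand-built assumption.

```python
# Illustrative sketch only: lexically different but semantically equivalent
# terms normalize to the same concept ID, so records can be linked on the ID.
# The synonym-to-concept map here is a toy assumption, not a real terminology.
SYNONYM_TO_CONCEPT = {
    "hypertension": "C0020538",
    "high blood pressure": "C0020538",
    "htn": "C0020538",
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
}

def same_concept(term_a: str, term_b: str) -> bool:
    """Two terms match if both normalize to the same concept ID."""
    a = SYNONYM_TO_CONCEPT.get(term_a.strip().lower())
    b = SYNONYM_TO_CONCEPT.get(term_b.strip().lower())
    return a is not None and a == b

print(same_concept("Hypertension", "high blood pressure"))  # True
```

A production terminology service resolves millions of synonyms against curated vocabularies rather than a hand-built dictionary, but the comparison step is the same.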

Beyond the semantics, we have developed a proprietary, in-house data processing framework called Garbanzo to support the entire pipeline: ETL, integration, analytics and querying. Our record linkage and de-duplication algorithms run on large Apache Spark clusters and produce blazingly fast results with high specificity and sensitivity.

Like all other endeavors, we have much to look forward to as we grow our capabilities to integrate clinical, research and personal self-quantification datasets to produce meaningful insights.

Distributed Record Linkage/De-duplication
    • Linking biomedical entities such as healthcare providers (for example, J Smith, JM Smith, J Smith MD) across different datasets can become complex very quickly when dealing with dirty, incomplete, multi-attribute data from different databases.
    • We have developed data processing frameworks and record linkage algorithms to support large-scale record linking and data analytics.
    • Contact us to learn more about our "Select Star" Semantic and Big Data Services.
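The provider-name problem above (J Smith vs. J Smith MD) comes down to a similarity score per attribute. A minimal single-attribute sketch using only the Python standard library, with an assumed threshold of 0.8 — real linkage combines many attributes and runs distributed on Spark:

```python
# Minimal fuzzy name-matching sketch for record linkage. Production systems
# compare many attributes (address, specialty, identifiers) and use blocking
# to scale; this shows only the core pairwise-similarity step.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two provider names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_provider(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a candidate match when similarity clears the threshold."""
    return name_similarity(a, b) >= threshold

print(likely_same_provider("J Smith", "J Smith MD"))  # True
print(likely_same_provider("J Smith", "A Jones"))     # False
```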
Data Crawling and Cleaning
    • We have developed a distributed web data crawling and parsing system that scales over billions of web pages and hundreds of servers.
    • We have used this system to track online reviews and ratings for various types of entities.
    • We have also developed several automated data cleaning functions and manual review systems that feed back into our data processing framework, Garbanzo.
    • Contact us to learn more about our data crawling services.
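To make "automated data cleaning functions" concrete, here is a toy example of the kind of normalization step such a library accumulates — the function name and rules are illustrative assumptions, not Garbanzo's actual code:

```python
# Illustrative cleaning function of the sort that crawled, dirty entity
# names need before linkage: collapse whitespace, strip stray symbols,
# normalize casing. Rules here are a simplified assumption.
import re

def clean_entity_name(raw: str) -> str:
    """Collapse whitespace, drop unexpected punctuation, and title-case."""
    text = re.sub(r"\s+", " ", raw).strip()     # collapse runs of whitespace
    text = re.sub(r"[^\w\s&.,'-]", "", text)    # drop unexpected symbols
    return text.title()

print(clean_entity_name("  acme   HEALTH   clinic!! "))  # Acme Health Clinic
```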
Data Analytics APIs
    • Using our advanced knowledge engineering and data modeling expertise, we are able to develop highly generic data analytics and APIs over the processed data.
    • We use Elasticsearch extensively to provide a scalable API endpoint for client apps.
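As a sketch of how an Elasticsearch-backed endpoint translates an API request into a search, the snippet below builds a standard Elasticsearch query body. The index field names (`concept_ids`, `record_date`) are hypothetical, not our actual schema:

```python
# Hedged sketch: an API endpoint turning a concept-ID lookup into an
# Elasticsearch query body (standard query DSL). Field names are
# illustrative assumptions, not a real schema.
import json

def build_condition_query(concept_id: str, size: int = 20) -> str:
    """Build an Elasticsearch query for records tagged with a concept ID."""
    body = {
        "query": {"term": {"concept_ids": concept_id}},
        "size": size,
        "sort": [{"record_date": {"order": "desc"}}],
    }
    return json.dumps(body)

print(build_condition_query("C0007222"))
```

The endpoint would POST this body to the index's `_search` API; keeping query construction in one place lets the same endpoint serve many client apps.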
Natural Language Processing
    • We have deep expertise in natural language processing methods such as biomedical entity and relationship extraction, identification of negation utterances, and determining semantic equivalence of concepts and entities.
    • Our award-winning clinical trial finder, Ask Dory, uses NLP to parse clinical trial inclusion and exclusion criteria to provide the most advanced matching results for patient recruitment. Read our papers on this awesome technology.
    • Contact us to see how we can apply NLP to your datasets.
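Negation identification, mentioned above, can be illustrated with a toy NegEx-style rule: treat an entity mention as negated if a negation cue appears shortly before it. Clinical-grade systems use much richer cue lists and scope rules; this sketch, with an assumed 30-character window, only shows the idea:

```python
# Toy NegEx-style negation check: an entity is flagged as negated when a
# negation cue occurs in a short window before the mention. Cue list and
# window size are simplified assumptions.
NEGATION_CUES = ("no ", "denies ", "without ", "negative for ")

def is_negated(sentence: str, entity: str) -> bool:
    """True if a negation cue precedes the entity mention in the sentence."""
    s = sentence.lower()
    idx = s.find(entity.lower())
    if idx == -1:
        return False
    window = s[max(0, idx - 30):idx]  # text just before the mention
    return any(cue in window for cue in NEGATION_CUES)

print(is_negated("Patient denies chest pain.", "chest pain"))   # True
print(is_negated("Patient reports chest pain.", "chest pain"))  # False
```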

Products & Solutions

Research Star

ResearchTracker combines public data from NIH grants (the RePORTER database), PubMed and Sunshine Act Open Payments data. It integrates research investigators and sites using record linkage algorithms and provides a dashboard of the latest research activity at a given research university. It also enables aggregation of various statistics for any medical disease or treatment keyword in delayed real time.

Investigator Databank

As a client solution, we developed the Investigator Databank to enable pharma companies including Pfizer, Merck, Lilly, Janssen and Novartis to share site and investigator metrics with each other. The Databank uses our data matching, record linkage and semantic search technology to integrate and query millions of investigator and site records across these member companies, in a manner that honors their privacy agreements. We also extended the solution to develop the TransCelerate Investigator Registry, used by 21 pharma companies across the world.


The Garbanzo framework is a data processing toolkit that underpins all our big data work. It was developed through years of fine-tuning and abstracting repetitive tasks in data science, and we have amassed a library of functions to clean, organize and enhance data and to integrate with external databases.
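A library of small reusable functions like Garbanzo's naturally suggests a composition pattern: chain independent steps into one transformation. This is a generic sketch of that pattern, not Garbanzo's actual API; the step names are assumptions:

```python
# Minimal function-pipeline sketch: small, reusable cleaning steps composed
# left-to-right into a single callable. Illustrative of the pattern only.
from functools import reduce
from typing import Callable

def pipeline(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Compose cleaning steps left-to-right into one function."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

# Example pipeline: trim, lowercase, collapse internal whitespace.
clean = pipeline(str.strip, str.lower, lambda s: " ".join(s.split()))

print(clean("  PFIZER   Inc.  "))  # pfizer inc.
```

Keeping each step tiny and side-effect-free is what makes such a library reusable across ETL, integration and analytics jobs.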


Left Brain + Right Brain = Total Package for You.