Machine Learning Pipeline for Biochemistry


Our client is a major international pharmaceutical company, which conducts research and development activities related to a wide range of human medical disorders, including mental illness, neurological disorders, anesthesia and analgesia, gastrointestinal disorders, fungal infection, allergies and cancer.


Deliver a very fast and highly scalable calculation pipeline that uses different machine learning algorithms to learn and predict chemical compound activity, reducing the number of real experiments required. The solution should handle different types and formats of input and output data through an additional abstraction layer that suits the needs of our customer and of the other scientific groups involved in the development process.

The pipeline should be cross-platform and run at least under Windows and Linux. It should be possible to distribute complex calculations at both the CPU-core level and the cluster level. An additional challenge is the cooperation between the main development team and the scientific teams, which include two universities located in Belgium and Austria. This requires the ability to work as one team in a multi-language environment that includes C, Java, Python and R, and the skill to integrate all parts into the pipeline.


C++ was chosen as the main development language because it allows development of high-performance code and is widely used in the scientific community. At the same time, we integrate into the pipeline other solutions that were originally written in C, Java, Python and R. To suit the needs of the different scientific groups, we plan to use SWIG to allow interfacing with the pipeline in the future.

The whole pipeline is written in C++. The only exceptions are a Java service that is used for fingerprint calculation and an R molecule similarity package that serves as an interface library for the corresponding kernel similarity metrics implemented in C++. The Java service relies on the jCompoundMapper library, which is used for fingerprinting chemical compounds. This library had a number of minor bugs, which required fixes to the original version.

Molecule similarity methods rely heavily on chemistry libraries. We use OEChem TK, a commercial C++ toolkit provided by OpenEye that is also available in Python. Because this library requires a license, we developed an additional abstraction layer to allow interfacing with other chemistry libraries. We plan to include the RDKit library, which also has C++ and Python versions.
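The abstraction layer over chemistry libraries can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the names ChemistryBackend, PlaceholderBackend, BackendRegistry and symbolCount are hypothetical, and the placeholder backend does no real chemistry. A real backend would wrap calls to OEChem TK or RDKit behind the same interface.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface: pipeline code depends only on this type, never on
// a specific chemistry toolkit.
class ChemistryBackend {
public:
    virtual ~ChemistryBackend() = default;
    virtual std::string name() const = 0;
    // Stand-in for a real toolkit operation on a molecule string.
    virtual std::size_t symbolCount(const std::string& molecule) const = 0;
};

// Placeholder backend for illustration only: it counts alphabetic characters
// and does not interpret the molecule chemically.
class PlaceholderBackend : public ChemistryBackend {
public:
    std::string name() const override { return "placeholder"; }
    std::size_t symbolCount(const std::string& molecule) const override {
        std::size_t n = 0;
        for (unsigned char c : molecule) {
            if (std::isalpha(c)) ++n;
        }
        return n;
    }
};

// Maps a backend name (for example, taken from configuration) to a factory,
// so that a license-free backend can be selected without touching callers.
class BackendRegistry {
public:
    using Factory = std::function<std::unique_ptr<ChemistryBackend>()>;

    void add(const std::string& key, Factory factory) {
        factories_[key] = std::move(factory);
    }

    std::unique_ptr<ChemistryBackend> create(const std::string& key) const {
        auto it = factories_.find(key);
        if (it == factories_.end()) {
            throw std::invalid_argument("unknown chemistry backend: " + key);
        }
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};
```

With this shape, adding another toolkit means registering one more factory; code that calls create() and works through ChemistryBackend stays unchanged.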

To allow compilation on different platforms, we use CMake. It is highly automated, configurable and easy to use, and it lets us generate makefiles for the required platforms. Our pipeline relies on the STL and the Boost C++ libraries. Boost allows us to write cross-platform code and concentrate on core features during development, which makes the code more robust.

To achieve high performance and distribute complex calculations, we use the TBB (Threading Building Blocks) C++ library developed by Intel. We chose TBB over the alternatives for its performance; it is also cross-platform and compiler-independent, and its components can be used separately. It has an intuitive API and good documentation, which is very important. In addition, calculations can be distributed over a cluster, because the pipeline can run its parts independently, with each part consuming the settings for its particular node specified in JSON format.
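The idea of core-level distribution can be illustrated with a small sketch. It deliberately uses plain std::thread from the standard library rather than TBB, so that it has no external dependencies; in TBB the same pattern would normally be expressed with a parallel loop over a range. The function name parallelScore and its parameters are hypothetical and are not taken from the pipeline.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Applies `score` to every input by splitting the index range into
// contiguous chunks, one per worker thread. Each worker writes only to its
// own slice of `out`, so no locking is needed. `score` must not throw.
std::vector<double> parallelScore(
    const std::vector<double>& inputs,
    const std::function<double(double)>& score,
    unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 1;  // hardware_concurrency() may return 0

    const std::size_t n = inputs.size();
    std::vector<double> out(n);
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        pool.emplace_back([&inputs, &out, &score, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = score(inputs[i]);
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return out;
}
```

A library such as TBB adds work stealing and load balancing on top of this basic chunking, which matters when individual items take very different amounts of time.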

Input and output data come in different types and formats. To handle large amounts of text-based and binary data, we use cross-platform compression libraries, including zlib and bzip2. A typical way of sharing data in the scientific community is the HDF file format. It is a binary format designed for large datasets, and it supports compression. We use the HDF C++ library provided by The HDF Group, which also has R and Python implementations. Unfortunately, this library is not multi-threaded, and its thread-safety support is not stable and relies on the pthread library, which is not available on all platforms. We had to introduce an additional wrapper layer that makes multi-threaded use of the library safe.

Another option for handling large amounts of data is Redis. It is a very simple, high-performance key-value store that lets us avoid the complexity of traditional relational databases. Redis clients exist for different languages, including C++, Python and R. We use the official hiredis C library. The Windows version of this library has certain limitations, which we overcame by improving it.

Classical machine learning algorithms are implemented within the pipeline. We also use additional libraries to extend the set of available algorithms. One of them is libsvm, a Support Vector Machine library, which we adapted to the pipeline with a variety of fixes and improvements. Another library that we plan to integrate is the MultiBoost learner library.

Automated testing of the different algorithms is an important part of the development process. We use the Boost Test framework because it is part of the Boost C++ libraries and integrates easily with CMake.

There are also other post-processing steps written in R, Java and Python and developed by other scientific groups. In the future we plan to allow interfacing with the pipeline by using SWIG.

Results / Benefits

Our solution makes it possible to handle large amounts of data provided in the different types and formats that are common in the scientific community. The pipeline uses different machine learning algorithms and allows complex calculations to be distributed over multiple cores or even a cluster.

Using C++ as the main development language together with CMake makes our solution cross-platform. All libraries were chosen so that we can reduce development time and, in the future, introduce an additional interfacing layer for languages such as R and Python, which is very important for the scientific community.

