Garden BLOOM: Branching Lookup Optimized for Organic Molecules
Introduction
We're excited to announce the release of our new search engine for Markush structure search over small molecules. We've talked to a number of patent lawyers working in life sciences and small molecules who are frustrated by the lack of thorough, accurate patent search for a structure they are interested in. With the rise of AI-generated structures accelerating drug discovery, it's more important than ever that the diligence side of life science patents accelerates just as fast [1].
Searching accurately and efficiently is necessary for focused drug discovery, because these searches are needed not only during the final patent process but also to steer research direction. For instance, a drug company might be interested in three different kinds of Markush structures. Rather than spending time and money pursuing each one to see which is most scientifically valid, the company can run a structure search to learn which of the candidate structures have already been "covered" in the patent space.
What is a Markush structure?
A Markush structure is a representation of a set of small molecules in which a core set of atoms carries R-branches. These R-branches are stand-ins that allow a chemist to describe a broad class of molecules. This variability enables the protection of entire families of compounds from a legal standpoint [2]. Fig. 1 shows a representation of the Markush string [R2]c1ccc([R1])cc1. This Markush represents all molecules that contain an aromatic six-membered carbon ring with two R-branches attached to carbons on opposite sides of the ring. Each R-branch could be a single atom or a chain of atoms, and patent lawyers might want to specify constraints for each branch.
Figure 1: 2-D representation of a Markush query with two R-branches off a carbon ring.
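To make the "set of molecules" idea concrete, here is a minimal sketch that expands the Markush string from Fig. 1 into concrete SMILES strings by plain text substitution. The substituent lists are hypothetical and chosen only for illustration, and they are restricted to single atoms so that the attachment point is unambiguous. Real R-group definitions in patents are far richer than this.

```python
from itertools import product

MARKUSH = "[R2]c1ccc([R1])cc1"

# Hypothetical substituent choices, single atoms only (an assumption for this sketch).
R_CHOICES = {
    "R1": ["F", "Cl"],
    "R2": ["O", "N"],
}


def enumerate_members(markush, choices):
    """Substitute every combination of R-group choices into the Markush string."""
    labels = sorted(choices)
    members = []
    for combo in product(*(choices[label] for label in labels)):
        smiles = markush
        for label, substituent in zip(labels, combo):
            # "[R1]" includes the closing bracket, so it cannot collide with "[R10]".
            smiles = smiles.replace(f"[{label}]", substituent)
        members.append(smiles)
    return members


members = enumerate_members(MARKUSH, R_CHOICES)
for smiles in members:
    print(smiles)
```

Two choices for each of the two R-branches give four concrete molecules, for example Oc1ccc(F)cc1. The count grows multiplicatively with each additional branch or substituent, which is why a single Markush claim can cover a very large family of compounds.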
Challenges with Markush Structure Search
Structure search is difficult because SMILES strings can be long, and searching across millions of them can take a large amount of computation time [3]. The same structure can also be written as a string in several different ways, which adds complexity to search. A simple example is ethanol, which has the SMILES string CCO but can also be written as C(O)C and OC(C). As a result, it's important to canonicalize SMILES strings, that is, to rewrite each one in a single standard form. Doing so reduces redundancy, but it still does not address structural details in the molecule.
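The ethanol example can be sketched in a few lines. The code below is a toy canonicalizer, not a production one: it assumes a tiny SMILES subset (single-letter atoms and parentheses, no rings, charges, bond symbols, or stereochemistry), and the function names are made up for this illustration. It parses each string into a graph and writes it back out in one deterministic form.

```python
def parse_acyclic_smiles(smiles):
    """Parse a tiny SMILES subset: single-letter atoms and parentheses only."""
    atoms, bonds, stack, prev = [], [], [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)          # remember the branch point
        elif ch == ")":
            prev = stack.pop()          # return to the branch point
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported character {ch!r} in this toy parser")
    return atoms, bonds


def canonical_form(smiles):
    """Return one deterministic string per acyclic molecular graph."""
    atoms, bonds = parse_acyclic_smiles(smiles)
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def write(atom, parent):
        # Sort the branch strings so the output does not depend on input order.
        children = sorted(write(n, atom) for n in neighbors[atom] if n != parent)
        if not children:
            return atoms[atom]
        branches = "".join(f"({c})" for c in children[:-1])
        return atoms[atom] + branches + children[-1]

    # Try every atom as the starting point and keep the smallest string.
    return min(write(root, None) for root in neighbors)


ethanol_forms = {canonical_form(s) for s in ("CCO", "C(O)C", "OC(C)")}
print(ethanol_forms)
print(canonical_form("CCCO"), canonical_form("CC(O)C"))
```

All three spellings of ethanol collapse to a single string, while 1-propanol and 2-propanol stay distinct. Handling rings, aromaticity, and stereochemistry requires considerably more machinery, which is why this step is normally delegated to a cheminformatics toolkit.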
Core extraction is a popular method used in Markush search today. The idea is to take out the core atoms of the Markush query in groups and return any SMILES strings that match a particular group or all of the groups. Core extraction is a common precursor to substructure search, where a central scaffold is defined and used for substructure matching, and it is used by well-known chemical patent databases such as those of the World Intellectual Property Organization and SureChEMBL [3].
Fingerprint matching is a similar idea: it converts molecular structures into vectorized representations and uses them to search across a large corpus of SMILES strings [4][5]. There are different kinds of fingerprint algorithms, such as substructure key-based and topological fingerprints [6]. Although reducing the complexity of structures can speed up search, it naturally affects accuracy. Fingerprint matching may ignore edge connectivity or context when aggregating features, and it suffers most when it has to capture subtle changes around local features, which is an important requirement in Markush search. A single R-branch can make the difference between a completely novel molecule and one that has been cited countless times in prior literature. In practice, it has been shown that all of these kinds of fingerprints are sometimes unable to distinguish between different chemical environments [7].
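The loss of connectivity information can be shown with a deliberately crude fingerprint. The sketch below is an illustration only and is not one of the fingerprint algorithms cited above: it records which elements and which bonded element pairs occur in a molecule, then compares two molecules with the Tanimoto coefficient. The molecules are written by hand as atom and bond lists.

```python
def pair_fingerprint(atoms, bonds):
    """Crude fingerprint: the set of elements and bonded element pairs present."""
    features = set(atoms)
    for a, b in bonds:
        features.add("-".join(sorted((atoms[a], atoms[b]))))
    return frozenset(features)


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0


def degree_sequence(atoms, bonds):
    """Sorted neighbour counts, a quick way to see that two graphs differ."""
    degrees = [0] * len(atoms)
    for a, b in bonds:
        degrees[a] += 1
        degrees[b] += 1
    return sorted(degrees)


# 1-propanol (CCCO) and 2-propanol (CC(O)C), heavy atoms only.
propan_1_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
propan_2_ol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])

similarity = tanimoto(pair_fingerprint(*propan_1_ol), pair_fingerprint(*propan_2_ol))
print(similarity)
print(degree_sequence(*propan_1_ol), degree_sequence(*propan_2_ol))
```

The two alcohols receive identical fingerprints, so the similarity is 1.0 even though the hydroxyl group sits on a different carbon. Real fingerprints encode far more context than this, but the same kind of collision is what makes them risky when a single branch position decides whether a molecule falls inside a Markush claim.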
Verification is also a challenge in Markush search. Even if a string-based search algorithm finds candidate matches for a Markush structure, each returned SMILES string still has to be checked against the pattern atom by atom.
To illustrate the difficulty of this problem, as well as the ineffectiveness of the solutions described above, we built a string-based search algorithm similar to the ones used by competitors today. Fig. 2 shows two examples of search results returned by the algorithm for the simple carbon ring with two R-branches shown in Fig. 1.
(a) In this false match, the SMILES string does have a carbon ring but has three branches while our Markush structure only asked for two branches.
(b) In this false match, the SMILES string has the same issue as Fig. 2a but the branches are also too close together.
Figure 2: Two false positives from core extraction string search, showcasing some of the pitfalls that may arise.
The core extraction string search struggles with nuanced structural details. For instance, although it finds results with carbon rings, it fails to respect the requirements on the number of branches bonded to the ring as well as their positioning. Results like this waste hours of patent lawyers' time, forcing them to verify structures by hand and delaying important strategic decisions about research direction or filing a patent draft.
Our Approach
Garden BLOOM presents a novel approach to Markush search, using a graph-based method to compare a query Markush structure across a database of SMILES strings. BLOOM is able to understand structural complexities across the set of SMILES strings it is trying to match against, thereby avoiding the issues that standard string search or fingerprint search have in other competitor products. Unlike traditional approaches, BLOOM takes an agentic approach to traversal. As it compares a query Markush structure against a candidate molecule, BLOOM dynamically evaluates the match path, identifying early stop conditions through distinguishing atomic and bond features that allow it to short-circuit invalid candidates. This results in significant speedups without sacrificing accuracy, an essential capability when operating over millions of candidate structures.
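The source does not describe BLOOM's internals, so the following is not BLOOM's algorithm. It is a generic backtracking graph match, included only to illustrate what comparing a query graph against a candidate with early stop conditions can look like. The matching rules are assumptions made for this sketch: an "R" atom matches any atom, every other query atom must match both element and neighbour count (so the core may not carry extra branches), bond orders and aromaticity are ignored, and molecules are hand-written heavy-atom graphs.

```python
def build_adjacency(atoms, bonds):
    """Map each atom index to the set of its bonded neighbours."""
    adjacency = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def markush_match(query, candidate):
    """Return a query->candidate atom mapping, or None. Query must be connected."""
    q_atoms, q_bonds = query
    c_atoms, c_bonds = candidate
    if len(c_atoms) < len(q_atoms):
        return None  # early stop: candidate is too small to contain the query
    q_adj = build_adjacency(q_atoms, q_bonds)
    c_adj = build_adjacency(c_atoms, c_bonds)

    # Visit query atoms so that each new atom touches an already placed one.
    order = [0]
    while len(order) < len(q_atoms):
        order.append(next(n for p in order for n in sorted(q_adj[p]) if n not in order))

    def compatible(q, c):
        if q_atoms[q] == "R":
            return True  # wildcard branch
        return q_atoms[q] == c_atoms[c] and len(q_adj[q]) == len(c_adj[c])

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        placed = [mapping[n] for n in q_adj[q] if n in mapping]
        # Only atoms bonded to the images of all placed neighbours can fit.
        pool = set.intersection(*(c_adj[c] for c in placed)) if placed else set(c_adj)
        for c in sorted(pool - set(mapping.values())):
            if not compatible(q, c):
                continue  # early stop: prune this path before recursing
            mapping[q] = c
            found = extend(mapping)
            if found is not None:
                return found
            del mapping[q]
        return None

    return extend({})


RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
QUERY = (["C"] * 6 + ["R", "R"], RING + [(0, 6), (3, 7)])          # Fig. 1 pattern
PARA = (["C"] * 6 + ["O", "F"], RING + [(0, 6), (3, 7)])           # two opposite branches
META = (["C"] * 6 + ["O", "F"], RING + [(0, 6), (2, 7)])           # branches too close
TRISUB = (["C"] * 6 + ["O", "F", "Cl"], RING + [(0, 6), (3, 7), (1, 8)])  # three branches

for name, mol in (("para", PARA), ("meta", META), ("trisubstituted", TRISUB)):
    print(name, markush_match(QUERY, mol))
```

Under these assumed rules, only the first candidate matches. The other two correspond to the kinds of false positives shown in Fig. 2: a ring with a third branch, and a ring whose branches are in the wrong positions. The returned mapping also doubles as a verification record, since it states which candidate atom plays the role of each query atom.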
Graph-based methods have been used in Markush search before, with approaches that first reduce molecules in a manner similar to core extraction and then match afterwards [5][8]. However, reduction-based graph algorithms remove potentially important local features in structures, leading to a higher rate of false positives or missed results.
We tested Garden BLOOM against an implementation of core extraction string search commonly found in products from competitors across three different Markush queries:
- [R1][C@H](NC(=O)O[R2])C(=O)N1CCCC1c1ncc([R3])[nH]1
- C12=C(C=CC=C1N(C(CO2)=O)[H])N(C(=O)N(CC3=C(N=C(C=C3)[R1])[R2])[H])[H]
- CCCC([R1])([R2])c1ccc(N(c2ccc(C(C)CC)cc2)c2ccc(C(C)(C)CCC)cc2)cc1
The third Markush query was intentionally derived from a single SMILES string in the database, with the goal that each algorithm find exactly one hit among millions of SMILES strings. R-groups are written in brackets with a numeric identifier to distinguish between branches (e.g. [R1], [R2]). Both algorithms were run on the same architecture and system to isolate how effective each is at searching. We measured the following:
- Average computation time per comparison: this measures how long it takes on average to compare one SMILES string to the given Markush query, and has an implication on the total runtime of the entire algorithm.
- Accuracy: whether the SMILES strings returned by the search are actual matches. This tests for false positives.
- Unique matches: whether either method finds SMILES strings that the other does not, which shows where each algorithm covers parts of the search space that the other misses.
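The three measurements above can be sketched with ordinary set arithmetic. Everything in this example is hypothetical: the function names, the toy database, the substring matcher standing in for a real search algorithm, and the "verified" set of true matches. It is meant only to pin down what each metric computes.

```python
import time


def run_search(match_fn, query, database):
    """Run match_fn over the database; return (hits, mean milliseconds per comparison)."""
    start = time.perf_counter()
    hits = {smiles for smiles in database if match_fn(query, smiles)}
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return hits, elapsed_ms / len(database)


def precision(hits, verified):
    """Fraction of returned hits that are true matches (None if nothing was returned)."""
    return len(hits & verified) / len(hits) if hits else None


def unique_matches(hits_a, hits_b, verified):
    """True matches found by method A that method B missed."""
    return (hits_a & verified) - hits_b


# Toy data, invented for illustration.
database = ["CCO", "CCN", "CCCO", "OCC"]
verified = {"CCO", "CCCO", "OCC"}            # assumed ground truth for a C-O query
hits_a, ms_a = run_search(lambda q, s: q in s, "CO", database)  # stand-in matcher
hits_b = {"CCO", "CCN"}                      # pretend output of a second method

print(precision(hits_a, verified), precision(hits_b, verified))
print(unique_matches(hits_a, hits_b, verified), unique_matches(hits_b, hits_a, verified))
```

Note that the stand-in substring matcher misses OCC, which is the same molecule as CCO written in the opposite direction. Precision alone would not reveal that gap, which is why the unique-matches comparison is measured separately.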
We verify the accuracy of each algorithm using a variant of our graph-based method, which now only needs to evaluate on a fixed set of SMILES strings rather than across our entire database of millions.
Figure 3: A comparison between Garden BLOOM and a standard core extraction string search algorithm. Both algorithms were tested on the same architecture for a pure comparison. BLOOM dominates across all three metrics that measure speed, accuracy, and search result quality.
Across the three Markush queries, BLOOM averages a 32.44x speed improvement, taking on average 0.047 milliseconds per comparison, whereas string search takes on average 1.491 milliseconds. When scaled to millions of comparisons across a database of SMILES strings, this is a massive improvement in speed.
Moreover, in terms of accuracy, BLOOM finds matches for all three Markush queries, while the standard string search algorithm finds a match for only one of them. Even for the second query, where the string search algorithm does find a match, it misses the other 8 matching SMILES strings that BLOOM finds. As a result, BLOOM thoroughly covers parts of the search space that standard string algorithms fail to reach. As mentioned earlier, the third Markush query was designed to have only one search result, which BLOOM finds but the string search does not.
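As a back-of-the-envelope check of what these per-comparison times mean at scale, the sketch below multiplies them out for a database of one million SMILES strings. The database size is an assumption chosen for round numbers, and the calculation ignores any indexing, batching, or parallelism.

```python
BLOOM_MS = 0.047        # average milliseconds per comparison, from the text
STRING_MS = 1.491       # average milliseconds per comparison, from the text
N_COMPARISONS = 1_000_000  # assumed database size for illustration


def total_seconds(ms_per_comparison, n_comparisons):
    """Sequential wall-clock time for n comparisons, in seconds."""
    return ms_per_comparison * n_comparisons / 1000.0


bloom_s = total_seconds(BLOOM_MS, N_COMPARISONS)
string_s = total_seconds(STRING_MS, N_COMPARISONS)
ratio_of_means = STRING_MS / BLOOM_MS

print(f"BLOOM:         {bloom_s:.0f} s")
print(f"String search: {string_s:.0f} s ({string_s / 60:.2f} min)")
print(f"Ratio of the two averages: {ratio_of_means:.2f}x")
```

Under this assumption a single query takes about 47 seconds with BLOOM versus roughly 25 minutes with string search. The ratio of the two averages works out to about 31.7x, slightly below the reported 32.44x. One possible explanation, which the source does not confirm, is that the reported figure is the mean of the three per-query speedups, and a mean of ratios generally differs from a ratio of means.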
Taking a closer look at the second Markush query where the string search was able to return one match, we investigate the kinds of SMILES strings BLOOM found which the string search was unable to. This particular molecule has 60 atoms, giving traditional string search or core extraction approaches plenty of opportunities to mess up. BLOOM instead is able to match the given query exactly onto the large SMILES string. The color coded mapping illustrates that BLOOM can also help with verification of searches once done, providing important validation to any set of search results and saving the patent lawyer from having to do so themselves.
(a) Markush structure for the second query. There are two R-branches off the left hand ring.
(b) A large SMILES molecule that contains the Markush structure queried. BLOOM is able to match across local features at speed.
Figure 4: A SMILES string in our database that BLOOM finds which the string search algorithm was unable to find for the second Markush query.
Two other SMILES results for this Markush query are:
- O=C1COc2c(cccc2NC(=O)NCc2ccc(C(F)(F)F)nc2N2CCCCC2)N1 as seen in Fig. 5a
- O=C1COc2c(cccc2NC(=O)NCc2ccc(C(F)(F)F)nc2C2CCCCC2)N1 as seen in Fig. 5b
(a) Here the two R-branches in the Markush query are the two grayed out branches.
(b) Very similar structure to Fig. 5a, but still gets identified by BLOOM, which does not get confused between the two.
Figure 5: Two more SMILES strings in our database that BLOOM finds which the string search algorithm was unable to find for the second Markush query.
These two structures are incredibly close to each other, differing by just one atom (check out the size and composition of the grey ring in both images). Once again, note that BLOOM adheres to bond requirements as well as atom requirements, and can find matches across a wide range of SMILES strings, no matter how similar they might be.
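The one-atom difference can be confirmed directly from the two SMILES strings listed above. This is a plain character comparison, which only works here because both strings happen to be written in the same atom order. In general, comparing molecules requires comparing their graphs, not their strings.

```python
SMILES_5A = "O=C1COc2c(cccc2NC(=O)NCc2ccc(C(F)(F)F)nc2N2CCCCC2)N1"
SMILES_5B = "O=C1COc2c(cccc2NC(=O)NCc2ccc(C(F)(F)F)nc2C2CCCCC2)N1"


def char_differences(a, b):
    """List (position, char_in_a, char_in_b) wherever two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length for this comparison")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


differences = char_differences(SMILES_5A, SMILES_5B)
for position, atom_a, atom_b in differences:
    print(f"position {position}: {atom_a} in Fig. 5a, {atom_b} in Fig. 5b")
```

The only difference is the atom that attaches the saturated six-membered ring to the pyridine: a nitrogen in Fig. 5a and a carbon in Fig. 5b.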
Integration with Real World Workflows
We've already integrated BLOOM with our world-class patent database, extracting the relevant patents for each SMILES result for the user. Moreover, the user can converse with our Garden AI agent to quickly understand technical details across the patents returned in the search and prune their search even further. BLOOM easily fits into any patent lawyer's workflow, regardless of whether they're making a series of searches or individual ones shaped by exploration.
Garden BLOOM ushers in a new state-of-the-art search system for organic molecules, allowing patent lawyers to save countless hours during freedom-to-operate (FTO) analysis, molecule searches, and exploration. Other approaches can return false positives, which are only exposed if the results are verified. When dealing with result sets of thousands of molecules, that verification can be highly expensive. BLOOM is not only incredibly accurate but also has built-in verification, so lawyers can execute on important decisions faster and with more clarity.
Conclusion
Garden BLOOM offers state-of-the-art Markush structure search capabilities, allowing users to thoroughly and accurately find matching SMILES strings from our patent database. Reach out to sales@gardenintel.com if you are interested in trying it out.
References
- [1] Yaorui Shi, Sihang Li, Taiyan Zhang, Xi Fang, Jiankun Wang, Zhiyuan Liu, Guojiang Zhao, Zhengdan Zhu, Zhifeng Gao, Renxin Zhong, Linfeng Zhang, Guolin Ke, Weinan E, Hengxing Cai, and Xiang Wang. Intelligent system for automated molecular patent infringement assessment, 2025.
- [2] Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Haoyi Tao, Nan Wang, Lin Yao, Linfeng Zhang, and Guolin Ke. Molparser: End-to-end visual recognition of molecule structures in the wild, 2025.
- [3] Pei-Hua Wang, Jen-Hao Chen, and Yufeng Jane Tseng. Author correction: Intelligent pharmaceutical patent search on a near-term gate-based quantum computer. Scientific Reports, 12(1):2033, Feb 2022. Published Erratum to: Sci Rep. 2022 Jan 7;12(1):175. doi: 10.1038/s41598-021-04031-y. PMID: 34997034.
- [4] John M Barnard, Geoff M Downs, Annette von Scholley-Pfab, and Robert D Brown. Use of markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries. Journal of Molecular Graphics and Modelling, 18(4):452–463, 2000.
- [5] Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Luc Van Gool, and Peter W. J. Staar. Subgrapher: Visual fingerprinting of chemical structures, 2025.
- [6] Adrià Cereto-Massagué, María José Ojeda, Cristina Valls, Miquel Mulero, Santiago Garcia-Vallvé, and Gerard Pujadas. Molecular fingerprint similarity search in virtual screening. Methods, 71:58–63, 2015. Virtual Screening.
- [7] Behnam Parsaeifard, Deb Sankar De, Anders S Christensen, Felix A Faber, Emir Kocer, Sandip De, Jörg Behler, O Anatole von Lilienfeld, and Stefan Goedecker. An assessment of the structural resolution of various fingerprints commonly used in machine learning. Machine Learning: Science and Technology, 2(1):015018, March 2021.
- [8] Winfried Dethlefsen, Michael F. Lynch, Valerie J. Gillet, Geoffrey M. Downs, John D. Holliday, and John M. Barnard. Computer storage and retrieval of generic chemical structures in patents. 11. theoretical aspects of the use of structure languages in a retrieval system. Journal of Chemical Information and Computer Sciences, 31(2):233–253, May 1991.