Our client had a vast, but uncategorized library of more than 200 millions of files they use in the process of producing videos. The lack of common naming conventions and the fact that there were multiple teams working on the project made the search for a needed file a daunting task that could take hours.
A typical search relies on searching for an appropriate result in the entirety of the database. This search needs to run through the database several times in order to find the most fitting matches which, depending on the size of the base, can take anything from several minutes to several years. We chose to categorize all of the files so that se search does not have to rely on browsing through the entirety of the DB but rather explores the index for matching results.
Ordinary search is incapable of understanding context or looking for words and phrases that have a similar meaning. A full-text search that codes and then indexes the entire database, on the other hand, is much more capable of delivering satisfying results. The process works in a similar way to Google’s fuzzy search algorithms.
After careful consideration, we’ve come to the conclusion that Apache Lucene is the most suitable technology to build a full-text search on. The technology is implemented in products like Jira for reasons that are similar to our client’s use cases – a base to build advanced categorization and indexing algorithms , and plenty of adapters for multiple programming languages. This choice has greatly improved time-to-market delivery and saved our client nearly 20% off the estimated project’s scope.