Big Data Search Mechanisms

Client: A CG studio that creates special effects for large studios
Project brief: Our client had a vast but uncategorized library of more than 200 million files that they use when producing videos. With no common naming conventions and multiple teams working on each project, finding a needed file was a daunting task that could take hours.
Result: We developed a solution based on full-text search algorithms that cut the time needed to find a file from hours to milliseconds.

Business challenges

Our client makes computer graphics and special effects for large filmmaking companies. The process requires adding effects to every frame of a video clip, and each frame is then saved as a separate file. Over the years, our client has accumulated more than 200 million of these files, all of which were held on physical tapes because of established filmmaking practices.

All of these tapes were made and categorized so that the right film could be shipped to the client when the time came. This process broke down, however, because our client worked with multiple teams that did not follow the same patterns when naming the tapes. As a result, finding the right file took hours, as it meant searching through unstructured data.

Catware’s solutions

A simple search by name or by common symbols in a file’s name is a daunting and time-consuming task because the basic algorithm does not understand context. It also misses similar files that could match the user’s intent when their names do not contain an exact keyword or phrase match.

This is why we chose to implement full-text search based on Apache Lucene.

Data categorization

A typical search looks for matches across the entire database. It may need to pass through the data several times to find the best matches, which, depending on the size of the database, can take anywhere from several minutes to far longer. We chose to categorize all of the files so that the search does not have to browse the entire database but instead looks up matching results in an index.
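The difference between scanning everything and consulting an index can be illustrated with a minimal inverted index. This is only a sketch of the general technique, not the client’s implementation or Lucene’s internals; the file names and the helper functions `tokenize`, `build_index`, and `search` are hypothetical and invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(names):
    """Map each token to the set of positions of the names that contain it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(doc_id)
    return index


def search(index, names, query):
    """Return the names that contain every query token, using index lookups only."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # One posting set per query token; a name must appear in all of them.
    postings = [index.get(token, set()) for token in tokens]
    hits = set.intersection(*postings)
    return [names[i] for i in sorted(hits)]


# Hypothetical file names with inconsistent naming conventions.
FILES = [
    "ShotA_explosion_frame_0001.exr",
    "shot-a.explosion.0002.EXR",
    "teamB_rain_comp_0001.exr",
]
INDEX = build_index(FILES)

print(search(INDEX, FILES, "explosion exr"))
```

The index is built once, up front. Each query then touches only the posting sets of its own tokens rather than every file name, which is why lookups stay fast as the library grows.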

Full-text search

Ordinary search is incapable of understanding context or of finding words and phrases that have a similar meaning. A full-text search that first encodes and then indexes the entire database, on the other hand, is far more capable of delivering satisfying results. The process works in a similar way to Google’s fuzzy search algorithms.
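As a rough illustration of fuzzy matching on top of an index, the sketch below expands each query term to the closest tokens in the index vocabulary before looking them up. It assumes Python’s standard `difflib` similarity ratio as a stand-in for a real engine’s matching logic, so it is not how Lucene or Google score results; the file names, the `cutoff` value, and the helper functions are hypothetical.

```python
import difflib
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(names):
    """Map each token to the set of positions of the names that contain it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(doc_id)
    return index


def fuzzy_search(index, names, query, cutoff=0.8):
    """Return names in which every query term has a close match in the index."""
    vocabulary = list(index)
    hits = None
    for term in tokenize(query):
        # Expand the term to similar tokens in the vocabulary (assumed threshold).
        close = difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff)
        docs = set()
        for token in close:
            docs |= index[token]
        # Every query term must be satisfied by the same name.
        hits = docs if hits is None else hits & docs
    return [names[i] for i in sorted(hits or [])]


# Hypothetical file names, invented for illustration.
FILES = [
    "ShotA_explosion_frame_0001.exr",
    "shot-a.explosion.0002.EXR",
    "teamB_rain_comp_0001.exr",
]
INDEX = build_index(FILES)

# The misspelled query still finds the explosion frames.
print(fuzzy_search(INDEX, FILES, "explosoin"))
```

Lowering `cutoff` tolerates more typos at the cost of more false positives, so the threshold would need tuning against real file names.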

Apache Lucene

After careful consideration, we concluded that Apache Lucene was the most suitable technology on which to build a full-text search. It is used in products such as Jira for reasons similar to our client’s use case: it provides a base for building advanced categorization and indexing algorithms, and adapters are available for multiple programming languages. This choice greatly improved time to market and cut nearly 20% off the project’s estimated scope.

Open-source

We deliberately chose the open-source route, as this approach offers our client a series of significant benefits:

No software licensing fees or hidden costs
Vast community that keeps the technology updated and relevant
Great documentation that simplifies the learning curve for any team tasked with maintaining the solution
A large market of experienced developers that minimizes the risks of vendor lock-in

The gains

Once Catware’s engineers had finished processing the client’s pool of data, the client was left with a simple, lightweight solution that saves them hours every day while offering a series of unexpected benefits:

A pool of well-structured, categorized, and indexed data
Search that is done in milliseconds
Simplicity of use, as ports and bindings of Apache Lucene are available for many programming languages
A vast community supporting the technology

Result

We have delivered a smart, fast, and robust big data search system that streamlines the client’s workflows and shaves hours off their employees’ daily routines.

Before

Uncategorized data
Poorly implemented search algorithms
Searching for the right file took hours

After

Well-structured and indexed data
Full-text search that understands context
Relevant search results in milliseconds

I enjoyed my cooperation with Catware. They were fast to deliver their solution and the pricing was reasonable. But most importantly, I was impressed with the team's proactive approach.
Torulf Henriksen
CTO, Partner at Storyline
