Big Data Search Mechanisms

  • Client: A post-production and rental facility that specializes in VFX, grading, sound effects, and more for films, TV drama, and adverts.
  • Project brief: Our client had a vast but uncategorized library of more than 200 million files used in video production. With no common naming conventions and multiple teams working on each project, finding a given file was a daunting task that could take hours.
  • Result: We developed a solution based on full-text search that cut the time needed to find a file from hours to milliseconds.

Business challenges

Our client is a full-service facility for film, drama, and ad production, with both national and international customers. Their work involves adding effects to every frame of a video clip, and each frame is then saved as a separate file. Over the years, our client has accumulated more than 200 million of these files, all held on physical tapes in line with established filmmaking practice.

All of these tapes were labeled and categorized so that, when the time comes, the right film can be shipped to the customer. This process broke down, however, because our client worked with multiple teams that did not follow the same conventions when naming the tapes. As a result, finding the right file took hours, as it meant searching through unstructured data.

Catware’s solutions

A simple search by name, or by common symbols in a file’s name, is daunting and time-consuming because the basic algorithm does not understand context. It also misses similar files that could match the user’s intent but lack an exact keyword or phrase match in their name.

This is why we chose to implement a full-text search based on Apache Lucene.

Data categorization

A typical search looks for matching results across the entire database. It may need to scan the database several times to find the best matches, which, depending on the size of the database, can take anywhere from minutes to hours. We chose to categorize and index all of the files so that the search does not have to browse the entire database and instead looks up matching results in the index.
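
The difference between scanning every record and consulting an index can be illustrated with a small inverted index. This is only a minimal sketch in plain Python, not the delivered Lucene-based implementation; the file names and the `tokenize`, `build_index`, and `search` helpers are hypothetical and exist purely for illustration.

```python
import re
from collections import defaultdict


def tokenize(name: str) -> list[str]:
    """Split a file name into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def build_index(names: list[str]) -> dict[str, set[int]]:
    """Map each token to the positions of the file names that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[int]], names: list[str], query: str) -> list[str]:
    """Return the file names that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Only the postings of the query tokens are read; the full list is never scanned.
    postings = [index.get(token, set()) for token in tokens]
    hits = set.intersection(*postings)
    return [names[i] for i in sorted(hits)]


# Hypothetical file names with inconsistent naming conventions.
FILES = [
    "ProjectA_shot012_grade_v3.dpx",
    "projecta-SHOT012-sfx.wav",
    "ProjectB_shot007_grade_v1.dpx",
]
INDEX = build_index(FILES)

print(search(INDEX, FILES, "projecta shot012"))
```

Because the names are normalized into tokens before indexing, a query such as `projecta shot012` finds both files for that shot even though the two teams used different separators and letter case.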

Full-text search

An ordinary search cannot understand context or look for words and phrases with a similar meaning. A full-text search that encodes and then indexes the entire database, on the other hand, is far better at delivering relevant results. The process works in a similar way to the fuzzy matching used by search engines such as Google.
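
To give a rough idea of what fuzzy matching means in practice, the sketch below maps a misspelled query term to the closest terms in an index vocabulary. It uses Python's standard-library `difflib` as a stand-in and does not reproduce how Lucene scores fuzzy matches; the vocabulary, the `fuzzy_lookup` helper, and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib


def fuzzy_lookup(term: str, vocabulary: list[str], cutoff: float = 0.8) -> list[str]:
    """Return up to three vocabulary terms that closely resemble the query term."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


# Hypothetical vocabulary of indexed terms.
VOCAB = ["grade", "grading", "sound", "explosion", "composite"]

# A typo in the query still resolves to an indexed term.
print(fuzzy_lookup("explotion", VOCAB))
```

A lower cutoff tolerates more typos but also returns more unrelated terms, so the threshold is a trade-off that would need tuning against real queries.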

Apache Lucene

After careful consideration, we concluded that Apache Lucene is the most suitable technology on which to build a full-text search. It is used in products like Jira for reasons similar to our client’s use case: it offers a base for building advanced categorization and indexing algorithms, along with adapters for multiple programming languages. This choice greatly improved time to market and saved our client nearly 20% of the project’s estimated scope.

Open-source

We deliberately chose the open-source route, as it offers our client a series of significant benefits:

  • No software licensing fees or hidden costs
  • Vast community that keeps the technology updated and relevant
  • Great documentation that eases the learning curve for any team tasked with maintaining the solution
  • A large market of experienced developers that minimizes the risks of vendor lock-in

The gains

  • A pool of well-structured, categorized, and indexed data
  • Search that is done in milliseconds
  • Simplicity of use
  • A vast community supporting the technology

Result

We have delivered a smart, fast, and robust big data search system that streamlines the client’s workflows and shaves hours off their employees’ daily routines.

Before

  • Uncategorized data
  • Poorly implemented search algorithms
  • Searching for the right file took hours

After

  • Well-structured and indexed data
  • Full-text search that understands context
  • Relevant search results in milliseconds

Get in touch

Investing in new high-end software solutions may seem like a risky step, especially for a business that is not focused on IT. The right partner walks you and your team through the experience, takes away the risks, and empowers you to make quick yet informed, data-driven decisions, greatly improving the chances of successful software adoption. Future-proof your business today!

Alexandra Khrenova
Chief Business Development Officer