Big Data Search Mechanisms

The system

Our client had a vast but uncategorized library of more than 200 million files used in video production. With no common naming conventions and multiple teams working on the project, finding a needed file was a daunting task that could take hours.


Scope of work

  • Data categorization

    A naive search scans the entire database for matching results. It may need several passes over the data to find the best matches, which, depending on the size of the database, can take anywhere from several minutes to, as in our client's case, hours. We chose to categorize all of the files so that the search no longer browses the entire database but instead looks up matching results in an index.

  • Full-text search

    An ordinary keyword search cannot take context into account or find words and phrases with a similar meaning. A full-text search, which first processes and then indexes the entire database, is much more capable of delivering relevant results. The approach is comparable to the fuzzy matching offered by web search engines such as Google.

  • Apache Lucene

    After careful consideration, we concluded that Apache Lucene was the most suitable technology on which to build the full-text search. Lucene is used in products like Jira for reasons similar to our client’s use case: it provides a base for building advanced categorization and indexing algorithms, and adapters are available for multiple programming languages. This choice greatly improved time to market and saved our client nearly 20% of the estimated project scope.
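The indexing idea behind the items above can be illustrated with a minimal sketch in plain Python. It is not Lucene's API and not the project's code. The sample file names and the `tokenize`, `build_index`, and `search` helpers are all hypothetical, and a real engine adds much more (analysis, ranking, fuzzy matching). The sketch only shows why a lookup in an index avoids scanning every file on each query.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a file name or description and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(files):
    """Map each token to the set of file ids whose text contains it."""
    index = defaultdict(set)
    for file_id, text in files.items():
        for token in tokenize(text):
            index[token].add(file_id)
    return index


def search(index, query):
    """Return ids of files containing every query token, using only the index."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample data; real entries would come from the media library.
files = {
    1: "Intro_Scene_final_v2.mov",
    2: "intro scene draft audio.wav",
    3: "Outro-credits-final.mov",
}
index = build_index(files)
print(sorted(search(index, "intro scene")))  # [1, 2]
print(sorted(search(index, "final mov")))    # [1, 3]
```

The index is built once, so each query costs a few dictionary lookups and set intersections regardless of how inconsistently the files were named.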

200 million+ files categorized
Hours were needed to complete a search before
Milliseconds are needed to complete a search after

Talented Team

  • 1 solutions architect (Oslo)
  • 2 senior Java developers (Kyiv)


Client's benefits

  • A pool of well-structured, categorized, and indexed data.

  • Search that is done in milliseconds.

  • Simplicity of use, as adapters for Apache Lucene are available for many programming languages.

  • A vast community supporting the technology.