Purpose
Our client’s data was growing daily, and manual verification of that data had become slow and inadequate. The lack of data validation resulted in inconsistencies, which caused web page errors and made the data increasingly difficult to analyse. We identified that an automated data cleaning and validation framework was required to keep up with the volume of data generated each day. The client also required that their existing data entry tools remain functional and be made compatible with the automated data processing framework we deployed.
Approach
We set up a distributed data processing pipeline to enable rapid cleaning and validation of newly generated data, as well as the migration of older, unvalidated data. The framework we developed scales up automatically with the growing volume of data by distributing tasks across available nodes. Data that fail validation are automatically labelled with specific tags detailing the errors on each field, as sketched below. These annotated errors are fed back into our client’s legacy data entry interfaces, allowing inconsistencies to be reviewed. We also reimplemented the pricing analytics on the new technology stack, using only cleaned and validated data to provide consistent output.
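To illustrate the field-level tagging step, the sketch below shows one way a validation pass can attach per-field error tags to a record. The field names, rules and ValidationResult structure are hypothetical stand-ins chosen for this example, not the client-specific rules of the framework we deployed.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    """Holds per-field error tags for a single record (illustrative only)."""
    record_id: str
    errors: dict = field(default_factory=dict)  # field name -> list of error tags

    @property
    def is_valid(self) -> bool:
        return not self.errors


def validate_record(record: dict) -> ValidationResult:
    """Run per-field checks and tag each failure with a descriptive error."""
    result = ValidationResult(record_id=str(record.get("id", "<missing>")))

    # Hypothetical rule: price must be present, numeric and non-negative.
    price = record.get("price")
    if price is None:
        result.errors.setdefault("price", []).append("missing_value")
    elif not isinstance(price, (int, float)) or price < 0:
        result.errors.setdefault("price", []).append("negative_or_non_numeric")

    # Hypothetical rule: currency must be a recognised code.
    if record.get("currency") not in {"EUR", "USD", "GBP"}:
        result.errors.setdefault("currency", []).append("unknown_currency_code")

    return result


if __name__ == "__main__":
    sample = {"id": 42, "price": -3.5, "currency": "XYZ"}
    outcome = validate_record(sample)
    # Records that fail validation keep their error tags, so a data entry
    # tool can surface them for human review.
    print(outcome.is_valid)  # False
    print(outcome.errors)    # {'price': [...], 'currency': [...]}
```

In the deployed framework, checks of this kind ran as distributed tasks and the resulting tags were written back alongside the records, which is what allowed the legacy data entry interfaces to display them for review.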
Outcome
Overall, our solution helped restructure our client’s data, improved its usability and modernised their technology stack. With our data cleaning and validation approach, the occurrence of page errors caused by inconsistent data dropped to nearly zero. The reimplemented pricing analytics ran considerably faster, and the new backend also enabled our client to plug in additional data analysis scripts and visualise their output. Our client kept their existing data entry tools, with only minor modifications to retrieve the annotated errors for human investigation.