With the growing need for and popularity of AI, crawling data from the web and summarizing it according to business requirements is one of the most common problems many teams have to deal with. The problem becomes more challenging when the data sources can be any website, or even a large set of websites, because providing a generic solution that meets all business requirements is very difficult.
I have been dealing with such problems for the last couple of years, and recently I tried an approach that I found to be reliable, robust, efficient, and easy to maintain.
The approach is the 3 M approach, where the three M's stand for the following:
- Message queue
- Microservices
- Multithreading
I used a microservices approach for separation of concerns, segregating the AI algorithms and business logic from the engineering work so that the team can maintain the application more easily. There are two main reasons behind this segregation:
- Data scientists/AI-ML researchers and product engineers usually have different specializations: the former are experts in AI/ML, while the latter are experts in infrastructure and engineering.
- For most data scientists and ML researchers, the preferred programming languages are Python and R, because they offer a huge selection of AI libraries and are fairly easy to work with compared to other languages. On the other hand, for enterprise server-side applications Java is usually preferred, as it has very robust and powerful frameworks like Spring and Hibernate, which provide plenty of features to make developers' jobs simpler.
Crawling data from the web is a tedious and unreliable process, as every website is structured differently. The time it takes to crawl one website versus another also varies significantly and depends on many factors. To make this process reliable, I used RabbitMQ to queue the requests and process them asynchronously. This approach let me process requests in a controlled way and spared users from waiting a long time while a request was being processed.
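The producer side of this pattern can be sketched as follows. This is a minimal in-process stand-in using Python's `queue.Queue`; in the real application the queue would be a RabbitMQ queue accessed through a client library such as pika, and the function name, payload fields, and job-id scheme here are assumptions for illustration only:

```python
import queue
import uuid

# In-process stand-in for the RabbitMQ queue used in the real system.
crawl_queue = queue.Queue()

def submit_crawl_request(url: str) -> str:
    """Accept a crawl request, enqueue it, and return immediately.

    The caller gets a job id right away instead of blocking while the
    slow, per-site-variable crawl runs in the background.
    """
    job_id = str(uuid.uuid4())  # hypothetical job-tracking scheme
    crawl_queue.put({"job_id": job_id, "url": url})
    return job_id

job = submit_crawl_request("https://example.com")
```

The key point is that `submit_crawl_request` does no crawling itself; it only records the work to be done, which is what keeps the user-facing request fast regardless of how long the crawl takes.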
Multithreading is used in the application to process the requests queued in the message queue in parallel, via a thread pool of configurable size. The message-queue listener continuously observes the queue and calls the thread manager to start a thread whenever one is available in the pool.
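A minimal sketch of that listener loop, using Python's standard `ThreadPoolExecutor` as the thread manager and an in-process `queue.Queue` in place of RabbitMQ (the pool size, sentinel-based shutdown, and `process_request` body are assumptions for illustration):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4  # configurable thread-pool size (assumed value)

message_queue = queue.Queue()
results = []

def process_request(msg):
    # Placeholder for the real crawl-and-summarize work.
    results.append(msg["url"])

def listener(q):
    """Observe the queue and hand each message to the pool."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        while True:
            msg = q.get()
            if msg is None:  # sentinel ends the listener
                break
            pool.submit(process_request, msg)
    # Exiting the 'with' block waits for all in-flight work to finish.

for i in range(8):
    message_queue.put({"url": f"https://site-{i}.example"})
message_queue.put(None)  # sentinel

listener(message_queue)
```

Because the pool is bounded, at most `POOL_SIZE` crawls run at once, which is what keeps the processing "controlled" even when the queue backs up.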
The entire application is divided into the three components below, listed along with the technologies used in each:
- Advisor-app: Java, Spring Boot, Spring Data, MySQL, and MongoDB
- Advisor-Msg: Spring Boot and RabbitMQ
- Advisor-AI: Python, Django, NLP, NER, and various AI algorithms