AdilAhmedunar/Internship_Project_on_Machine_Learning_Pipelines

Startup Acquisition Status Modeling Using Machine Learning Pipelines

Our project aims to analyze the financial circumstances of companies and their fundraising objectives.

Project Description

The project predicts the acquisition status of startups from features such as funding rounds, total funding amount, industry category, and geographic location. The objective is a machine learning model that classifies startups into one of four acquisition status categories: Operating, IPO, Acquired, or Closed. We treat this as a supervised learning problem and train on historical data of startups that were either acquired or closed. Using machine learning pipelines, we preprocess the data, select relevant features, and train the classifiers. The project uses the Python libraries scikit-learn, pandas, matplotlib, seaborn, joblib, and XGBoost for model development and evaluation. The goal is to provide insight into the factors that influence startup acquisition and to build a predictive tool that helps stakeholders make informed decisions.

Software Development Life Cycle (SDLC) Model

Agile Approach

Our project follows the Agile Software Development Life Cycle (SDLC) model, which is well-suited for iterative and collaborative projects like machine learning development. The Agile approach emphasizes flexibility, adaptability, and customer collaboration throughout the project lifecycle. Here's how we applied Agile principles in our project:

  1. Iterative Development: We embraced iterative development cycles to continuously refine and improve our machine learning models based on feedback and new insights gained during each iteration.

  2. Collaboration and Communication: Agile principles encouraged regular collaboration and communication among team members, enabling effective management of the project's complexity and ensuring alignment with stakeholders' expectations.

  3. Adaptability to Change: Agile's adaptive approach allowed us to respond quickly to changes in project requirements, data characteristics, and model performance, ensuring that our solutions remained relevant and effective.

  4. Instructor Feedback: We actively sought feedback from our mentor and incorporated it into our development process, ensuring that our machine learning models met their needs and expectations.

  5. Continuous Improvement: Agile principles fostered a culture of continuous improvement, prompting us to regularly reflect on our processes and outcomes, identify areas for enhancement, and implement changes to deliver higher-quality solutions.

By following the Agile SDLC model, we effectively managed the complexity and uncertainty inherent in machine learning projects, delivering valuable and robust solutions to predict startup acquisition status.

Implementation of Agile Practices

Throughout the project, we implemented various Agile practices, including:

  • Sprint Planning: We conducted regular sprint planning sessions to define the scope of work for each iteration and prioritize tasks based on their importance and complexity.
  • Daily Stand-up Meetings: We held daily stand-up meetings to discuss progress, identify obstacles, and coordinate efforts among team members.
  • Continuous Integration and Deployment: We employed continuous integration and deployment practices to ensure that changes to our machine learning models were integrated smoothly and deployed efficiently.
  • Iterative Testing: We performed iterative testing throughout the development process to validate the functionality and performance of our models and identify any issues early on.

Through the effective implementation of Agile practices, we were able to deliver a high-quality machine learning solution that met our project objectives and exceeded stakeholders' expectations.

Flow Chart

Dataset

This project utilizes a dataset containing industry trends, investment insights, and company information.

  • Format: JSON and Excel
  • Link to Raw Data: Excel file
  • Columns: id, entity_type, name, category_code, status, founded_at, closed_at, domain, homepage_url, twitter_username, funding_total_usd, country_code, state_code, city, region, etc.

Data Information:

  • Total Records: 196,553
  • Data Columns: 44
  • Data Types: Object, Integer, Float
  • Missing Values: Present in multiple columns
  • Data Size: Approximately 66.0 MB

This dataset serves as the foundation for building the machine learning model to predict the acquisition status of startups based on various features.

Data Preprocessing

The data preprocessing phase involved several steps, including:

  • Deleted columns providing excessive granularity such as 'region', 'city', 'state_code'
  • Removed redundant columns such as 'id', 'Unnamed: 0.1', 'entity_type'
  • Eliminated irrelevant features such as 'domain', 'homepage_url', 'twitter_username', 'logo_url'
  • Handled duplicate values
  • Removed columns with a high proportion of null values
  • Dropped rows with missing values in key columns such as 'status', 'country_code', 'category_code', 'founded_at'
  • Dropped time-based columns such as 'first_investment_at', 'last_investment_at', 'first_funding_at'
  • Imputed the remaining missing values with mean() for numerical columns and mode() for categorical columns; imputed columns include 'milestones', 'relationships', 'lat', 'lng'
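The steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a tiny invented frame, not the project's actual notebook code: the `preprocess` function, the `null_threshold` parameter, and all values are assumptions, and only a few of the README's column names are used.

```python
import numpy as np
import pandas as pd

# Invented sample rows; column names follow the README, values do not.
raw = pd.DataFrame({
    "id": ["c:1", "c:2", "c:2", "c:3", "c:4"],
    "entity_type": ["Company"] * 5,
    "city": ["Austin", "Pune", "Pune", "London", "Oslo"],
    "homepage_url": ["a.com", "b.in", "b.in", "c.uk", "d.no"],
    "first_funding_at": ["2008-01-01", "2011-01-01", "2011-01-01", None, None],
    "status": ["operating", "acquired", "acquired", None, "closed"],
    "country_code": ["USA", "IND", "IND", "GBR", "NOR"],
    "category_code": ["web", "mobile", "mobile", "web", "software"],
    "founded_at": ["2007-05-01", "2010-01-15", "2010-01-15", "2011-03-03", "2005-09-09"],
    "funding_total_usd": [1.5e6, 3.0e5, 3.0e5, 2.0e5, np.nan],
    "milestones": [2.0, 1.0, 1.0, 1.0, np.nan],
})


def preprocess(df, null_threshold=0.5):
    """Hypothetical helper mirroring the listed steps (threshold is assumed)."""
    # Drop overly granular, redundant, irrelevant, and time-based columns.
    drop_cols = ["id", "entity_type", "city", "homepage_url", "first_funding_at"]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Keep only columns whose share of nulls is at or below the threshold.
    df = df.loc[:, df.isna().mean() <= null_threshold]
    # Drop rows that lack a value in any key column.
    df = df.dropna(subset=["status", "country_code", "category_code", "founded_at"]).copy()
    # Impute what is left: mean for numeric columns, mode for the rest.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df.reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

On the real dataset the column lists would be the longer ones given above, and the null threshold would need to be chosen by inspecting the null counts.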

After preprocessing, the DataFrame has the following information:

  • Total columns: 11
  • Non-Null Count: 63585
  • Data types: float64(7), object(4)
  • Memory usage: 7.8 MB

Exploratory Data Analysis (EDA)

Univariate & Bivariate Analysis

The Univariate & Bivariate Analysis phases explored the distributions of individual variables and the relationships between them. Key visualizations and analyses conducted during these phases include:

  1. Visualization of the distribution of the Status column, which is the target variable, using a horizontal bar plot.
  2. Visualization of the distribution of Milestones using a histogram.
  3. Exploring the relationship between Status and Milestones using a violin plot.
  4. Visualization of the average funding amount by Status using a bar chart.
  5. Exploring the relationship between Status and Funding Total (USD) using a violin plot.

These visualizations provide insights into how different variables interact with each other and their potential impact on the target variable.
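As a rough sketch, the numbers behind these plots can be computed with pandas before any drawing is done. The sample values below are invented for illustration; only the column names come from the dataset description.

```python
import pandas as pd

# Invented sample; the real data has far more rows.
df = pd.DataFrame({
    "status": ["operating", "operating", "operating", "operating",
               "acquired", "acquired", "closed", "ipo"],
    "milestones": [1, 2, 0, 1, 3, 2, 0, 4],
    "funding_total_usd": [1.0e6, 2.0e6, 0.5e6, 0.5e6, 4.0e6, 6.0e6, 2.0e5, 5.0e7],
})

# Distribution of the target variable (data for the horizontal bar plot).
status_counts = df["status"].value_counts()

# Average funding per status (data for the bar chart).
avg_funding = (
    df.groupby("status")["funding_total_usd"].mean().sort_values(ascending=False)
)

# Per-status summary of milestones (what the violin plot visualizes).
milestone_summary = df.groupby("status")["milestones"].describe()

print(status_counts, avg_funding, milestone_summary, sep="\n\n")
```

With matplotlib installed, `status_counts.plot.barh()` draws the first chart directly from the computed series.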

Feature Engineering (FE)

  1. Feature Selection: We performed feature selection to choose the most relevant features for our analysis.
  2. Creation of New Features: We created new features from the existing dataset to enhance predictive power.
  3. Normalization and Scaling: We normalized and scaled numerical features to ensure consistency and comparability.
  4. Encoding Categorical Variables: We encoded categorical variables to represent them numerically for model training.
  5. Feature Engineering Documentation: We documented the entire feature engineering process for transparency and reproducibility.

Creation of New Features from Dataset

We conducted various operations to create new features:

  • Converted the 'founded_at' column to datetime format and extracted the year.
  • Mapped status values to isClosed values and created a new column.
  • Performed Min-Max scaling on selected numerical features.
  • Applied one-hot encoding to 'country_code' and 'category_code' columns.
  • Label encoded the 'status' column for binary classification.
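A minimal sketch of these operations on an invented four-row frame is shown below. The `founded_year` column name and the exact status-to-isClosed mapping are assumptions: the README does not state them, and the mapping here simply treats acquired and closed startups as 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "founded_at": ["2007-05-01", "2010-01-15", "2005-09-09", "2012-07-20"],
    "status": ["operating", "acquired", "closed", "ipo"],
    "country_code": ["USA", "IND", "USA", "GBR"],
    "category_code": ["web", "mobile", "web", "software"],
    "funding_total_usd": [1.5e6, 3.0e5, 7.0e5, 5.0e6],
    "milestones": [2.0, 1.0, 4.0, 3.0],
})

# Convert 'founded_at' to datetime and keep only the year.
df["founded_year"] = pd.to_datetime(df["founded_at"]).dt.year
df = df.drop(columns="founded_at")

# Map status to a binary flag (assumed mapping, see the note above).
is_closed_map = {"operating": 0, "ipo": 0, "acquired": 1, "closed": 1}
df["isClosed"] = df["status"].map(is_closed_map)

# Min-Max scale selected numerical features into the range [0, 1].
num_cols = ["funding_total_usd", "milestones"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# One-hot encode the two categorical columns.
df = pd.get_dummies(df, columns=["country_code", "category_code"], dtype=int)

# Label encode the target; LabelEncoder assigns integers in sorted order.
df["status"] = LabelEncoder().fit_transform(df["status"])

print(df.head())
```

In a real workflow the scaler and encoders would be fitted on the training split only and reused on the test split, which is one reason to wrap them in a pipeline.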

Feature Selection using Mutual Information (MI)

We computed mutual information between features and the target variable to identify top-ranked features for model training.
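One way to compute such a ranking is scikit-learn's `mutual_info_classif`. The sketch below uses synthetic data with one feature built to depend on the target and one pure-noise feature; the feature names and the choice to keep the top-ranked feature are illustrative assumptions, not the project's actual selection.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)  # synthetic binary target

X = pd.DataFrame({
    # Strongly tied to the target, so it should score high.
    "informative": y + rng.normal(0, 0.1, size=n),
    # Independent of the target, so it should score near zero.
    "noise": rng.normal(0, 1, size=n),
})

# Estimate mutual information between each feature and the target.
mi = mutual_info_classif(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)

# Keep the top-k features (k=1 here purely for illustration).
top_features = mi_scores.head(1).index.tolist()
print(mi_scores)
```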

After feature engineering, our dataset comp_df has changed substantially. It initially contained 11 columns (3 categorical and 8 numerical variables) and now has 32 columns, while keeping its original 4682 rows. All variables in comp_df have been converted to numerical format, so the DataFrame is ready for model building.

Model Building

Leading up to the Feature Engineering phase, individual interns diligently prepared their datasets to model startup acquisition statuses. After thorough experimentation and evaluation, three standout models emerged for collaborative refinement by the team.

In the capacity of TEAM C lead, I assumed responsibility for overseeing subsequent tasks until deployment. Initially, our team received directives to explore various models for both binary and multiclass classification:

  • For Binary Classification:
    • We explored Decision Trees.
    • We delved into the intricacies of Support Vector Machines (SVM).
  • For Multiclass Classification:
    • We investigated the applicability of Multinomial Naive Bayes.
    • We explored the potentials of Gradient Boosting.
    • We considered the robustness of Random Forest.
    • We examined the effectiveness of XGBoost.

Following exhaustive analysis and collective deliberation, we meticulously selected one model each for binary and multiclass classification. Our choices, prioritizing accuracy, were SVM for binary classification and XGBoost for multiclass classification.
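A comparison of this kind can be sketched with cross-validated accuracy. The example below runs on synthetic data from `make_classification` and covers only the two binary candidates; the hyperparameters, the 5-fold setup, and the scaling step in front of the SVM are assumptions rather than the team's recorded configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered startup features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    # SVMs are sensitive to feature scale, so scale inside the pipeline.
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The multiclass candidates can be compared the same way by adding them to the `candidates` dictionary and using a multiclass target.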

Model Evaluation

Each model underwent comprehensive evaluation, scrutinizing metrics such as accuracy, precision, recall, and F1-score. This evaluation process resulted in the creation of a detailed classification report for further analysis and refinement.
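The metrics named above are available in `sklearn.metrics`. The following sketch uses hand-written labels rather than real model output, so the printed numbers say nothing about the project's actual results.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Hand-written example labels, not real predictions.
y_true = ["operating", "operating", "acquired", "closed",
          "ipo", "operating", "acquired", "closed"]
y_pred = ["operating", "acquired", "acquired", "closed",
          "ipo", "operating", "operating", "closed"]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging weights every class equally, which matters when
# one status (for example Operating) dominates the data.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Per-class precision, recall, F1, and support in one text table.
report = classification_report(y_true, y_pred, zero_division=0)
print(f"accuracy={accuracy:.2f} macro_f1={f1:.2f}")
print(report)
```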

Machine Learning Pipelines Building

  1. Binary Classification Model:

    We have developed a binary classification model using Random Forest. This model predicts whether a startup will be acquired or not. It analyzes various features of the startup and determines the likelihood of acquisition.

  2. Multiclass Classification Model:

    Similarly, we have constructed a multiclass classification model using an XGBoost classifier. Unlike the binary model, this classifier predicts multiple classes of startup status: Operating, IPO, Acquired, or Closed. It evaluates various factors to categorize startups into these different status categories.

  3. Combining Pipelines:

    Our primary objective is to create three distinct pipelines:

    1. Binary Classification Pipeline:

      This pipeline will encapsulate the process of preparing data, training the Random Forest model, and making predictions on whether a startup will be acquired.

    2. Multiclass Classification Pipeline:

      Similarly, this pipeline will handle data preparation, model training using XGBoost, and predicting the status of startups (Operating, IPO, Acquired, or Closed).

    3. Combined Pipeline:

      The challenge lies in integrating these two models into a single pipeline. We must ensure that the output of the binary classifier is appropriately transformed to serve as input for the multiclass classifier. This combined pipeline will enable us to efficiently predict startup statuses.

  4. Testing and Evaluation:

    After constructing the combined pipeline, extensive testing will be conducted to validate its functionality and accuracy. We will employ various evaluation metrics to assess the performance of the pipeline, ensuring that it reliably predicts startup statuses.
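One possible way to chain the two models is sketched below: a custom transformer fits the binary classifier and appends its predicted probability as an extra feature for the multiclass stage. This is an illustration under stated assumptions, not the project's implementation. `BinaryProbabilityAppender` is a hypothetical helper, the data is synthetic, class index 2 is assumed to stand for Acquired, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost so the example needs no extra dependency.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class BinaryProbabilityAppender(TransformerMixin, BaseEstimator):
    """Hypothetical helper: append the binary model's P(positive) as a feature."""

    def __init__(self, estimator, positive_classes=(2,)):
        self.estimator = estimator
        self.positive_classes = positive_classes

    def fit(self, X, y):
        # Collapse the multiclass target into acquired (1) vs. not (0).
        y_binary = np.isin(y, self.positive_classes).astype(int)
        self.estimator_ = clone(self.estimator).fit(X, y_binary)
        return self

    def transform(self, X):
        proba = self.estimator_.predict_proba(X)[:, 1]
        return np.column_stack([X, proba])


# Synthetic 4-class data; labels 0..3 stand in for the four statuses.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

combined = Pipeline([
    ("scale", MinMaxScaler()),
    ("binary_stage", BinaryProbabilityAppender(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    # Stand-in for the XGBoost classifier described above.
    ("multiclass_stage", GradientBoostingClassifier(random_state=0)),
])

combined.fit(X_train, y_train)
pred = combined.predict(X_test)
print("test accuracy:", (pred == y_test).mean())
```

Note that the binary stage here produces in-sample probabilities during training, which can be over-optimistic; generating them out-of-fold (for example with `cross_val_predict`) is a common way to reduce that leakage.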

Deployment of Project - Django

Our deployed project leverages Django, a high-level web framework for Python, to provide a user-friendly interface for interacting with our machine-learning model. Users can now make predictions using the model through a web application, without needing to write any code.

With this deployment, we aim to democratize access to machine learning technology, empowering users from various backgrounds to harness the power of predictive analytics for their specific use cases. We have ensured that our deployed project is robust, scalable, and secure, providing a seamless experience for users while maintaining data privacy and integrity.

Thank you for joining us on this journey from development to deployment. We're excited to see how our project will impact the world of machine learning and beyond.

Contributions of the Team

| Name | Assigned Models | Contribution |
| --- | --- | --- |
| Adil Ahmed Unar | Decision Trees (Binary) and XGBoost (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training, Model Deployment, Report |
| Ashrith Komuravelly | Decision Trees (Binary) and Random Forest (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| B Kartheek | Decision Trees (Binary) and Multinomial Naïve Bayes (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| Charulatha | Support Vector Machines (Binary) and Gradient Boosting (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| Mayuri Sonawane | Support Vector Machines (Binary) and Random Forest (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| Pratik Santosh Akole | Decision Trees (Binary) and XGBoost (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| Shata Rupendra | Support Vector Machines (Binary) and Gradient Boosting (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |
| Vaibhavi Vijay | Support Vector Machines (Binary) and Multinomial Naïve Bayes (Multiclass) | Data Preprocessing, EDA, Feature Engineering, Model Training |

For the latest updates and contributions, please visit our GitHub repository: Your Repository
