# Mortgage Data Extraction System

This project implements a multi-agent system (MAS) using the Google Agent Development Kit (ADK) to automate the extraction of data from mortgage documents.

## Features

* **Document Processing:** Splits documents into manageable chunks.
* **Data Extraction:** Utilizes Google Cloud Document AI for Optical Character Recognition (OCR) and data extraction.
* **Data Validation:** Validates extracted data against a predefined schema.
* **Orchestration:** A `DocumentProcessorAgent` coordinates the workflow between different agents.
* **Deployment:** Includes scripts for deploying the system to Google Cloud Vertex AI.
* **Testing:** Comprehensive unit and integration tests for agents and tools.
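
The split, extract, validate flow described above can be sketched in plain Python, without the ADK. This is only an illustration under stated assumptions: the function names (`split_into_chunks`, `extract_fields`, `validate_record`, `run_pipeline`), the `key: value` line format, and the required field names are hypothetical, and the stand-in extractor replaces the Document AI call that the real system makes.

```python
from __future__ import annotations

# Hypothetical required fields; the real schema lives in models/extraction_schema.py.
REQUIRED_FIELDS = ("borrower_name", "loan_amount")


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group lines into chunks of roughly max_chars; an over-long line stays whole."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def extract_fields(chunk: str) -> dict[str, str]:
    """Stand-in extractor that parses 'key: value' lines (assumption, not Document AI)."""
    fields: dict[str, str] = {}
    for line in chunk.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def validate_record(record: dict[str, str]) -> list[str]:
    """Return error messages; an empty list means every required field is present."""
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]


def run_pipeline(text: str) -> tuple[dict[str, str], list[str]]:
    """Coordinate the three steps, as an orchestrating agent would."""
    record: dict[str, str] = {}
    for chunk in split_into_chunks(text):
        record.update(extract_fields(chunk))
    return record, validate_record(record)


if __name__ == "__main__":
    record, errors = run_pipeline("Borrower Name: Jane Doe\nLoan Amount: 250000")
    print(record, errors)
```

Keeping each step a separate function mirrors the agent and tool split in the project and lets each step be tested in isolation.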

## Project Structure

```
~/ias/
└── data_extraction_system/
    ├── __init__.py
    ├── main.py
    ├── agents/
    │   ├── __init__.py
    │   ├── document_processor_agent.py
    │   ├── data_extractor_agent.py
    │   └── validation_agent.py
    ├── tools/
    │   ├── __init__.py
    │   ├── document_splitter.py
    │   ├── ocr_engine.py
    │   └── data_validation.py
    ├── models/
    │   ├── __init__.py
    │   └── extraction_schema.py
    ├── deployment/
    │   ├── deploy.py
    │   └── test_deployment.py
    ├── tests/
    │   ├── __init__.py
    │   ├── test_document_processor_agent.py
    │   ├── test_data_extractor_agent.py
    │   ├── test_validation_agent.py
    │   ├── test_ocr_engine.py
    │   └── test_data_validation.py
    └── requirements.txt
```
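
As an illustration of how `models/extraction_schema.py` and `tools/data_validation.py` could fit together, here is a minimal sketch built on a standard-library dataclass. The `MortgageRecord` class, its fields, `NUMERIC_FIELDS`, and the `validate` function are all hypothetical; the actual schema in this repository may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from decimal import Decimal, InvalidOperation


@dataclass
class MortgageRecord:
    """Hypothetical schema; the field names are illustrative only."""
    borrower_name: str
    loan_amount: Decimal
    interest_rate: Decimal


# Fields that must parse as numbers (assumption for this sketch).
NUMERIC_FIELDS = {"loan_amount", "interest_rate"}


def validate(raw: dict[str, str]) -> tuple[MortgageRecord | None, list[str]]:
    """Convert raw extracted strings into a MortgageRecord, collecting all errors."""
    errors: list[str] = []
    values: dict[str, object] = {}
    for field in fields(MortgageRecord):
        text = (raw.get(field.name) or "").strip()
        if not text:
            errors.append(f"{field.name}: missing")
        elif field.name in NUMERIC_FIELDS:
            try:
                # Strip common formatting such as "$250,000.00" or "6.5%".
                number = Decimal(text.replace(",", "").strip("$% "))
            except InvalidOperation:
                errors.append(f"{field.name}: not a number ({text!r})")
            else:
                if not number.is_finite() or number < 0:
                    errors.append(f"{field.name}: must be a non-negative number")
                values[field.name] = number
        else:
            values[field.name] = text
    if errors:
        return None, errors
    return MortgageRecord(**values), []
```

Returning every error at once, rather than raising on the first one, makes it easier to report all problems found in a document in a single pass.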

## Setup and Installation

1. **Clone the Repository:**

    ```bash
    git clone <repository-url>
    cd ias/data_extraction_system
    ```

2. **Install Dependencies:**

    ```bash
    pip install -r requirements.txt
    ```

3. **Google Cloud Setup:**
    * **Enable APIs:** Ensure the following Google Cloud APIs are enabled for your project:
        * Vertex AI API
        * Document AI API
    * **Authentication:** Set up your Google Cloud credentials. The recommended way is to use Application Default Credentials (ADC) by running:

        ```bash
        gcloud auth application-default login
        ```

    * **Environment Variables:** Create a `.env` file in the `data_extraction_system/` directory with the following content:

        ```
        GOOGLE_CLOUD_PROJECT_ID=your-gcp-project-id
        # Document AI location, e.g., us or eu
        DOCUMENT_AI_LOCATION=your-document-ai-location
        DOCUMENT_AI_PROCESSOR_ID=your-document-ai-processor-id
        SAMPLE_DOCUMENT_PATH=path/to/your/sample/document.pdf
        # Vertex AI region, e.g., us-central1
        VERTEX_AI_LOCATION=your-vertex-ai-location
        ```

        * Replace `your-gcp-project-id`, `your-document-ai-location`, `your-document-ai-processor-id`, `path/to/your/sample/document.pdf`, and `your-vertex-ai-location` with your actual values.
        * Create a Document AI processor (for example, a Form Parser or Document OCR processor) and copy its processor ID into `DOCUMENT_AI_PROCESSOR_ID`.
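
To show how these variables might be consumed, the sketch below reads them from the environment and derives the Document AI processor resource name and regional endpoint. The `Settings` class and `load_settings` function are hypothetical helpers, not part of this repository, and the sketch assumes the `.env` file has already been loaded into the process environment (for example with the `python-dotenv` package).

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT_ID",
    "DOCUMENT_AI_LOCATION",
    "DOCUMENT_AI_PROCESSOR_ID",
    "SAMPLE_DOCUMENT_PATH",
    "VERTEX_AI_LOCATION",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    project_id: str
    document_ai_location: str
    processor_id: str
    sample_document_path: str
    vertex_ai_location: str

    @property
    def processor_name(self) -> str:
        # Resource name format used by the Document AI API.
        return (
            f"projects/{self.project_id}/locations/{self.document_ai_location}"
            f"/processors/{self.processor_id}"
        )

    @property
    def document_ai_endpoint(self) -> str:
        # Document AI serves each location from its own endpoint.
        return f"{self.document_ai_location}-documentai.googleapis.com"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail early, listing every missing variable, instead of at first use."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        project_id=env["GOOGLE_CLOUD_PROJECT_ID"],
        document_ai_location=env["DOCUMENT_AI_LOCATION"],
        processor_id=env["DOCUMENT_AI_PROCESSOR_ID"],
        sample_document_path=env["SAMPLE_DOCUMENT_PATH"],
        vertex_ai_location=env["VERTEX_AI_LOCATION"],
    )
```

Checking all variables at startup surfaces a misconfigured `.env` file before any Google Cloud call is attempted.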

## Running the System

1. **Execute the Main Script:**

    ```bash
    python data_extraction_system/main.py
    ```

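    This starts the data extraction process on the document specified in `SAMPLE_DOCUMENT_PATH`. The script path is relative to the repository root (`ias/`), so run the command from there rather than from inside `data_extraction_system/`.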
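
The OCR step presumably sends the sample document to the configured Document AI processor. A minimal sketch of such a call with the `google-cloud-documentai` client library follows. The `process_file` and `guess_mime_type` helpers are illustrative and are not taken from `ocr_engine.py`; the client library is imported inside the function so the module can be loaded without it installed.

```python
import mimetypes
from pathlib import Path


def guess_mime_type(path: str) -> str:
    """Infer the MIME type from the file extension; Document AI requires one."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {path!r}")
    return mime_type


def process_file(path: str, project_id: str, location: str, processor_id: str) -> str:
    """Send one local file to a Document AI processor and return the extracted text."""
    # Imported lazily: requires `pip install google-cloud-documentai`.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    # The API endpoint must match the processor's location.
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    name = client.processor_path(project_id, location, processor_id)
    raw_document = documentai.RawDocument(
        content=Path(path).read_bytes(),
        mime_type=guess_mime_type(path),
    )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    return result.document.text
```

Calling `process_file` requires valid Application Default Credentials and an existing processor, so it is best tried only after the Google Cloud setup steps are complete.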

## Running Tests

1. **Run Unit Tests:**

    ```bash
    python -m unittest discover data_extraction_system/tests
    ```
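
For orientation, `unittest discover` collects `unittest.TestCase` subclasses from files matching `test*.py`. The self-contained sketch below shows the general style of such a test with the OCR dependency mocked. The `extract_text` function and the `run_ocr` method are hypothetical stand-ins, not the repository's actual code.

```python
import unittest
from unittest.mock import Mock


def extract_text(engine, path: str) -> str:
    """Hypothetical function under test: delegates OCR to an engine object."""
    text = engine.run_ocr(path)
    if not text.strip():
        raise ValueError(f"No text extracted from {path}")
    return text.strip()


class ExtractTextTest(unittest.TestCase):
    def test_returns_stripped_text(self):
        # The mock replaces the real OCR engine, so no network call is made.
        engine = Mock()
        engine.run_ocr.return_value = "  Loan Amount: 250000\n"
        self.assertEqual(extract_text(engine, "doc.pdf"), "Loan Amount: 250000")
        engine.run_ocr.assert_called_once_with("doc.pdf")

    def test_empty_output_raises(self):
        engine = Mock()
        engine.run_ocr.return_value = "   "
        with self.assertRaises(ValueError):
            extract_text(engine, "doc.pdf")
```

Mocking the engine keeps the test fast and independent of Google Cloud credentials.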

## Deployment to Vertex AI

1. **Deploy the Model:**
    The `data_extraction_system/deployment/deploy.py` script provides a placeholder for deploying your trained model to Vertex AI. You will need to:
    * Train a model using Vertex AI or another platform.
    * Upload the trained model to Vertex AI Model Registry.
    * Modify the `deploy.py` script to reference your uploaded model and configure the endpoint settings.

    ```bash
    python data_extraction_system/deployment/deploy.py
    ```

2. **Test Deployment:**
    The `data_extraction_system/deployment/test_deployment.py` script tests the deployment script's functionality with the actual deployment process mocked.

    ```bash
    python data_extraction_system/deployment/test_deployment.py
    ```
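
For step 1, once a model has been uploaded to the Vertex AI Model Registry, the deployment call could look roughly like the sketch below, which uses the `google-cloud-aiplatform` SDK. The `deploy_model` and `build_deploy_kwargs` functions, the default machine type, and the replica counts are assumptions for illustration; they are not taken from `deploy.py`.

```python
import os


def build_deploy_kwargs(
    machine_type: str = "n1-standard-2",
    min_replicas: int = 1,
    max_replicas: int = 1,
) -> dict:
    """Assemble endpoint settings; the defaults here are illustrative only."""
    if min_replicas < 1 or max_replicas < min_replicas:
        raise ValueError("replica counts must satisfy 1 <= min <= max")
    return {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
    }


def deploy_model(model_id: str) -> str:
    """Deploy an already-uploaded model and return the endpoint resource name."""
    # Imported lazily: requires `pip install google-cloud-aiplatform`.
    from google.cloud import aiplatform

    aiplatform.init(
        project=os.environ["GOOGLE_CLOUD_PROJECT_ID"],
        location=os.environ["VERTEX_AI_LOCATION"],
    )
    # model_id is the numeric ID or full resource name from the Model Registry.
    model = aiplatform.Model(model_name=model_id)
    endpoint = model.deploy(**build_deploy_kwargs())
    return endpoint.resource_name
```

Separating the settings from the SDK call makes the configuration easy to check in a test without creating any Vertex AI resources.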

## Important Notes

* **Document AI Processor:** Ensure you have a Document AI processor set up and its ID is correctly configured in the `.env` file. The type of processor (e.g., Form Parser, Document OCR) will influence the extraction capabilities.
* **Document Format:** `document_splitter.py` currently assumes text-based input. To handle PDF or image-based documents, integrate a PDF parsing library (such as `pypdf`, the maintained successor to `PyPDF2`) and make sure `ocr_engine.py` is configured for those formats.
* **Error Handling:** The provided code includes only basic error handling. For production use, add more robust error handling and logging.
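
One common way to harden calls to external services such as Document AI is to retry transient failures with exponential backoff and log each attempt. The sketch below uses only the standard library. The `with_retries` helper, its parameters, and the logger name are illustrative, and which exception types count as transient depends on the client library in use.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("data_extraction_system")


def with_retries(
    func: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            # The delay doubles after each failure: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can verify the backoff schedule without actually waiting.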
