NSA Data Scraping & Extraction Scripts
The scripts/nsa directory contains a suite of Python scripts designed for scraping, processing, and storing
document data from the NSA archive.
Below is an overview of the purpose, requirements, and usage of each script.
Requirements
- MongoDB: The scripts rely on a MongoDB instance to store and manage the scraped data. The scripts use the nsa database under the provided URI. You can specify the URI of the MongoDB instance with the --db-uri argument of the scripts or by setting the DB_URI environment variable.
- Proxy: A working proxy is required for certain scripts to download data reliably. Provide the proxy address in the following format: http://user:password@address:port.
- Data Storage Path: Configure NSA_DATA_PATH in the juddges.settings module to specify the directory for storing scraped and processed data.
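The configuration rules above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the helper names `resolve_db_uri` and `is_valid_proxy` are hypothetical, and the precedence of `--db-uri` over `DB_URI` is an assumption. Only the flag name, the variable name, and the proxy format come from the requirements.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit


def resolve_db_uri(cli_value: Optional[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Return the MongoDB URI from --db-uri, falling back to DB_URI."""
    env = os.environ if env is None else env
    # Assumption: an explicit --db-uri value wins over the environment variable.
    uri = cli_value or env.get("DB_URI")
    if not uri:
        raise ValueError("Provide --db-uri or set the DB_URI environment variable")
    return uri


def is_valid_proxy(address: str) -> bool:
    """Check that the address looks like http://user:password@address:port."""
    parts = urlsplit(address)
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return False
    return (
        parts.scheme == "http"
        and bool(parts.username)
        and bool(parts.password)
        and bool(parts.hostname)
        and port is not None
    )
```

Passing the environment as a parameter keeps the helper easy to check without touching the real process environment.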
Quick Start
The recommended way to run the complete scraping pipeline is using the full_procedure.py script:
full_procedure.py
- Purpose: Orchestrates the entire data scraping and processing pipeline by running all necessary scripts in the correct order. It handles retries for critical steps and lets you execute the complete workflow with a single command. Using cleanup iterations is recommended: they remove duplicates and re-scrape data that was not scraped correctly because of errors in a previous step. The scripts reuse data already stored in the database, so you do not need to set a start date.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --proxy-address | Proxy address for scraping (required) | |
| --db-uri | MongoDB URI (required) | |
| --start-date | Start date for scraping (YYYY-MM-DD) | 1981-01-01 |
| --end-date | End date for scraping (YYYY-MM-DD) | Yesterday's date in Poland |
| --n-jobs | Number of parallel workers | 25 |
| --scrap-dates-iterations | Number of iterations to scrape dates | 1 |
| --cleanup-iterations | Number of cleanup iterations to perform | 1 |
| --log-file | Path for the log file | {PROJECT_ROOT}/logs/nsa/full_procedure_YYYYMMDD_HHMMSS.log |
| --find-remove-changed-document-lists-iterations | Number of iterations to find and remove changed document lists | 0 |
| --redownload-days-back | Days back to redownload pages from. Set to 0 to disable redownloading. | 730 (2 years) |
- Pipeline Steps:
  - Runs find_remove_changed_document_lists.py to find and remove changed document lists. (Disabled by default; use --find-remove-changed-document-lists-iterations to enable.)
  - Runs scrap_documents_list.py to get the initial document list
  - For each cleanup iteration:
    - Runs drop_dates_with_duplicated_documents.py to remove duplicates
    - Re-runs scrap_documents_list.py to update the document list
  - Runs drop_docs_to_redownload.py to remove documents from the given number of days back so that they are downloaded again. Set redownload_days_back to 0 to disable this step.
  - Runs download_document_pages.py to fetch document pages
  - For each cleanup iteration:
    - Runs drop_duplicated_document_pages.py to remove duplicate pages
    - Re-runs download_document_pages.py to update pages
  - Runs final processing:
    - save_pages_from_db_to_file.py to export pages to files
    - extract_data_from_pages.py to process the data
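The ordering of the pipeline steps can be sketched as a small function that returns the script names in execution order. This is a hedged illustration rather than the actual logic of full_procedure.py: `build_pipeline_steps` is a hypothetical name, running the optional first step once per iteration is an assumption, and retries and --scrap-dates-iterations are left out.

```python
from typing import List


def build_pipeline_steps(
    cleanup_iterations: int = 1,
    find_remove_iterations: int = 0,
    redownload_days_back: int = 730,
) -> List[str]:
    """Return script names in the order described in the pipeline steps."""
    steps: List[str] = []
    # Optional step: skipped entirely when the iteration count is 0.
    steps += ["find_remove_changed_document_lists.py"] * find_remove_iterations
    steps.append("scrap_documents_list.py")
    for _ in range(cleanup_iterations):
        steps += ["drop_dates_with_duplicated_documents.py", "scrap_documents_list.py"]
    # A value of 0 disables the redownload step.
    if redownload_days_back > 0:
        steps.append("drop_docs_to_redownload.py")
    steps.append("download_document_pages.py")
    for _ in range(cleanup_iterations):
        steps += ["drop_duplicated_document_pages.py", "download_document_pages.py"]
    # Final processing: export pages, then extract structured data.
    steps += ["save_pages_from_db_to_file.py", "extract_data_from_pages.py"]
    return steps
```

With the default values, the list starts with scrap_documents_list.py and ends with extract_data_from_pages.py, and each cleanup script is always followed by a re-run of the script whose data it cleaned.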
Script Descriptions and Order of Execution
0. find_remove_changed_document_lists.py
- Purpose: Finds and removes changed document lists in the dates collection in MongoDB. It acquires the number of documents for each date and saves it to the dates_num_docs collection. It then compares the number of documents for each date with the previously recorded number and, if the number has changed, removes the date from the dates collection. Run this script only when you want to update the dataset fully. Do not run it in the middle of the pipeline, as it will remove newly scraped document lists. To enable this step in full_procedure.py, set find_remove_changed_document_lists_iterations to a positive number, e.g. 1. It is disabled by default, i.e. set to 0.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --proxy-address | Proxy address for scraping (required) | |
| --db-uri | MongoDB URI (required) | |
| --start-date | Start date for scraping (YYYY-MM-DD) | 1981-01-01 |
| --end-date | End date for scraping (YYYY-MM-DD) | Yesterday's date in Poland |
| --min-interval-between-checks | Minimum interval between checks in days | 30 |
| --num-elements-to-check | Number of elements to check for each date | 3 |
| --max-checks-per-date | Maximum number of checks per date if the number of documents is the same | 3 |
| --n-jobs | Number of parallel workers | 30 |
| --log-file | Path for the log file (None to disable) | None |
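The comparison described in the purpose of this script can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions: `find_changed_dates` is a hypothetical helper, the mappings stand in for the dates_num_docs collection and a fresh count from the website, and skipping dates that have no stored count is an assumption.

```python
from typing import Dict, List


def find_changed_dates(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Return dates whose document count differs from the stored count."""
    changed = []
    for date, count in current.items():
        # Assumption: dates with no stored count are new, not changed, so skip them.
        if date in previous and previous[date] != count:
            changed.append(date)
    return sorted(changed)
```

The returned dates are the ones that would be removed from the dates collection so that their document lists are scraped again.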
1. scrap_documents_list.py
- Purpose: Scrapes a list of documents from the NSA website for a specified date range.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --proxy-address | Proxy address for scraping (required). | |
| --db-uri | MongoDB URI. | |
| --start-date | Start date for scraping (YYYY-MM-DD). | 1981-01-01 |
| --end-date | End date for scraping (YYYY-MM-DD). The last day is included. | Yesterday's date in Poland |
| --n-jobs | Number of parallel workers. | 30 |
| --log-file | Path for the log file (None to disable) | None |
- Output: Saves the scraped document list in MongoDB (dates collection) and as documents.json in data/datasets/nsa.
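The date handling described in the arguments table (an inclusive end date that defaults to yesterday in Poland) can be sketched with the standard library. The function names are hypothetical, and using the Europe/Warsaw zone to represent "Poland" is an assumption about how the script computes its default.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional
from zoneinfo import ZoneInfo


def yesterday_in_poland(now: Optional[datetime] = None) -> date:
    """Return yesterday's calendar date in the Europe/Warsaw time zone."""
    warsaw = ZoneInfo("Europe/Warsaw")
    # An explicit timezone-aware `now` can be passed in, e.g. for testing.
    now = datetime.now(warsaw) if now is None else now.astimezone(warsaw)
    return now.date() - timedelta(days=1)


def inclusive_dates(start: date, end: date) -> List[date]:
    """List every date from start to end, including the end date."""
    days = (end - start).days
    # range(days + 1) makes the end date part of the result.
    return [start + timedelta(days=i) for i in range(days + 1)]
```

A call such as `inclusive_dates(date(1981, 1, 1), yesterday_in_poland())` would cover the default range from the table.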
2. (optional) drop_dates_with_duplicated_documents.py
- Purpose: Removes duplicate document entries from the dates collection in MongoDB. If documents are duplicated, the script removes the whole dates for which duplicates are found.
- Note: You should rerun scrap_documents_list.py after running this script to update documents.json.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --db-uri | MongoDB URI. | |
| --log-file | Path for the log file (None to disable) | None |
- Output: Cleans up the dates collection by deleting dates with duplicate document entries.
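The idea of dropping whole dates that contain duplicated documents can be sketched as follows. This is an illustration only: `dates_with_duplicates` is a hypothetical helper, the input shape (a mapping from date to document IDs) is assumed, and treating an ID as duplicated when it occurs more than once anywhere in the data is an assumption about the script's definition of a duplicate.

```python
from collections import Counter
from typing import Dict, List


def dates_with_duplicates(docs_by_date: Dict[str, List[str]]) -> List[str]:
    """Return every date that holds at least one duplicated document ID."""
    # Count how often each document ID occurs across all dates.
    counts = Counter(doc_id for ids in docs_by_date.values() for doc_id in ids)
    return sorted(
        date
        for date, ids in docs_by_date.items()
        if any(counts[doc_id] > 1 for doc_id in ids)
    )
```

Every date in the result would be deleted in full, which is why the document list has to be scraped again afterwards.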
3. (optional) drop_docs_to_redownload.py
- Purpose: Removes documents from the given number of days back so that they are downloaded again. This is needed because some documents change after they are first downloaded. To enable this step in full_procedure.py, set redownload_days_back to a positive number, e.g. 720; to disable it, set it to 0.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --db-uri | MongoDB URI. | |
| --redownload-days-back | Days back to redownload pages from. | 720 (2 years) |
| --log-file | Path for the log file (None to disable) | None |
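The selection of documents to download again can be sketched as a cutoff-date filter. The helper name `docs_to_redownload` and the input shape are hypothetical, and treating the cutoff day itself as inside the window is an assumption.

```python
from datetime import date, timedelta
from typing import Dict, List


def docs_to_redownload(doc_dates: Dict[str, date], days_back: int, today: date) -> List[str]:
    """Return IDs of documents dated within the last `days_back` days."""
    # A value of 0 disables the step, so nothing is selected.
    if days_back <= 0:
        return []
    cutoff = today - timedelta(days=days_back)
    # Assumption: documents dated exactly on the cutoff day are included.
    return sorted(doc_id for doc_id, d in doc_dates.items() if d >= cutoff)
```

The selected documents would be removed from the database so that download_document_pages.py fetches them again on its next run.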
4. download_document_pages.py
- Purpose: Downloads document pages (raw HTML) using IDs retrieved from the documents.json file.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --proxy-address | Proxy address for scraping (required). | |
| --db-uri | MongoDB URI. | |
| --n-jobs | Number of parallel workers. | 25 |
| --log-file | Path for the log file (None to disable) | None |
- Output: Stores downloaded pages in the document_pages collection in MongoDB. Errors are stored in the document_pages_errors collection.
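The split between downloaded pages and recorded errors, together with the parallel workers, can be sketched as below. This is not the script's implementation: `download_all` and the `fetch` callable are hypothetical, the use of a thread pool is an assumption, and the two returned dictionaries merely stand in for the document_pages and document_pages_errors collections.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def download_all(
    doc_ids: Iterable[str],
    fetch: Callable[[str], str],
    n_jobs: int = 25,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch pages in parallel; return (pages, errors) keyed by document ID."""
    pages: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Map each future back to its document ID.
        futures = {pool.submit(fetch, doc_id): doc_id for doc_id in doc_ids}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                pages[doc_id] = future.result()
            except Exception as exc:
                # A failed download is recorded instead of stopping the run.
                errors[doc_id] = repr(exc)
    return pages, errors
```

Keeping failures in a separate mapping means one bad document does not abort the other downloads, and the failed IDs remain available for a later retry.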
5. (optional) drop_duplicated_document_pages.py
- Purpose: Identifies and removes duplicate pages from the document_pages collection in MongoDB.
- Note: You should rerun download_document_pages.py after running this script to update the collection.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --db-uri | MongoDB URI. | |
| --log-file | Path for the log file (None to disable) | None |
- Output: Cleans up the document_pages collection by deleting duplicate pages.
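One common way to identify duplicates in a MongoDB collection is an aggregation that groups by an identifier and keeps the groups with more than one member. The sketch below only builds such a pipeline as plain Python data. The field name `doc_id` is hypothetical, and nothing here is confirmed to match how the script detects duplicates.

```python
from typing import Any, Dict, List


def duplicate_pages_pipeline(id_field: str = "doc_id") -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that lists IDs stored more than once."""
    return [
        # Group pages by the identifier field and count each group.
        {"$group": {"_id": f"${id_field}", "count": {"$sum": 1}}},
        # Keep only identifiers that occur more than once.
        {"$match": {"count": {"$gt": 1}}},
    ]
```

With pymongo, a pipeline like this could be passed to the `aggregate` method of the document_pages collection to list the duplicated identifiers before deleting them.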
6. save_pages_from_db_to_file.py
- Purpose: Exports document pages and errors from MongoDB to Parquet files for further processing.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --db-uri | MongoDB URI. | |
| --log-file | Path for the log file (None to disable) | None |
- Output: Saves pages to pages/pages_chunk_*.parquet in data/datasets/nsa.
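The chunked output naming can be illustrated without any Parquet library. This sketch only splits records into chunks and pairs each chunk with a file name. `chunk_pages` is a hypothetical helper, and the zero-based index used in place of the `*` in the pattern is an assumption, as is the chunk size.

```python
from typing import Iterator, List, Sequence, Tuple


def chunk_pages(pages: Sequence[dict], chunk_size: int) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (relative file name, records) pairs for fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, start in enumerate(range(0, len(pages), chunk_size)):
        # Assumption: the `*` in pages_chunk_*.parquet is a running index.
        name = f"pages/pages_chunk_{index}.parquet"
        yield name, list(pages[start:start + chunk_size])
```

Each yielded pair could then be written with a Parquet writer of choice. Writing in chunks keeps memory use bounded when the collection is large.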
7. extract_data_from_pages.py
- Purpose: Extracts structured data from downloaded document pages.
- Usage:
- Arguments:
| Argument | Description | Default |
|---|---|---|
| --n-jobs | Number of parallel workers. | 10 |
| --log-file | Path for the log file (None to disable) | None |
- Output: Saves processed data in Parquet files within NSA_DATA_PATH/dataset.