Handling very large files with openpyxl in Python

Excel workbooks can grow very large, with many rows and columns. Loading such a file into memory all at once can consume a great deal of RAM and, in the worst case, make the Python program run out of memory. The openpyxl library offers optimized modes for this situation. In this tutorial, we will learn to handle large Excel files with openpyxl in Python.

Loading large files using the read-only mode

Openpyxl is a library used to read and write Excel files in Python. By default, it keeps the whole workbook in memory, which becomes impractical for extremely large files. For such cases, openpyxl provides two optimized modes, read-only and write-only, which let us read or write large files with nearly constant memory consumption.
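One of those two modes is write-only mode, created with Workbook(write_only=True). As a minimal sketch, assuming a temporary output file and made-up row values (the file name large_output.xlsx and the columns row_id and value are only illustrative), writing many rows could look like this:

```python
import os
import tempfile

from openpyxl import Workbook

# Hypothetical output path, chosen only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "large_output.xlsx")

# write_only=True writes rows out as they are appended
# instead of keeping every cell in memory.
wb = Workbook(write_only=True)
ws = wb.create_sheet(title="data")  # a write-only workbook starts with no sheets

ws.append(["row_id", "value"])      # header row
for i in range(10_000):
    ws.append([i, i * 2])           # rows can only be added with append()

wb.save(out_path)                   # a write-only workbook can be saved only once
```

In this mode, cells cannot be read back or modified after they are appended, which is the trade-off for the low memory use.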

In this case, our dataset contains 1192 rows and is about 1 MB in size. A file of this size would also load in the normal mode, but it is enough to demonstrate the technique, which applies unchanged to much larger files. We open the workbook in read_only mode and print every cell of every row in the active worksheet.

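from openpyxl import load_workbook

# read_only=True reads rows lazily instead of loading the whole sheet into memory
wb = load_workbook(filename='dataset.xlsx', read_only=True)
ws = wb.active

for row in ws.rows:
    for cell in row:
        print(cell.value)

# A read-only workbook keeps the file open, so close it explicitly
wb.close()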

Output:

https://www.udemy.com/trading-forex-lanalyse-des-volumes/
70
4
3
13
All Levels
0.97
2
2017-06-29T21:32:21Z
Business Finance
1239206
Win 90% of Trades; Quantitative Investments Models in Excel
https://www.udemy.com/quantitative-investments/
200
4
1
16
All Levels
0.97
1.5
2017-06-28T15:09:21Z
Business Finance
625670
Technical Charting
https://www.udemy.com/technical-charting/
20
4
1
6
All Levels
0.97
1
2015-11-02T20:23:54Z
Business Finance
1153854
Working Capital assessment for bankers, credit analysts
https://www.udemy.com/working-capital-assessment-for-bankers-credit-analysts/
20
4
0
29
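Printing every cell is fine for a quick look, but with a large file we usually want to compute something from the rows instead. The sketch below is one way to do that, not the only one. To stay runnable on its own, it first builds a small stand-in file (sample.xlsx, with hypothetical course_title and price columns) and then reads it in read-only mode using iter_rows with values_only=True, which yields plain tuples of values rather than cell objects.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Build a small stand-in file so the sketch runs on its own;
# the column names and values are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.xlsx")
tmp = Workbook()
tmp.active.append(["course_title", "price"])
tmp.active.append(["Course A", 70])
tmp.active.append(["Course B", 200])
tmp.active.append(["Course C", 20])
tmp.save(path)

wb = load_workbook(filename=path, read_only=True)
ws = wb.active

total_price = 0
row_count = 0
# min_row=2 skips the header row; each row arrives as a tuple of values.
for title, price in ws.iter_rows(min_row=2, values_only=True):
    total_price += price
    row_count += 1

wb.close()  # release the file handle held by the read-only workbook
print(row_count, total_price)
```

Because only one row is held at a time, the same loop should work on a much larger workbook without a matching growth in memory use.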


Slicing and extending data from huge Excel files using openpyxl

We import the range_boundaries function, which converts a range string such as 'A1:E3' into a tuple of cell boundaries: (min_col, min_row, max_col, max_row). Next, we use the iter_rows() function to iterate over the cells of each row within those boundaries. For each row, we first take the values of the columns up to column E by slicing row_cells. We then use the list's extend method to append a second slice to the same list, this time the values from column C to the end of the row (index 2 onward, since indexing starts at 0). Lastly, we print the resulting data for each row.

from openpyxl import load_workbook
from openpyxl.utils import range_boundaries

wb = load_workbook(filename='dataset.xlsx')
ws = wb.active

# Convert the range string into 1-based numeric boundaries: (1, 1, 5, 3)
min_col, min_row, max_col, max_row = range_boundaries('A1:E3')

for row_cells in ws.iter_rows(min_col=min_col, min_row=min_row,
                              max_col=max_col, max_row=max_row):

    # Slice column values up to E (the first five columns)
    data = [cell.value for cell in row_cells[:5]]

    # Extend data with sliced column values from C (index 2) up to the end
    data.extend([cell.value for cell in row_cells[2:]])

    print(data)

Output:

['course_id', 'course_title', 'url', 'price', 'num_subscribers', 'url', 'price', 'num_subscribers']
[49798, 'Bitcoin or How I Learned to Stop Worrying and Love Crypto', 'https://www.udemy.com/bitcoin-or-how-i-learned-to-stop-worrying-and-love-crypto/', 0, 65576, 'https://www.udemy.com/bitcoin-or-how-i-learned-to-stop-worrying-and-love-crypto/', 0, 65576]
[48841, 'Accounting in 60 Minutes - A Brief Introduction', 'https://www.udemy.com/accounting-in-60-minutes-a-brief-introduction/', 0, 56659, 'https://www.udemy.com/accounting-in-60-minutes-a-brief-introduction/', 0, 56659]
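To make the index arithmetic easier to check, here is a small self-contained sketch. It assumes an in-memory stand-in sheet with made-up values rather than dataset.xlsx, and it uses get_column_letter from openpyxl.utils to show which column a slice index corresponds to.

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter, range_boundaries

# range_boundaries turns an A1-style range into 1-based numeric bounds.
min_col, min_row, max_col, max_row = range_boundaries("A1:E3")
bounds = (min_col, min_row, max_col, max_row)  # (1, 1, 5, 3)

# In-memory stand-in sheet with made-up values, six columns wide.
wb = Workbook()
ws = wb.active
ws.append(["a", "b", "c", "d", "e", "f"])
ws.append([1, 2, 3, 4, 5, 6])

sliced = []
for row_cells in ws.iter_rows(min_col=min_col, min_row=min_row,
                              max_col=max_col, max_row=2):
    # row_cells is a tuple, so index 2 is the third column (C).
    sliced.append([cell.value for cell in row_cells[2:]])

# Slice index 2 counted from min_col maps to column letter "C".
first_sliced_column = get_column_letter(min_col + 2)
print(bounds, sliced, first_sliced_column)
```

Column F never appears in the result because max_col limits iter_rows to column E before any slicing happens.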

This brings us to the end of this tutorial on how to handle large files in Python using openpyxl. To learn more about openpyxl, see: How to append data in Excel using openpyxl in Python
