-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathepilog.tex
74 lines (61 loc) · 4.91 KB
/
epilog.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
\chapter*{Conclusion}
\addcontentsline{toc}{chapter}{Conclusion}
The aim of this thesis was to design and implement a code-analysis tool for the Pandas library that would be capable of
checking common kinds of errors such as access to misspelled or non-existent columns.
The resulting system should have been evaluated through a set of realistic case studies.
Let us recap and summarize to what extent we did that.
We first defined a framework for the Abstract Interpretation analysis of the Dataframe and Series type and then expanded it
to other types such as strings, numbers, lists, etc.
Then we used the framework to implement the Pandalyzer.
The Pandalyzer is a code-analysis tool that uses Abstract Interpretation to interpret Python code with the Pandas
library.
The Pandalyzer is capable of detecting errors such as access to non-existent column, operation on incompatible types,
incorrect function arguments, operations leading to an incorrect state.
It also reports the structure of the output CSV files and is able to accept information about the structure of the input CSV files.
The Pandalyzer is able to work with some amount of uncertainty generated from the user input.
The supported functions include merge, groupby, drop, rename, read\_csv, to\_csv, concat, Dataframe creation,
Series creation, the subscript operator in both get and set contexts and vectorized sums and products, aggregation function
such as mean, sum, first, last, count or head.
As shown in the case studies in the chapter~\ref{ch:evaluation-of-solution}, this set of functions already gives
the user enough flexibility to do many various data manipulation tasks.
Moreover, the Pandalyzer is highly extensible, so implementing support for other Pandas functions is possible.
On the other hand, the Pandalyzer has some limitations.
It now only supports a subset of Python language, so the analyzer will not be able to proceed with the analysis when
it encounters unknown Python construct.
However, it is planned in the Future Work~\ref{sec:future-work} to add support for the rest of the Python language.
The source-code of Pandalyzer can be found in the Pandalyzer GitHub repository~\cite{pandalyzer}, and the source-code
for this thesis can be found in the bachelor-thesis GitHub repository~\cite{bachelor-thesis}.
\section*{Related Work}
\addcontentsline{toc}{section}{Related Work}
The idea to use Abstract Interpretation to analyze programs working with Dataframes was already proposed by~Yungyu~Zhuang
and~Ming-Yang~Lu~\cite{Zhuang:2022:TypeChecking}.
They also created a proof of concept implementation named PDChecker.
However, PDChecker does not have as many checks, does not report output CSV files (via to\_csv() function) and
does not allow for interpreting multiple branches of if-statements and other sources of non-determinism.
Another related subject is the type hints~\cite{python_typing} in Python language.
The language itself does not enforce any typing rules, but we can put type annotations in the code, and analyzers
integrated in the IDE can warn us in case that the types do not match.
Unfortunately, this is not useful for our case, since the type annotations are only able to express that
(for example) the function returns a Pandas Dataframe, not a Pandas Dataframe with a specific column structure.
The Types for Tables~\cite{types_for_tables} article defines the Table API---the definition of a table and operations on
a table similarly as we did it in the chapter~\ref{ch:putting-it-all-together}.
\section*{Future Work}\label{sec:future-work}
\addcontentsline{toc}{section}{Future Work}
There are a lot of plans ahead of us regarding the implementation.
In the future, we would like to support more Pandas operations since Pandas is a large library with many useful
operations.
We would also like to add support for working with Indexes on Dataframes and Series.
The Pandalyzer so far supports a limited subset of Python language constructs.
We chose the most useful language constructs for data manipulation.
However, the plan for the future is to add support for other Python constructs such as Lambdas, Match statements,
For-loops or Classes.
As of now, Pandalyzer is only able to analyze one module, so extending it to be able to analyze multiple modules together
could be also a useful extension.
The Pandalyzer tool could be also extended for other related Python libraries.
A good example is the numpy library for working with vectors, matrices, etc.
The abstraction could be defined as the dimensions of the vectors and matrices.
Another good example of library worth including to the Pandalyzer is the matplotlib library for data visualization.
We could support reporting of what visualizations would be displayed to the user.
This could be implemented in a similar way as the analysis of the to\_csv function in Pandas.
Another field where the Pandalyzer could be extended is its integration to the IDEs such as PyCharm or VS Code.
This could be done using the Language Server Protocol.